rambatino / chaid

A python implementation of the common CHAID algorithm

License: Apache License 2.0

Python 100.00%
tree spss chaid marketing-statistics stats

chaid's Introduction

Chi-Squared Automatic Interaction Detection

This package provides a Python implementation of the Chi-Squared Automatic Interaction Detection (CHAID) decision tree, as well as exhaustive CHAID.

Installation

CHAID is distributed via pypi and can be installed like:

pip3 install CHAID

If you need support for graphs, optional packages must be installed together like:

pip install CHAID[graph]

If you need support to read in a .sav file (SPSS), you will also need to install optional packages like:

pip install CHAID[spss]

To install multiple optional packages, you can use a comma-separated list like:

pip install CHAID[graph,spss]

Alternatively, you can clone the repository and install via

pip install -e path/to/your/checkout

N.B. Although we've made some attempt at supporting Python 2.7, we don't encourage its use, as it has reached its End Of Life (EOL).

Creating a CHAID Tree

from CHAID import Tree, NominalColumn
import pandas as pd
import numpy as np


## create the data
ndarr = np.array(([1, 2, 3] * 5) + ([2, 2, 3] * 5)).reshape(10, 3)
df = pd.DataFrame(ndarr)
df.columns = ['a', 'b', 'c']
arr = np.array(([1] * 5) + ([2] * 5))
df['d'] = arr

>>> df
   a  b  c  d
0  1  2  3  1
1  1  2  3  1
2  1  2  3  1
3  1  2  3  1
4  1  2  3  1
5  2  2  3  2
6  2  2  3  2
7  2  2  3  2
8  2  2  3  2
9  2  2  3  2

## set the CHAID input parameters
independent_variable_columns = ['a', 'b', 'c']
dep_variable = 'd'

## create the Tree via pandas
tree = Tree.from_pandas_df(df, dict(zip(independent_variable_columns, ['nominal'] * 3)), dep_variable)
## create the same tree, but without pandas helper
tree = Tree.from_numpy(ndarr, arr, split_titles=['a', 'b', 'c'], min_child_node_size=5)
## create the same tree using the tree constructor
cols = [
  NominalColumn(ndarr[:,0], name='a'),
  NominalColumn(ndarr[:,1], name='b'),
  NominalColumn(ndarr[:,2], name='c')
]
tree = Tree(cols, NominalColumn(arr, name='d'), {'min_child_node_size': 5})

>>> tree.print_tree()
([], {1: 5, 2: 5}, ('a', p=0.001565402258, score=10.0, groups=[[1], [2]]), dof=1))
├── ([1], {1: 5, 2: 0}, <Invalid Chaid Split>)
└── ([2], {1: 0, 2: 5}, <Invalid Chaid Split>)

## to get a LibTree object,
>>> tree.to_tree()
<treelib.tree.Tree object at 0x114e2e350>

## the different nodes of the tree can be accessed like
first_node = tree.tree_store[0]

>>> first_node
([], {1: 5, 2: 5}, ('a', p=0.001565402258, score=10.0, groups=[[1], [2]]), dof=1))

## the properties of the node can be accessed like
>>> first_node.members
{1: 5, 2: 5}

## the properties of split can be accessed like
>>> first_node.split.p
0.001565402258002549
>>> first_node.split.score
10.0
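
As a sanity check on the numbers printed above, the root split on 'a' can be reproduced by hand. The sketch below assumes the score is a plain Pearson chi-squared statistic without continuity correction, which is consistent with the printed score=10.0 and p=0.001565402258 but is not taken from the library's source. The helper names are hypothetical.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic (no continuity correction) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    score = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            score += (observed - expected) ** 2 / expected
    return score


def p_value_one_dof(score):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(score / 2.0))


# Rows: a == 1, a == 2. Columns: d == 1, d == 2 (counts from the example data).
table = [[5, 0], [0, 5]]
score = chi_squared(table)
p = p_value_one_dof(score)
print(score, p)
```

The erfc shortcut only holds for one degree of freedom, which is the case for this 2x2 table.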

Creating a Tree using Bartlett's or Levene's Significance Test for Continuous Variables

When the dependent variable is continuous, the chi-squared test does not work because each distinct value occurs with very low frequency across subgroups. Because the F-test is very sensitive to deviations from normality, the normality of the dependent set is tested first: Bartlett's test for significance is used when the data is normally distributed (although the subgroups may not necessarily be so), and Levene's test is used when the data is non-normal.

from CHAID import Tree
import pandas as pd
import numpy as np

## create the data
ndarr = np.array(([1, 2, 3] * 5) + ([2, 2, 3] * 5)).reshape(10, 3)
df = pd.DataFrame(ndarr)
df.columns = ['a', 'b', 'c']
df['d'] = np.random.normal(300, 100, 10)
independent_variable_columns = ['a', 'b', 'c']
dep_variable = 'd'

>>> df
   a  b  c           d
0  1  2  3  262.816747
1  1  2  3  240.139085
2  1  2  3  204.224083
3  1  2  3  231.024752
4  1  2  3  263.176338
5  2  2  3  440.371621
6  2  2  3  221.762452
7  2  2  3  197.290268
8  2  2  3  275.925549
9  2  2  3  238.471850

## create the Tree via pandas
tree = Tree.from_pandas_df(df, dict(zip(independent_variable_columns, ['nominal'] * 3)), dep_variable, dep_variable_type='continuous')

## print the tree (though not enough power to split)
>>> tree.print_tree()
([], {'s.t.d': 86.562258585515579, 'mean': 297.52027436303212}, <Invalid Chaid Split>)

Parameters

  • df: Pandas DataFrame
  • i_variables: Dict<string, string>: Independent variable column names as keys and the type as the values (nominal or ordinal)
  • d_variable: String: Dependent variable column name
  • opts: {}:
    • alpha_merge: Float (default = 0.05): If the respective test for a given pair of predictor categories is not statistically significant as defined by an alpha_merge value, the least significant predictor categories are merged and the splitting of the node is attempted with the newly formed categories
    • max_depth: Integer (default = 2): The maximum depth of the tree
    • min_parent_node_size: Float (default = 30): The minimum number of respondents required for a split to occur on a particular node
    • min_child_node_size: Float (default = 0): If the split of a node results in a child node whose node size is less than min_child_node_size, child nodes that have too few cases (as with this minimum) will merge with the most similar child node as measured by the largest of the p-values. However, if the resulting number of child nodes is 1, the node will not be split.
    • max_splits: Integer or None (default = None): If specified, child nodes will continue to be merged until the number of splits at a single node is at max equal to max_splits. If not specified, this will be ignored.
    • split_threshold: Float (default = 0): The split threshold when bucketing root node surrogate splits
    • weight: String (default = None): The name of the weight column
    • dep_variable_type (default = categorical, other_options = continuous): Whether the dependent variable is 'categorical' or 'continuous'

Running from the Command Line

You can play around with the repo by cloning and running this from the command line:

python -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05

It calls the print_tree() method, which prints the tree to terminal:

([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, chi=365.886947811, groups=[['female'], ['male']]))
├── (['female'], {0: 127, 1: 339}, (embarked, p=9.17624191599e-07, chi=24.0936494474, groups=[['C', '<missing>'], ['Q', 'S']]))
│   ├── (['C', '<missing>'], {0: 11, 1: 104}, <Invalid Chaid Split>)
│   └── (['Q', 'S'], {0: 116, 1: 235}, <Invalid Chaid Split>)
└── (['male'], {0: 682, 1: 161}, (embarked, p=5.017855245e-05, chi=16.4413525404, groups=[['C'], ['Q', 'S']]))
    ├── (['C'], {0: 109, 1: 48}, <Invalid Chaid Split>)
    └── (['Q', 'S'], {0: 573, 1: 113}, <Invalid Chaid Split>)

or to test the continuous dependent variable case:

python -m CHAID tests/data/titanic.csv fare sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --dependent-variable-type continuous
([], {'s.t.d': 51.727293077231302, 'mean': 33.270043468296414}, (embarked, p=8.46027456424e-24, score=55.3476155546, groups=[['C'], ['Q', '<missing>'], ['S']]), dof=1308))
├── (['C'], {'s.t.d': 84.029951444532529, 'mean': 62.336267407407405}, (sex, p=0.0293299541476, score=4.7994643184, groups=[['female'], ['male']]), dof=269))
│   ├── (['female'], {'s.t.d': 90.687664523113241, 'mean': 81.12853982300885}, <Invalid Chaid Split>)
│   └── (['male'], {'s.t.d': 76.07029674707077, 'mean': 48.810619108280257}, <Invalid Chaid Split>)
├── (['Q', '<missing>'], {'s.t.d': 15.902095006812658, 'mean': 13.490467999999998}, <Invalid Chaid Split>)
└── (['S'], {'s.t.d': 37.066877311088625, 'mean': 27.388825164113786}, (sex, p=3.43875930713e-07, score=26.3745361415, groups=[['female'], ['male']]), dof=913))
    ├── (['female'], {'s.t.d': 48.971933059814894, 'mean': 39.339305154639177}, <Invalid Chaid Split>)
    └── (['male'], {'s.t.d': 28.242580058030033, 'mean': 21.806819261637241}, <Invalid Chaid Split>)

Note that the frequency of the dependent variable is replaced with the standard deviation and mean of the continuous set at each node and that any NaNs in the dependent set are automatically converted to 0.0.
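
The note above can be illustrated with a small sketch of how such a node summary might be computed. This is not the library's code: the function name is hypothetical, and the use of the population standard deviation (dividing by n) is an assumption.

```python
import math


def node_summary(values):
    """Return the mean and standard deviation of a node's dependent values.

    NaNs are replaced with 0.0 first, as described in the note above.
    """
    cleaned = [0.0 if math.isnan(v) else v for v in values]
    n = len(cleaned)
    mean = sum(cleaned) / n
    # Assumption: population standard deviation (divide by n, not n - 1).
    variance = sum((v - mean) ** 2 for v in cleaned) / n
    return {'s.t.d': math.sqrt(variance), 'mean': mean}


fares = [10.0, float('nan'), 20.0, 30.0]
summary = node_summary(fares)
print(summary)
```

Because NaNs become 0.0 rather than being dropped, missing values pull the mean down and widen the spread, so it may be worth cleaning the dependent column before building a continuous tree.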

Generating Splitting Rules

Append --rules to the CLI command, or call tree.classification_rules(node). Pass in a node to get the rules for that node; if node is None, all splitting rules are returned.

python -m CHAID tests/data/titanic.csv fare sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --dependent-variable-type continuous --rules
{'node': 2, 'rules': [{'variable': 'sex', 'data': ['female']}, {'variable': 'embarked', 'data': ['C']}]}
{'node': 3, 'rules': [{'variable': 'sex', 'data': ['male']}, {'variable': 'embarked', 'data': ['C']}]}
{'node': 4, 'rules': [{'variable': 'embarked', 'data': ['Q', '<missing>']}]}
{'node': 6, 'rules': [{'variable': 'sex', 'data': ['female']}, {'variable': 'embarked', 'data': ['S']}]}
{'node': 7, 'rules': [{'variable': 'sex', 'data': ['male']}, {'variable': 'embarked', 'data': ['S']}]}
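
Rules in the shape shown above can be applied to rows that were not part of the training data. The following is a minimal sketch, assuming each row is a plain dict and that a missing value is stored as None and corresponds to the '<missing>' label; both helpers are hypothetical and not part of the CHAID package.

```python
def matches(row, rules):
    """Return True if the row satisfies every rule in the list."""
    for rule in rules:
        value = row.get(rule['variable'])
        # Assumption: None stands for a missing value, labelled '<missing>' in the rules.
        label = '<missing>' if value is None else value
        if label not in rule['data']:
            return False
    return True


def assign_node(row, node_rules):
    """Return the id of the first terminal node whose rules match the row, else None."""
    for entry in node_rules:
        if matches(row, entry['rules']):
            return entry['node']
    return None


# Two of the rule dicts printed above.
node_rules = [
    {'node': 2, 'rules': [{'variable': 'sex', 'data': ['female']},
                          {'variable': 'embarked', 'data': ['C']}]},
    {'node': 4, 'rules': [{'variable': 'embarked', 'data': ['Q', '<missing>']}]},
]

print(assign_node({'sex': 'female', 'embarked': 'C'}, node_rules))
print(assign_node({'sex': 'male', 'embarked': None}, node_rules))
```

The same structure translates fairly directly into a chain of `variable IN (...)` conditions if the rules need to be handed to a database.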

Parameters

Run python -m CHAID -h to see a description of the command-line arguments.

How to Read the Tree

We'll start with a real world example using the titanic dataset.

First make sure to install all required packages:

python setup.py install && pip install ipdb

Run:

python -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05

after placing an ipdb statement on line 85 of __main__.py, as in the example below. The parameters set a maximum depth of 4 levels, a minimum parent node size of 2, and merge groups whenever the p-value from comparing them is greater than 0.05.

82        tree = Tree.from_pandas_df(data, independent_variables,
83                                   nspace.dependent_variable[0],
84                                   variable_types=types, **config)
---> 85   import ipdb; ipdb.set_trace()
86    
87        if nspace.classify:
88            predictions = pd.Series(tree.node_predictions())
89            predictions.name = 'node_id'
90            data = pd.concat([data, predictions], axis=1)
91            print(data.to_csv())
92        elif nspace.predict:

Running tree.print_tree() gives:

([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1))
├── (['female'], {0: 127, 1: 339}, (embarked, p=9.17624191599e-07, score=24.0936494474, groups=[['C', '<missing>'], ['Q', 'S']]), dof=1))
│   ├── (['C', '<missing>'], {0: 11, 1: 104}, <Invalid Chaid Split>)
│   └── (['Q', 'S'], {0: 116, 1: 235}, <Invalid Chaid Split>)
└── (['male'], {0: 682, 1: 161}, (embarked, p=5.017855245e-05, score=16.4413525404, groups=[['C'], ['Q', 'S']]), dof=1))
    ├── (['C'], {0: 109, 1: 48}, <Invalid Chaid Split>)
    └── (['Q', 'S'], {0: 573, 1: 113}, <Invalid Chaid Split>)

as shown above. The first line is the root node; all the data is present in this node. The vertical bars originating from a node represent paths to that node's children.

Running tree.tree_store will give you a list of all the nodes in the tree:

[
  ([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1)),
  (['female'], {0: 127, 1: 339}, (embarked, p=9.17624191599e-07, score=24.0936494474, groups=[['C', '<missing>'], ['Q', 'S']]), dof=1)),
  (['C', '<missing>'], {0: 11, 1: 104}, <Invalid Chaid Split>), (['Q', 'S'], {0: 116, 1: 235}, <Invalid Chaid Split>),
  (['male'], {0: 682, 1: 161}, (embarked, p=5.017855245e-05, score=16.4413525404, groups=[['C'], ['Q', 'S']]), dof=1)),
  (['C'], {0: 109, 1: 48}, <Invalid Chaid Split>), (['Q', 'S'], {0: 573, 1: 113}, <Invalid Chaid Split>)
]

So let's inspect the root node tree.tree_store[0]:

([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1))

Nodes have certain properties. Firstly, they show the values of the parent's split column that lead to this node (empty, '[]', for a root node). The second property, {0: 809, 1: 500}, shows the members of that node, i.e. the current frequency of each value of the dependent variable. In this case, these are all the answers in the 'survived' column, as that was the first column passed to the program on the command line (python -m CHAID tests/data/titanic.csv survived). The next property represents the split of the node: which column was chosen to make that split (in this case, sex), the p-value of the split, the chi-squared score, which values of sex form each new node and, finally, the degrees of freedom associated with that split (1, in this case).

These properties can be accessed:

ipdb> root_node = tree.tree_store[0]
ipdb> root_node.choices
[]
ipdb> root_node.members
{0: 809, 1: 500}
ipdb> root_node.split
(sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1)

The split variable can be further inspected:

ipdb> split = root_node.split
ipdb> split.column
'sex'
ipdb> split.p
1.4714531016922664e-81
ipdb> split.score
365.88694781112048
ipdb> split.dof
1
ipdb> split.groupings
"[['female'], ['male']]"

Therefore, in this example, the root node is split on the column 'sex' in the data, splitting up the females and males. These females and males each form a new node and, further down, the all-male and all-female nodes are each split on the column 'embarked' (although they needn't split on the same column). An <Invalid Chaid Split> is reached when either the node is pure (only one value of the dependent variable remains) or a terminating parameter is met (e.g. min node size or max depth; see the tree parameters above).
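
In the ipdb session above, split.groupings is printed in quotes, which suggests it is returned as a string rather than a list. Assuming that is the case, a small sketch of converting it back into nested lists with the standard library:

```python
import ast

# The string exactly as printed in the ipdb session above.
groupings = "[['female'], ['male']]"

# literal_eval only parses Python literals, so it is safer than eval here.
groups = ast.literal_eval(groupings)

for index, group in enumerate(groups):
    print(index, group)
```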

The conclusion drawn from this tree is: "Gender was the most important factor driving the survival of people on the Titanic, with females having a much higher likelihood of surviving (1 in the 'survived' column means they survived and 0 means they died). Of those females, those who embarked at Cherbourg (embarked value 'C', node 2) had a much higher likelihood of surviving."
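
The conclusion can be checked directly from the members of each node. The sketch below copies the counts from the printed tree and computes the share of survivors (key 1) per node; the dictionary keys and the helper name are only illustrative.

```python
# Member counts copied from the tree printed above (0 = died, 1 = survived).
nodes = {
    'all passengers': {0: 809, 1: 500},
    'female': {0: 127, 1: 339},
    'male': {0: 682, 1: 161},
    "female, embarked in ['C', '<missing>']": {0: 11, 1: 104},
    "female, embarked in ['Q', 'S']": {0: 116, 1: 235},
}


def survival_rate(members):
    """Fraction of a node's members with survived == 1."""
    return members[1] / (members[0] + members[1])


rates = {name: survival_rate(members) for name, members in nodes.items()}
for name, rate in rates.items():
    print('{}: {:.1%}'.format(name, rate))
```

Roughly 73% of females survived against about 19% of males, and the rate rises to about 90% in the female node containing 'C'.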

Exporting the tree

If you want to export the tree to a dot file, then use:

tree.to_tree()

This creates a treelib tree, which has a .to_graphviz() method.

In order to visually graph the CHAID tree, you'll need to install two more libraries that aren't distributed via PyPI:

  • graphviz - see its documentation for platform-specific installation instructions
  • orca - see its README.md for platform-specific installation instructions

You can export the tree to .gv and png using:

tree.render(path=None, view=False)

This saves it to the file specified by path, and opens it immediately when view=True.

This can also be triggered from the command line using --export or --export-path. The former causes it to be stored in a newly created trees folder and the latter specifies the location of the file. Both will trigger an auto-viewing of the tree. E.g:

python -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --export
python -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --export-path YOUR_PATH.gv

The output will look like: [image of the rendered tree]

Testing

CHAID uses pytest for its unit testing. The tests can be run from the root of a checkout with:

py.test

If you wish to run the unit tests across multiple Python versions to make sure your changes are compatible, run tox (or detox to run in parallel). You may need to run pip install tox tox-pyenv detox && brew install pyenv beforehand.

Caveats

  • Unlike SPSS, this library doesn't modify the data internally. This means that weight variables aren't rounded as they are in SPSS.
  • Every row is valid, even if all values are NaN or undefined. This differs from SPSS, which in the weighted case strips out rows in which all the independent variables are NaN.

Upcoming Features

  • Accuracy Estimation using Machine Learning techniques on the data
  • Binning of continuous independent variables

Generating the CHANGELOG.md

gem install github_changelog_generator && github_changelog_generator --exclude-labels maintenance,refactor,testing

chaid's People

Contributors

jihaekor, mjpieters, rambatino, xulaus

chaid's Issues

Error while importing in Jupyter Notebook

I have installed the latest version of the CHAID package. When I try to import CHAID into my notebook, I get a syntax error.

from CHAID import Tree

error message:

 File "C:\...\Local\Continuum\anaconda3\lib\site-packages\CHAID\graph.py", line 75
    file = 'C:\...\Documents\Python Scripts\CHAID\temp\' + ("%.20f" % time.time()).replace('.', '') + '.png'


    ^
SyntaxError: invalid syntax

Maybe an error in the doc?

Hi Rambatino,
thanks for your work.
Maybe this code:
tree = Tree.from_pandas_df(df, dict(zip(independent_variable_columns, 'nominal' * 3)), dep_variable)
should be:
tree = Tree.from_pandas_df(df, dict(zip(independent_variable_columns, ['nominal'] * 3)), dep_variable)
Regards

Continuous independent variables?

Hi,

Is there any way to specify that certain independent variables are continuous? For example, if I have a variable for age, I would like that to be continuous so that for the splits, it gives ranges of age instead of each individual age in that group.

Thanks!
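
Binning of continuous independent variables is listed under Upcoming Features, so one possible workaround is to bin the variable yourself before building the tree. A minimal sketch using only the standard library; the cut points and helper name are hypothetical, and passing the resulting codes as an 'ordinal' column is a suggestion rather than documented behaviour.

```python
from bisect import bisect_right


def bin_values(values, edges):
    """Map each value to the index of the bin it falls in.

    With edges [18, 35, 65] the bins are: below 18, 18-34, 35-64, 65 and over.
    """
    return [bisect_right(edges, value) for value in values]


ages = [4, 17, 18, 35, 64, 65, 80]
edges = [18, 35, 65]  # hypothetical cut points
codes = bin_values(ages, edges)
print(codes)
```

The integer codes preserve the ordering of the age ranges, so they can be added as a new column and declared as 'ordinal' in the i_variables dict.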

missing value in ordinal feature and bonferroni adjustment

Sorry to trouble you. First, thank you for your project of CHAID.

Have you read this PDF (http://www.gad-allah.com/MBA%202010%20Ain%20Shames%20Univesity/Statistics/spss13/Algorithms/TREE-CHAID.pdf)?

As the PDF describes:

  1. The adjusted p-value is calculated as the p-value times a Bonferroni multiplier.
  2. For ordinal predictors, the algorithm first generates the best set of categories using all non-missing information from the data. Next, the algorithm identifies the category that is most similar to the missing category. Finally, the algorithm decides whether to merge the missing category with its most similar category or keep the missing category as a separate category. Two p-values are calculated, one for the set of categories formed by merging the missing category with its most similar category, and the other for the set of categories formed by adding the missing category as a separate category. Take the action that gives the smallest p-value.

After reading the PDF, I am unsure where in your project structure to add the handling of ordinal features with missing values (as described in item 2).

Could you give some advice on how to implement this?
Would you consider addressing these two problems in a later version?

Thank you again for reading the issue.
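
For item 1, the multiplier is usually given in the CHAID literature as a count of the ways I original categories can be merged into r groups. The sketch below follows that standard formulation (a binomial coefficient for ordinal predictors, a Stirling number of the second kind for nominal ones) and ignores the extra missing-category case from item 2. It is an illustration only; the function names are hypothetical and nothing here is taken from this package's code.

```python
from math import comb, factorial


def bonferroni_ordinal(categories, groups):
    """Ways to cut `categories` ordered categories into `groups` contiguous groups."""
    return comb(categories - 1, groups - 1)


def bonferroni_nominal(categories, groups):
    """Ways to partition `categories` unordered categories into `groups` non-empty groups."""
    total = sum(
        (-1) ** v * comb(groups, v) * (groups - v) ** categories
        for v in range(groups + 1)
    )
    return total // factorial(groups)


def adjusted_p(p_value, multiplier):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * multiplier)


print(bonferroni_ordinal(4, 2), bonferroni_nominal(4, 2))
print(adjusted_p(0.01, bonferroni_nominal(3, 2)))
```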

Issue with members property

There is an issue with the members property, i.e.
node_obj = CHAID.Node()
print(node_obj) or node_obj.__repr__() fails because the members property expects dep_v to be not None.
If dep_v is mandatory then the default value should be removed in __init__.

Not being able to visualize the tree

Hi again,

Unfortunately, I cannot export the tree graph. Here is a reproducible example:

#library
from CHAID import Tree
X = np.array(([1, 2, 3] * 5) + ([2, 2, 3] * 5)).reshape(10, 3)
Y = arr = np.array(([1] * 5) + ([2] * 5))

#Start CHAID
model1 = Tree.from_numpy(X, Y, max_depth=4, min_child_node_size = 30)
model1.to_tree()

#visualize tree
#conda install -c plotly plotly-orca
import os
os.environ["PATH"] += os.pathsep + 'C:/ProgramData/Anaconda3/Library/bin/graphviz/'
model1.render(path=None, view=False)

This is the error I get:

model1.render(path=None, view=False)
Traceback (most recent call last):

  File "<ipython-input-15-795caf0b6056>", line 1, in <module>
    model1.render(path=None, view=False)

  File "C:\Users\diogo\anaconda3\lib\site-packages\CHAID\tree.py", line 291, in render
    Graph(self).render(path, view)

  File "C:\Users\diogo\anaconda3\lib\site-packages\CHAID\graph.py", line 77, in render
    g.render(path, view=view)

  File "C:\Users\diogo\anaconda3\lib\site-packages\graphviz\files.py", line 202, in render
    filepath = self.save(filename, directory)

  File "C:\Users\diogo\anaconda3\lib\site-packages\graphviz\files.py", line 166, in save
    with io.open(filepath, 'w', encoding=self.encoding) as fd:

OSError: [Errno 22] Invalid argument: 'trees\\2020-05-11 16:10:10.gv'

Could you help? thanks!
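
The default file name in the error contains colons (16:10:10), which Windows does not allow in file names, so a likely workaround is to pass an explicit path to render() instead of path=None. A minimal sketch of building a colon-free, timestamped path; the helper name is hypothetical, and the render call is left commented out because it needs the tree object from the example above.

```python
import os
from datetime import datetime


def safe_export_path(directory='trees', now=None):
    """Build a timestamped .gv path that avoids characters Windows rejects."""
    now = now or datetime.now()
    stamp = now.strftime('%Y-%m-%d_%H-%M-%S')  # dashes instead of colons
    return os.path.join(directory, stamp + '.gv')


path = safe_export_path(now=datetime(2020, 5, 11, 16, 10, 10))
print(path)

# With the tree from the example above:
# model1.render(path=path, view=False)
```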

Couple questions

Hi there, first of all thank you very much for this; it has helped me immensely in my work these days. I have had a couple of issues while running this though, and I'd like to ask for some guidance. Please bear with me, as I'm no expert in Python.

  1. I'm running Windows 64-bit + Anaconda, primarily using Python 3.5. However, for some reason I was unable to install this module using pip in that environment (a lot of problems getting savreaderwriter to install, apparently).

  2. I solved that by installing it in a 2.7 virtual environment I created, as that version of pip worked. This, however, has brought a full stack of issues (especially with encoding and ASCII handling) while running a continuous dependent variable tree. Solved mostly by casting to string (using str()) where necessary.

  3. Using the to_tree function to convert to treelib worked okay, though the same ASCII problems arose at some point, especially when creating the tags for each node. I got the plugin from treelib to convert_to_dot to work though, so I'm happy with that, even though the squares created in the .dot are a single line and therefore huge. Not sure how to get the tree properties (mean, std, etc.) to print on new lines or something. This also feels like a Python 2.7 problem but I don't know.

  4. Lastly, I'd like to ask about a way to suppress the invalid split messages, as I need to eventually pass the tree rules on to our database expert for him to apply the resulting tree (which is continuous) to our full dataset (which will have a lot more variables in all likelihood), and the invalid split message makes it harder to read.

Again, can't overstate how much this module has helped me, so thanks a lot!

Issue in running the "tree.render(path=None, view=False)"

Running the following code
tree.render(path=None, view=False)

throws this error.

File "C:\Users\ps\AppData\Local\Continuum\anaconda3\lib\site-packages\graphviz\files.py", line 166, in save
with io.open(filepath, 'w', encoding=self.encoding) as fd:

OSError: [Errno 22] Invalid argument: 'trees/2020-01-07 16:00:08.gv'

any solution on this please..?

Prediction

How do I predict using this class for data which has not been used for training? I see only model_predictions, which only retrieves predictions for the training data. Thanks a lot :)

Export error

Hi, while exporting the tree, I am getting the following error.
[Image: Capture]

Error while running tree.render()

I don't know but for some reason running

tree.render(path='some_path', view=False)

Gives me the below:-

"Image export using the kaleido engine requires the kaleido package, which can be installed using pip install kaleido"

I've installed it already and am still getting the same error. Any help?

how to get feature_importance

Hi, I have a question.
How can I get the importance of each independent variable?
I mean "feature_importance" in other ML libraries.

No Attribute "from_numpy"

Hi!

I think it is really great that you made this package. CHAID is a wonderful and useful technique. I tried to use it but I get the following error:

model1 = tree.from_numpy(X, Y)
Traceback (most recent call last):

  File "<ipython-input-65-2de6107b2484>", line 1, in <module>
    model1 = tree.from_numpy(X, Y)

AttributeError: module 'CHAID.tree' has no attribute 'from_numpy'

Do you know what I am doing wrong?

Thanks in advance!

any working example?

Hi,

I am interested in using your package for my research. Do you have any working tutorial document on how to use it on a dataset?

thank you

Issue exporting tree graph

Hello, I get the following error when i try to export the tree graph.

tree.render(path=None, view=False)

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-28-e8b3f601885a> in <module>
----> 1 tree.render(path=None, view=False)

C:\Users\Public\Anaconda3\lib\site-packages\CHAID\tree.py in render(self, path, view)
    289 
    290     def render(self, path=None, view=False):
--> 291         Graph(self).render(path, view)

C:\Users\Public\Anaconda3\lib\site-packages\CHAID\graph.py in render(self, path, view)
     75                     edge_label = "     ({})     \n ".format(', '.join(map(str, node.choices)))
     76                     g.edge(str(node.parent), str(node.node_id), xlabel=edge_label)
---> 77             g.render(path, view=view)
     78 
     79     def bar_chart(self, node):

C:\Users\Public\Anaconda3\lib\site-packages\graphviz\files.py in render(self, filename, directory, view, cleanup, format, renderer, formatter, quiet, quiet_view)
    236         relative to the DOT source file.
    237         """
--> 238         filepath = self.save(filename, directory)
    239 
    240         if format is None:

C:\Users\Public\Anaconda3\lib\site-packages\graphviz\files.py in save(self, filename, directory)
    198 
    199         log.debug('write %d bytes to %r', len(data), filepath)
--> 200         with io.open(filepath, 'w', encoding=self.encoding) as fd:
    201             fd.write(data)
    202             if not data.endswith(u'\n'):

OSError: [Errno 22] Invalid argument: 'trees\\2021-06-10 12:27:30.gv'

I tried one of the suggested workarounds but I get another error :


tree.render(path=os.getcwd()+"\\file.gv", view=False)

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
C:\Users\Public\Anaconda3\lib\site-packages\graphviz\backend.py in run(cmd, input, capture_output, check, encoding, quiet, **kwargs)
    163     try:
--> 164         proc = subprocess.Popen(cmd, startupinfo=get_startupinfo(), **kwargs)
    165     except OSError as e:

C:\Users\Public\Anaconda3\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
    853 
--> 854             self._execute_child(args, executable, preexec_fn, close_fds,
    855                                 pass_fds, cwd, env,

C:\Users\Public\Anaconda3\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
   1306             try:
-> 1307                 hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
   1308                                          # no special security

FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

ExecutableNotFound                        Traceback (most recent call last)
<ipython-input-33-da293a29d1d6> in <module>
----> 1 tree.render(path=os.getcwd()+"\\file.gv", view=False)

C:\Users\Public\Anaconda3\lib\site-packages\CHAID\tree.py in render(self, path, view)
    289 
    290     def render(self, path=None, view=False):
--> 291         Graph(self).render(path, view)

C:\Users\Public\Anaconda3\lib\site-packages\CHAID\graph.py in render(self, path, view)
     75                     edge_label = "     ({})     \n ".format(', '.join(map(str, node.choices)))
     76                     g.edge(str(node.parent), str(node.node_id), xlabel=edge_label)
---> 77             g.render(path, view=view)
     78 
     79     def bar_chart(self, node):

C:\Users\Public\Anaconda3\lib\site-packages\graphviz\files.py in render(self, filename, directory, view, cleanup, format, renderer, formatter, quiet, quiet_view)
    241             format = self._format
    242 
--> 243         rendered = backend.render(self._engine, format, filepath,
    244                                   renderer=renderer, formatter=formatter,
    245                                   quiet=quiet)

C:\Users\Public\Anaconda3\lib\site-packages\graphviz\backend.py in render(***failed resolving arguments***)
    221         cwd = None
    222 
--> 223     run(cmd, capture_output=True, cwd=cwd, check=True, quiet=quiet)
    224     return rendered
    225 

C:\Users\Public\Anaconda3\lib\site-packages\graphviz\backend.py in run(cmd, input, capture_output, check, encoding, quiet, **kwargs)
    165     except OSError as e:
    166         if e.errno == errno.ENOENT:
--> 167             raise ExecutableNotFound(cmd)
    168         else:
    169             raise

ExecutableNotFound: failed to execute ['dot', '-Kdot', '-Tpng', '-O', 'file.gv'], make sure the Graphviz executables are on your systems' PATH

Thanks in advance
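
The ExecutableNotFound message means the Graphviz dot binary could not be found on PATH; installing the graphviz Python package alone does not provide it. A quick way to check before calling render(), using only the standard library (the helper name is hypothetical):

```python
import shutil


def graphviz_available(executable='dot'):
    """Return True if the given Graphviz executable can be found on PATH."""
    return shutil.which(executable) is not None


if graphviz_available():
    print('Graphviz found at', shutil.which('dot'))
else:
    print('Graphviz "dot" is not on PATH; install Graphviz or add its bin folder to PATH.')
```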

Make predictions on testing set and calculate the propensity scores

Hi Authors,

Thank you for your great work and open source packages for CHAID implementation.

I am using your package in my project but can find little information about how to make predictions on a testing set from a trained CHAID model. Also, is it possible to calculate propensity scores with the current capabilities of this package? Looking forward to your reply, thanks :)

Risk broken, needs speccing

ipdb> create_tree(self, d_var, column_labels, value_labels).print_tree()
([], {'high': 5320, 'medium': 1923, 'low': 652}, (Please select the emotions that best show how you felt when watching the ad.You can select up to four emotions, or as few as one._Indifferent, p=0.0, score=1432.87251882, groups=[[u'No'], [u'Yes']]), dof=2))
├── ([u'No'], {'high': 4994, 'medium': 1172, 'low': 363}, (Please select the emotions that best show how you felt when watching the ad.You can select up to four emotions, or as few as one._Bored, p=6.6696881578e-272, score=1248.81114438, groups=[[u'No'], [u'Yes']]), dof=2))
│   ├── ([u'No'], {'high': 4931, 'medium': 993, 'low': 206}, <Invalid Chaid Split> - the max depth has been reached)
│   └── ([u'Yes'], {'high': 63, 'medium': 179, 'low': 157}, <Invalid Chaid Split> - the max depth has been reached)
└── ([u'Yes'], {'high': 326, 'medium': 751, 'low': 289}, (Please select the emotions that best show how you felt when watching the ad.You can select up to four emotions, or as few as one._Bored, p=1.99408063179e-18, score=81.5126971319, groups=[[u'No'], [u'Yes']]), dof=2))
    ├── ([u'No'], {'high': 298, 'medium': 558, 'low': 174}, <Invalid Chaid Split> - the max depth has been reached)
    └── ([u'Yes'], {'high': 28, 'medium': 193, 'low': 115}, <Invalid Chaid Split> - the max depth has been reached)

ipdb> create_tree(self, d_var, column_labels, value_labels).risk()
*** ValueError: could not convert string to float: high

Continuous Column name arg & Plotting Tree

Hi,

Great package, seems to be completely unique in the Python ecosystem, oddly enough. Really grateful for the contribution.

I'm running into two small problems.

First, the ContinuousColumn doesn't have a name arg. This breaks the API and also prevents splits from having names.

Second, I'm trying to visualize the tree and having trouble.

I've tried this: Tree(independent_cols, dependent_col).to_tree().to_graphviz() returns None, so I'm not sure how to plot it.

And this: Tree(independent_cols, dependent_col).render(path=name, view=True), and variations of it, raise this error:

~/Library/Caches/pypoetry/virtualenvs/segmentation-creation-Qfvb1sdR-py3.8/lib/python3.8/site-packages/CHAID/tree.py in render(self, path, view)
    289 
    290     def render(self, path=None, view=False):
--> 291         Graph(self).render(path, view)

~/Library/Caches/pypoetry/virtualenvs/segmentation-creation-Qfvb1sdR-py3.8/lib/python3.8/site-packages/CHAID/graph.py in render(self, path, view)
     70             )
     71             for node in self.tree:
---> 72                 image = self.bar_chart(node)
     73                 g.node(str(node.node_id), image=image)
     74                 if node.parent is not None:

~/Library/Caches/pypoetry/virtualenvs/segmentation-creation-Qfvb1sdR-py3.8/lib/python3.8/site-packages/CHAID/graph.py in bar_chart(self, node)
     94 
     95         filename = os.path.join(self.tempdir, "node-{}.png".format(node.node_id))
---> 96         pio.write_image(fig, file=filename, format="png")
     97         return filename
     98 

~/Library/Caches/pypoetry/virtualenvs/segmentation-creation-Qfvb1sdR-py3.8/lib/python3.8/site-packages/plotly/io/_kaleido.py in write_image(fig, file, format, scale, width, height, validate, engine)
    266     # -------------
    267     # Do this first so we don't create a file if image conversion fails
--> 268     img_data = to_image(
    269         fig,
    270         format=format,

~/Library/Caches/pypoetry/virtualenvs/segmentation-creation-Qfvb1sdR-py3.8/lib/python3.8/site-packages/plotly/io/_kaleido.py in to_image(fig, format, width, height, scale, validate, engine)
    143     # ---------------
    144     fig_dict = validate_coerce_fig_to_dict(fig, validate)
--> 145     img_bytes = scope.transform(
    146         fig_dict, format=format, width=width, height=height, scale=scale
    147     )

~/Library/Caches/pypoetry/virtualenvs/segmentation-creation-Qfvb1sdR-py3.8/lib/python3.8/site-packages/kaleido/scopes/plotly.py in transform(self, figure, format, width, height, scale)
    159         if code != 0:
    160             message = response.get("message", None)
--> 161             raise ValueError(
    162                 "Transform failed with error code {code}: {message}".format(
    163                     code=code, message=message

ValueError: Transform failed with error code 525: Failed to execute 'getPointAtLength' on 'SVGGeometryElement': The element's path is empty.

Any guidance would be much appreciated.

Thanks!

Exhaustive chaid

What should be changed in your code to get an "exhaustive CHAID" version of this algorithm?

Here is the explanation: ftp://ftp.software.ibm.com/software/analytics/spss/support/Stats/Docs/Statistics/Algorithms/13.0/TREE-CHAID.pdf

Warnings have appeared when running specs locally. The bit rot is real

tests/test_tree.py::test_zero_subbed_weighted_ndarry
  /Users/mark/zappi/CHAID/CHAID/stats.py:15: RuntimeWarning:

  invalid value encountered in true_divide

  /Users/mark/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py:4544: RuntimeWarning:

  invalid value encountered in true_divide


tests/test_tree.py::TestBugFixes::test_incorrect_weighted_counts
  /Users/mark/zappi/CHAID/CHAID/stats.py:15: RuntimeWarning:

  invalid value encountered in true_divide


tests/test_tree.py::TestStoppingRules::test_min_child_node_size_does_not_stop_for_weighted_case
  /Users/mark/zappi/CHAID/CHAID/stats.py:15: RuntimeWarning:

  invalid value encountered in true_divide


tests/test_tree.py::TestStoppingRules::test_min_child_node_size_does_stop_for_weighted_case
  /Users/mark/zappi/CHAID/CHAID/stats.py:15: RuntimeWarning:

  invalid value encountered in true_divide


-- Docs: http://doc.pytest.org/en/latest/warnings.html

Missing dependencies

To create a graph, plotly-orca needs to be installed. This should at least be mentioned in the documentation.

This involves more than what Python setuptools dependencies can handle, but perhaps the psutil and requests dependencies (which can be handled as install_requires or extras_require entries) should just be added to setup.py?

Output Tree as pandas DataFrame

Is there any way I can output the Tree as a pandas DataFrame? Just wondering if there is a function to do this, or if I will need to write my own code to do that.
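
One possible approach, in case no built-in helper exists, is to flatten the nodes into one row per node and hand the rows to pandas. The sketch below uses a hypothetical `Node` stand-in rather than the package's own class; the attribute names `node_id`, `parent` and `members` mirror names that appear elsewhere on this page, while `choices` and the helper `tree_to_rows` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    # Hypothetical stand-in for a tree node; not the package's class
    node_id: int
    parent: Optional[int]
    choices: list
    members: dict = field(default_factory=dict)


def tree_to_rows(nodes):
    """Flatten an iterable of nodes into one plain dict per node."""
    rows = []
    for node in nodes:
        row = {'node_id': node.node_id, 'parent': node.parent,
               'choices': node.choices}
        # one column per dependent-variable category
        for category, count in node.members.items():
            row['n_{}'.format(category)] = count
        rows.append(row)
    return rows


nodes = [
    Node(0, None, [], {1: 5, 2: 5}),
    Node(1, 0, [1], {1: 5, 2: 0}),
    Node(2, 0, [2], {1: 0, 2: 5}),
]
rows = tree_to_rows(nodes)
# pandas.DataFrame(rows) would turn these rows into a table
```

A list of dicts is one of the inputs `pandas.DataFrame` accepts, so the conversion itself needs no pandas-specific code.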

Slow

This algorithm runs terribly slowly and really needs binning of scale inputs. I love the algorithm, but I can't really use it as it is. It needs more optimization overall.
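
Until something like this is built in, scale inputs can be binned before the tree is built. The following is a minimal sketch of equal-frequency binning using only the standard library; `quantile_bin` is a hypothetical helper, and the ages are made-up data.

```python
import bisect
import statistics


def quantile_bin(values, n_bins=4):
    """Map each value to an equal-frequency bin index in range(n_bins)."""
    # n_bins - 1 cut points splitting the data into n_bins groups
    cuts = statistics.quantiles(values, n=n_bins)
    return [bisect.bisect_right(cuts, v) for v in values]


ages = [18, 22, 25, 31, 38, 44, 52, 60, 67, 75, 81, 90]
binned = quantile_bin(ages, n_bins=4)
print(binned)
```

The binned column has only `n_bins` distinct values, so far fewer category merges need to be evaluated than for the raw scale variable.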

Basic clarifications

I am new to the CHAID algorithm in general and also to this package, so could you please clarify what certain things mean? From what I understand so far, the dataset is a numpy array containing numbers, with several independent variables and one dependent variable. Then, constructing a tree gives us the CHAID output. Is this correct so far?
For the actual tree, could someone help me understand what this output means in the context of the example in the README?

([], {1: 5, 2: 5}, ('a', p=0.001565402258, score=10.0, groups=[[1], [2]]), dof=1))
├── ([1], {1: 5, 2: 0}, <Invalid Chaid Split>)
└── ([2], {1: 0, 2: 5}, <Invalid Chaid Split>)

(I know these are all simple questions but I am new to all of this so would like to make sure I understand the basics)

Thank you
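
The `score`, `p` and `dof` values in the root node are consistent with a Pearson chi-square test on the cross-tabulation of `a` against the dependent variable `d`. The sketch below reproduces them with the standard library only. It assumes no continuity correction is applied, and it relies on the fact that for one degree of freedom the chi-square tail probability equals `erfc(sqrt(x / 2))`; `pearson_chi_square` is a hypothetical helper, not a function from the package.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    score = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            score += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return score, dof


# rows: groups of 'a' ([1] and [2]); columns: counts of d == 1 and d == 2
table = [[5, 0], [0, 5]]
score, dof = pearson_chi_square(table)
# for one degree of freedom the chi-square tail probability is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(score / 2))
print(score, dof, p_value)
```

Read this way, each printed node appears to hold three things: the categories of the parent's split that lead to the node, the counts of each dependent-variable value among its members, and the split chosen at that node. `<Invalid Chaid Split>` would then mark a node that is not split any further.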

Feature selection difference with SPSS

I have a question regarding the chaid implementation in this package and SPSS.

I compared the output from this package with the output from SPSS. The latter appears to have some sort of feature selection built in, because it uses only some of the features from the data set, whereas this package uses many more.

When I limit the Python test to the same features SPSS selected, it returns the same tree/rules.

Do you know how or why this happens? My need is to try to recreate the SPSS version as much as possible.

Thank you!

model_predictions fails with categorical dependent variables

If the dependent variable is categorical and its categories are strings, the method model_predictions fails. The problem is that the pred array is initialized as:

pred = np.zeros(self.data_size)

and that enforces predictions to be numerical. In order to solve that, the model_predictions could be rewritten to something like the following:

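def model_predictions(self):
    # a plain list can hold string categories as well as numbers
    pred = [None] * self.data_size
    for node in self:
        if node.is_terminal:
            # predict the most frequent category among the node's members
            max_val = max(node.members, key=node.members.get)
            for i in node.indices:
                pred[i] = max_val
    return pred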

Best regards

test model performance on validation dataset

Hi, thank you very much for the implementation!

I plan to split my dataset into development and validation set, then I would like to build the CHAID tree on development set, and test its performance on validation set.

It would be really nice if there could be a function that outputs the rules of segmentation and/or applies the rules (i.e. does model prediction) on the validation set.

Thanks a lot for your help!
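
As an illustration of what applying segmentation rules to a validation set could look like, here is a minimal sketch in plain Python. The rule format (a dict of allowed values per variable plus a predicted label), the helper `predict`, the variable names and the data are all made up for this example; they are not the package's output format.

```python
def predict(rows, rules, default=None):
    """Assign each row the prediction of the first rule whose conditions all match."""
    predictions = []
    for row in rows:
        label = default
        for rule in rules:
            if all(row[var] in allowed for var, allowed in rule['conditions'].items()):
                label = rule['prediction']
                break
        predictions.append(label)
    return predictions


# Hypothetical rules for a two-level tree, one rule per terminal node
rules = [
    {'conditions': {'indifferent': ['No'], 'bored': ['No']}, 'prediction': 'high'},
    {'conditions': {'indifferent': ['No'], 'bored': ['Yes']}, 'prediction': 'medium'},
    {'conditions': {'indifferent': ['Yes']}, 'prediction': 'medium'},
]
validation = [
    {'indifferent': 'No', 'bored': 'No', 'actual': 'high'},
    {'indifferent': 'No', 'bored': 'Yes', 'actual': 'low'},
    {'indifferent': 'Yes', 'bored': 'No', 'actual': 'medium'},
]
predicted = predict(validation, rules)
accuracy = sum(p == r['actual'] for p, r in zip(predicted, validation)) / len(validation)
print(predicted, accuracy)
```

Rows that match no rule receive `default`, which makes categories unseen during development easy to spot in the validation output.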

I got the path to work, but now I'm back to the initial problem I had with the invalid argument in trees. -_-

I tried to specify my path to somewhere on my computer, then ran into an access problem, which might not be worth investigating since I am just an intern...

Thank you for your responses! The CHAID package works really well for me otherwise and has been helpful.

Originally posted by @soonmi-m in #116 (comment)

Nominal Column is not defined

In the README.md code snippet (Creating a CHAID Tree), NominalColumn is not defined.

So running the code snippet produces that same error.

min_child_node_size defaults to None

min_child_node_size defaults to 0 so that it doesn't break current API at version 2. This is inconsistent with min_parent_node_size and will thus have to be changed come version 3

tree.render() throws error while working with data bricks azure

I installed both graphviz and orca on Databricks, but I still get the following error:

issue 1 :
Transform failed with error code 525: Failed to execute 'getPointAtLength' on 'SVGGeometryElement': The element's path is empty.

Complete error here :

ValueError                                Traceback (most recent call last)
in
     14
     15 ## print the tree (though not enough power to split)
---> 16 tree.render(path = None, view = True)

/local_disk0/.ephemeral_nfs/envs/pythonEnv-7a53298c-35a8-4523-9bec-07dd3847e73a/lib/python3.8/site-packages/CHAID/tree.py in render(self, path, view)
    289
    290     def render(self, path=None, view=False):
--> 291         Graph(self).render(path, view)

/local_disk0/.ephemeral_nfs/envs/pythonEnv-7a53298c-35a8-4523-9bec-07dd3847e73a/lib/python3.8/site-packages/CHAID/graph.py in render(self, path, view)
     70             )
     71             for node in self.tree:
---> 72                 image = self.bar_chart(node)
     73                 g.node(str(node.node_id), image=image)
     74                 if node.parent is not None:

/local_disk0/.ephemeral_nfs/envs/pythonEnv-7a53298c-35a8-4523-9bec-07dd3847e73a/lib/python3.8/site-packages/CHAID/graph.py in bar_chart(self, node)
     94
     95         filename = os.path.join(self.tempdir, "node-{}.png".format(node.node_id))
---> 96         pio.write_image(fig, file=filename, format="png")
     97         return filename
     98

/databricks/python/lib/python3.8/site-packages/plotly/io/_kaleido.py in write_image(fig, file, format, scale, width, height, validate, engine)
    266     # -------------
    267     # Do this first so we don't create a file if image conversion fails
--> 268     img_data = to_image(
    269         fig,
    270         format=format,

/databricks/python/lib/python3.8/site-packages/plotly/io/_kaleido.py in to_image(fig, format, width, height, scale, validate, engine)
    143     # ---------------
    144     fig_dict = validate_coerce_fig_to_dict(fig, validate)
--> 145     img_bytes = scope.transform(
    146         fig_dict, format=format, width=width, height=height, scale=scale
    147     )

/local_disk0/.ephemeral_nfs/envs/pythonEnv-7a53298c-35a8-4523-9bec-07dd3847e73a/lib/python3.8/site-packages/kaleido/scopes/plotly.py in transform(self, figure, format, width, height, scale)
    159         if code != 0:
    160             message = response.get("message", None)
--> 161             raise ValueError(
    162                 "Transform failed with error code {code}: {message}".format(
    163                     code=code, message=message

ValueError: Transform failed with error code 525: Failed to execute 'getPointAtLength' on 'SVGGeometryElement': The element's path is empty.

issue 2:
Is there a way to get the sequence of edges, together with the filter condition on each, as an array ordered from the root to the final leaf?
For example:
node A, {A == 0}
node A, {A == 0} & node B, {B == 0}
node A, {A == 0} & node B, {B == 0} & node C, {C == 0}
I want to use this to calculate how the amount of data changes after each node.

Thanks in advance!!
Tarun
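
For issue 2, one way to think about it is to walk from each node up through its parents and collect the condition on every edge along the way. The sketch below does this on a hypothetical `Node` stand-in, not the package's class; the `condition` and `size` fields and the `path_conditions` helper are assumptions made for illustration, and the sizes are made-up numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical stand-in: each node stores its parent's id and the
    # condition on the edge leading to it
    node_id: int
    parent: Optional[int]
    condition: Optional[str]
    size: int


def path_conditions(nodes, node_id):
    """Return the edge conditions from the root down to the given node."""
    by_id = {n.node_id: n for n in nodes}
    conditions = []
    current = by_id[node_id]
    while current.parent is not None:
        conditions.append(current.condition)
        current = by_id[current.parent]
    return list(reversed(conditions))


nodes = [
    Node(0, None, None, 100),
    Node(1, 0, 'A == 0', 60),
    Node(2, 1, 'B == 0', 25),
    Node(3, 2, 'C == 0', 10),
]
for node in nodes[1:]:
    print(' & '.join(path_conditions(nodes, node.node_id)), '->', node.size, 'rows')
```

Comparing `size` along one path shows how many rows remain after each successive filter.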

Why isn't there a predict function ?

I split the data into train and test sets. I have run CHAID on the train data, and now I would like to use it to predict the output for the test data. I want to do this for classification.

CHAID tree to json

Hi,
I am trying to get the json string corresponding to the CHAID tree. I tried converting the CHAID tree to a treelib tree, and then the treelib package has a method to_json().
However, when I call to_json() on the treelib tree, I get the error:

File "C:\Program Files\Anaconda3\lib\site-packages\treelib\tree.py", line 226, in to_dict
    tree_dict = {ntag: {"children": []}}
  File "C:\Program Files\Anaconda3\lib\site-packages\CHAID\node.py", line 42, in __hash__
    return hash(self.__dict__)
TypeError: unhashable type: 'dict'

Is there any way I can solve this problem? Any tips would be appreciated. Thanks!
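
The traceback shows `__hash__` returning `hash(self.__dict__)`, and a Python dict is not hashable, so any code path that hashes a node will raise this error. One workaround that avoids hashing altogether is to build a plain nested dict by hand and pass it to `json.dumps`. The sketch below uses hypothetical dict-based nodes rather than the package's node class, and `node_to_dict` is a made-up helper.

```python
import json


def node_to_dict(node, children_of):
    """Recursively convert a node and its descendants into JSON-friendly dicts."""
    return {
        'id': node['id'],
        # JSON object keys must be strings, so stringify the category labels
        'members': {str(k): v for k, v in node['members'].items()},
        'children': [node_to_dict(child, children_of)
                     for child in children_of.get(node['id'], [])],
    }


# Hypothetical flat node list; 'parent' links each node to the one above it
nodes = [
    {'id': 0, 'parent': None, 'members': {1: 5, 2: 5}},
    {'id': 1, 'parent': 0, 'members': {1: 5, 2: 0}},
    {'id': 2, 'parent': 0, 'members': {1: 0, 2: 5}},
]
children_of = {}
for node in nodes:
    children_of.setdefault(node['parent'], []).append(node)

tree_json = json.dumps(node_to_dict(nodes[0], children_of))
print(tree_json)
```

Only ids are used as dictionary keys here, so the nodes themselves never need to be hashable.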

Creating tree different from README

Hi,

I am running the example for creating a CHAID tree in the README. When I get to the line tree.print_tree(), I get: ([], {1: 5, 2: 5}, ).
This is different from the result in the README, so I was confused why this is happening.
Also, could you give an overview of what the attributes are in the Tree object so I can better understand the output?

Thanks!

Documentation for library

Hello team

Can somebody create documentation for the library, so that it is easier to understand? Maybe on Read the Docs.

Prior Nodes

Is there a way to get the prior (parent) node for each node in the --rules output?
