
Tweaking Features of Ensembles of Machine-Learned Trees

This repository contains the source code associated with the method proposed by Tolomei et al. in their KDD 2017 research paper entitled "Interpretable Predictions of Tree-based Ensembles via Actionable Feature Tweaking" [more information available at: KDD 2017 website or arXiv.org]

NOTE: This work was developed by the authors of the paper while they were working at Yahoo Labs, London, UK. Although the proposed method is general and applicable to several different domains, the authors validate it on an online advertising use case. In particular, they demonstrate the ability of this approach to generate actionable recommendations for improving the quality of the ads served by Yahoo Gemini.
Due to confidentiality, all business-related details have been removed from this repository; it can nevertheless be used by other researchers working on related topics, such as ML model interpretability or adversarial ML, to name a few.

This repo is made up of 3 scripts, which are meant to be run in the following order:

  1. dump_paths.py
  2. tweak_features.py
  3. compute_tweaking_costs.py

1. dump_paths.py

The first stage of the pipeline is accomplished by this script, which can be invoked as follows:

> ./dump_paths.py ${PATH_TO_SERIALIZED_MODEL} ${PATH_TO_OUTPUT_FILE}

where:
${PATH_TO_SERIALIZED_MODEL} is the path to the (binary) file containing a serialized, trained binary classifier (i.e., a scikit-learn tree-based ensemble estimator).
${PATH_TO_OUTPUT_FILE} is the path where the output file will be stored. This file will contain a plain-text representation of all the positive paths, namely all the paths extracted from all the trees in the ensemble whose leaves are labeled as positive.
Each line of the output file is a positive path, which in turn is a sequence of boolean tests with the following format:

[tree_id, [(feature_id, op, value), ..., (feature_id, op, value)]]

where:

  • tree_id is the unique id of the tree within the ensemble.
  • feature_id is the unique id of the feature involved in the test.
  • op is the operator of the test: either '<=' or '>'.
  • value is the value against which the feature is tested.
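
For reference, extracting positive paths amounts to a depth-first walk over each tree's internal arrays. The snippet below is only an illustrative sketch (not the script's actual code), assuming the serialized model is a scikit-learn tree-based ensemble (e.g., a RandomForestClassifier) whose positive class sits at index 1:

import pickle

def extract_positive_paths(ensemble):
    """Collect, for each tree in a scikit-learn ensemble, the root-to-leaf
    paths ending in a positive leaf, as (tree_id, [(feature_id, op, value), ...])."""
    positive_paths = []
    for tree_id, estimator in enumerate(ensemble.estimators_):
        tree = estimator.tree_

        def visit(node, conditions):
            left, right = tree.children_left[node], tree.children_right[node]
            if left == -1:  # children_left == -1 marks a leaf
                # value[node][0] holds the per-class counts (or fractions,
                # depending on the scikit-learn version): positive if class 1 wins
                counts = tree.value[node][0]
                if counts[1] > counts[0]:
                    positive_paths.append((tree_id, list(conditions)))
                return
            feature_id, threshold = int(tree.feature[node]), float(tree.threshold[node])
            visit(left, conditions + [(feature_id, '<=', threshold)])
            visit(right, conditions + [(feature_id, '>', threshold)])

        visit(0, [])
    return positive_paths

# e.g., load a pickled model and print its positive paths, one per line:
# with open(path_to_serialized_model, "rb") as f:
#     model = pickle.load(f)
# for tree_id, conditions in extract_positive_paths(model):
#     print([tree_id, conditions])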

2. tweak_features.py

The second stage of the pipeline is actually the core of the entire process. The script can be run as follows:

> ./tweak_features.py ${PATH_TO_DATASET} ${PATH_TO_SERIALIZED_MODEL} ${PATH_TO_POSITIVE_PATHS_FILE} \
${PATH_TO_OUTPUT_DIRECTORY} [--epsilon=x]

where:
${PATH_TO_DATASET} is the path to the dataset file used to train the binary classifier. This is assumed to be either a .tsv or a .csv file, where each line is an instance and each field is a feature. The very last field is supposed to be the target label (named 'class').
${PATH_TO_SERIALIZED_MODEL} as above.
${PATH_TO_POSITIVE_PATHS_FILE} is the path to the output file generated by the previous script dump_paths.py at stage 1.
${PATH_TO_OUTPUT_DIRECTORY} is the path to the directory where the output file will be stored. This file will be called transformations_${EPSILON}.tsv, where ${EPSILON} is the value of the optional epsilon argument (epsilon=0.1 by default).
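
Conceptually, for every instance that the model labels as negative and for every positive path, the script builds a candidate x' that satisfies each test of the path by an epsilon margin, and keeps it only if the whole ensemble then predicts the positive class. The sketch below is a hypothetical illustration of that inner step, not the script's actual code:

import numpy as np

def epsilon_transform(x, path_conditions, epsilon=0.1):
    """Return a copy of instance x tweaked so that every (feature_id, op, value)
    test of a positive path is satisfied by a margin of epsilon."""
    x_prime = np.array(x, dtype=float)
    for feature_id, op, value in path_conditions:
        if op == '<=' and not x_prime[feature_id] <= value:
            x_prime[feature_id] = value - epsilon
        elif op == '>' and not x_prime[feature_id] > value:
            x_prime[feature_id] = value + epsilon
    return x_prime

# A candidate is kept only if the whole ensemble now flips its prediction:
# x_prime = epsilon_transform(x, conditions, epsilon=0.1)
# if model.predict(x_prime.reshape(1, -1))[0] == 1:
#     candidates.append(x_prime)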

3. compute_tweaking_costs.py

Once the set of candidate feature transformations (i.e., tweaks) has been successfully computed, we can measure the actual cost of those transformations. This can be achieved by running the following script:

> ./compute_tweaking_costs.py  ${PATH_TO_DATASET} \
${PATH_TO_TRANSFORMATIONS} \
${PATH_TO_OUTPUT_DIRECTORY} \
--costfuncs=unmatched_component_rate,euclidean_distance,cosine_distance,jaccard_distance,pearson_correlation_distance

where:
${PATH_TO_DATASET} is the path to the dataset file used to train the binary classifier, as above.
${PATH_TO_TRANSFORMATIONS} is the path to the file containing the candidate transformations obtained with step 2.
${PATH_TO_OUTPUT_DIRECTORY} is the path to the directory where the output files will be stored. Finally, the optional costfuncs argument takes a comma-separated list of functions used to compute the cost of each transformation (default: costfuncs=euclidean_distance).

The ultimate result of this step is the creation of 2 .tsv files inside ${PATH_TO_OUTPUT_DIRECTORY}, containing respectively the costs and the signs of each transformation.
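
The cost functions named above are standard distances between the original feature vector x and its tweaked version x'. As an illustration only (the repository may implement them differently, and the binarization used here for jaccard_distance is an assumption), they could be wired up with SciPy roughly as follows:

import numpy as np
from scipy.spatial import distance

def unmatched_component_rate(x, x_prime):
    """Fraction of features whose value differs between x and x'."""
    x, x_prime = np.asarray(x, dtype=float), np.asarray(x_prime, dtype=float)
    return float(np.mean(x != x_prime))

COST_FUNCS = {
    "unmatched_component_rate": unmatched_component_rate,
    "euclidean_distance": distance.euclidean,
    "cosine_distance": distance.cosine,
    # jaccard on the non-zero patterns of the two vectors (an assumption)
    "jaccard_distance": lambda x, xp: distance.jaccard(np.asarray(x) != 0,
                                                       np.asarray(xp) != 0),
    # scipy's correlation distance is 1 minus the Pearson correlation
    "pearson_correlation_distance": distance.correlation,
}

# costs = {name: func(x, x_prime) for name, func in COST_FUNCS.items()}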

Additional steps can be performed using those files as input, depending on the final task goal.

Citation

If you use this implementation in your work, please add a reference/citation to the paper. You can use the following BibTeX entry:

@inproceedings{DBLP:conf/kdd/TolomeiSHL17,
  author    = {Gabriele Tolomei and
               Fabrizio Silvestri and
               Andrew Haines and
               Mounia Lalmas},
  title     = {Interpretable Predictions of Tree-based Ensembles via Actionable Feature
               Tweaking},
  booktitle = {Proceedings of the 23rd {ACM} {SIGKDD} International Conference on
               Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13
               - 17, 2017},
  pages     = {465--474},
  year      = {2017},
  crossref  = {DBLP:conf/kdd/2017},
  url       = {http://doi.acm.org/10.1145/3097983.3098039},
  doi       = {10.1145/3097983.3098039},
  timestamp = {Tue, 15 Aug 2017 16:11:01 +0200},
  biburl    = {http://dblp.org/rec/bib/conf/kdd/TolomeiSHL17},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}


ml-feature-tweaking's Issues

ValueError when calling compute_transformation_cost function in cost computation step

It should be noted that each row of x_primes starts with id, tree_id, path_id, and path_length, so the feature values start at index 4. Currently, the code captures features from index 3, which causes "ValueError: operands could not be broadcast together with shapes (n,) (m,)".
So the following code:

tree_id = int(row[1][0])
path_id = int(row[1][1])
path_length = int(row[1][2])
x_prime = row[1][3:]

should be changed to:

tree_id = int(row[1][1])
path_id = int(row[1][2])
path_length = int(row[1][3])
x_prime = row[1][4:]
