
Multiple instance learning library for Python

License: MIT License

Topics: mil, multiple-instance-learning, python, mnist-bags-dataset, apr, deep-mil, miles

mil's Introduction

mil: multiple instance learning library for Python

While working on a research problem, I came across the multiple instance learning (MIL) framework, which I found quite interesting and unique. After carefully reviewing the literature, I decided to try a few of the algorithms on the problem I was working on, but surprisingly, there was no standard, easy-to-use, and up-to-date MIL library for any programming language. So... here we are.
The mil library aims to enable reproducible and productive research using the MIL framework.


Table of Contents

  • Installation
  • Features
  • Usage
  • Contributing
  • To-do-list
  • License


Installation

Use the package manager pip to install mil.

$ pip install mil

Requirements

The packages required by the mil library are: numpy, scikit-learn, scipy, and tensorflow (or tensorflow-gpu). Installing mil with the package manager does not install these dependencies, so install them manually if they are not already present.

$ pip install numpy
$ pip install scikit-learn
$ pip install scipy
$ pip install tensorflow

Features

The overall implementation tries to be as user-friendly as possible, which is why most of it is built on top of sklearn and tensorflow.keras.

data

Several well-known datasets from the multiple instance learning literature are included in the library. For each dataset, a train/test split has been fixed for reproducibility. The API is similar to tensorflow datasets, so experiments can be created quickly and easily.

# importing all the datasets modules
from mil.data.datasets import musk1, musk2, protein, elephant, corel_dogs, \
                              ucsb_breast_cancer, web_recommendation_1, birds_brown_creeper, \
                              mnist_bags
# load the musk1 dataset
(bags_train, y_train), (bags_test, y_test) = musk1.load()

In addition, the mnist_bags dataset has been created. The main reason for creating it is to have a good benchmark for evaluating instance predictions: if we can classify a bag correctly, can we detect which instance(s) caused that classification? The mnist_bags dataset covers 3 different problems, each with its own dataset. https://drive.google.com/drive/folders/1_F9qAIrOUQBPTBwSQrIrn0ZzBSkMBfrK?usp=sharing

  1. The bag 'b' is positive if the instance '7' is contained in 'b'
# importing all the datasets modules
from mil.data.datasets import mnist_bags

# load the mnist_bags
(bags_train, y_train), (bags_test, y_test) = mnist_bags.load()
  2. The bag 'b' is positive if the instances '2' and '3' are contained in 'b'
# importing all the datasets modules
from mil.data.datasets import mnist_bags

# load the mnist_bags 2 and 3
(bags_train, y_train), (bags_test, y_test) = mnist_bags.load_2_and_3()
  3. The bag 'b' is positive if the instances '4' and '2' appear in consecutive positions in 'b'
# importing all the datasets modules
from mil.data.datasets import mnist_bags

# load the mnist_bags 4 2
(bags_train, y_train), (bags_test, y_test) = mnist_bags.load_42()
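
Once loaded, a quick sanity check can confirm the bag structure (a minimal sketch, assuming each bag is a list of instance feature vectors, as in the other mil datasets):

# inspect the loaded data; bag structure assumed as described above
print(len(bags_train), len(y_train))   # number of training bags and labels
print(len(bags_train[0]))              # number of instances in the first bag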

bag_representation

In multiple instance learning, bag representation is the technique of obtaining a single vector that represents the whole bag. The classes implemented in mil.bag_representation inherit from the BagRepresentation base class, a wrapper around the sklearn transformer interface, and have to implement the fit and transform methods; a usage sketch follows the imports below.

  • MILESMapping
  • DiscriminativeMapping
  • ArithmeticMeanBagRepresentation
  • MedianBagRepresentation
  • GeometricMeanBagRepresentation
  • MinBagRepresentation
  • MaxBagRepresentation
  • MeanMinMaxBagRepresentation
from mil.bag_representation import MILESMapping, DiscriminativeMapping, ArithmeticMeanBagRepresentation, \
                                   MedianBagRepresentation, GeometricMeanBagRepresentation, MinBagRepresentation, \
                                   MaxBagRepresentation, MeanMinMaxBagRepresentation
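
As a usage sketch, assuming these transformers follow the sklearn fit/transform convention described above:

# map each bag to a single feature vector via the per-bag arithmetic mean
rep = ArithmeticMeanBagRepresentation()
rep.fit(bags_train, y_train)         # labels may be unused by simple statistics
X_train = rep.transform(bags_train)  # one row per bag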

dimensionality_reduction

A wrapper to sklearn.decomposition and sklearn.feature_selection.

# for example import sklearn PCA
from mil.dimensionality_reduction import PCA
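
Since this module wraps sklearn, the standard sklearn API applies. For example, reducing the bag-level features produced above:

# reduce bag-level features to 10 components (usual sklearn fit_transform API)
pca = PCA(n_components=10)
X_train_reduced = pca.fit_transform(X_train)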

metrics

Includes a manager to handle all the metrics, some custom metrics, and a wrapper of tensorflow.keras.metrics. Custom metrics have to inherit from the Metrics base class and implement the methods update_state, result, and reset_states.

# importing a custom metric
from mil.metrics import Specificity

# importing a keras metric
from mil.metrics import AUC
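
A minimal sketch of a custom metric following the update_state / result / reset_states contract described above. The base-class import path and the update_state signature are assumptions made for illustration:

# hypothetical custom metric: fraction of bags predicted positive
import numpy as np
from mil.metrics import Metrics  # assumed location of the base class

class PositiveRate(Metrics):
    def __init__(self):
        self.positives = 0
        self.total = 0

    def update_state(self, y_true, y_pred):  # signature assumed
        y_pred = np.asarray(y_pred)
        self.positives += int((y_pred == 1).sum())
        self.total += len(y_pred)

    def result(self):
        return self.positives / self.total if self.total else 0.0

    def reset_states(self):
        self.positives = 0
        self.total = 0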

models

It contains all the end-to-end models. All the models implement a sklearn-like interface with fit, predict, and, when the method allows it, get_positive_instances; a short sketch follows the imports below.

  • MILES
  • APR
  • AttentionDeepPoolingMil
# importing mil models
from mil.models import APR, AttentionDeepPoolingMil, MILES

# importing sklearn models
from mil.models import RandomForestClassifier, SVC

It is also a wrapper to sklearn.svm, sklearn.ensemble, sklearn.linear_model, and sklearn.neighbors.
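
As a sketch of the sklearn-like interface, assuming models accept bags directly as described above:

# fit an end-to-end MIL model on bags (default hyperparameters assumed)
model = APR()
model.fit(bags_train, y_train)
y_pred = model.predict(bags_test)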

preprocessing

It contains a few transformers to normalize and standardize bags given as lists, and is also a wrapper to sklearn.preprocessing.

# standardize bags given as lists
from mil.preprocessing import StandarizerBagsList

utils

It contains a few utility functions, such as bags2instances, padding, and a progress bar.

# for example importing bags2instances function
from mil.utils.utils import bags2instances
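
For example, flattening bags into a single instance list (a sketch, assuming bags2instances takes a list of bags and returns their concatenated instances):

# flatten a list of bags into one list of instances
instances = bags2instances(bags_train)
print(len(instances))  # total number of instances across all bags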

validators

A wrapper to sklearn.model_selection. Includes all the validation strategies to be used in the training process.

# for example importing sklearn KFold
from mil.validators import KFold

valid = KFold(n_splits=10, shuffle=True)

trainer

The trainer is the central part of the library: it allows you to train and evaluate models in a simple and intuitive way. It has 4 principal methods.

  1. prepare(model, preprocess_pipeline=[], metrics=[])
    Analogous to the 'compile' method of Keras models: it prepares the training and evaluation routine. The 'model' parameter accepts any of the mil.models objects. The 'preprocess_pipeline' parameter is a list of the transforming operations to apply before feeding the data to the 'model' object; it accepts any sklearn transformer. The 'metrics' parameter takes strings naming common metrics, or callable modules from mil.metrics.

  2. fit(X_train, y_train, X_val=None, y_val=None, groups=None, validation_strategy=None, sample_weights=None, verbose=1)
    This is the method that trains the model. It also handles a sample_weights parameter, mil.validators objects, and custom validation splits.

  3. predict(X)
    This method is used to get the predictions of the model.

  4. get_positive_instances(X)
    When the model implements this method, it returns the positive instances from the bags.


Usage

# importing dataset
from mil.data.datasets import musk1
# importing bag_representation
from mil.bag_representation import MILESMapping
# importing validation strategy
from mil.validators import LeaveOneOut
# importing final model, which in this case is the SVC classifier from sklearn
from mil.models import SVC
# importing trainer
from mil.trainer import Trainer
# importing preprocessing 
from mil.preprocessing import StandarizerBagsList
# importing metrics, which in this case are from tf keras metrics
from mil.metrics import AUC

# loading dataset
(bags_train, y_train), (bags_test, y_test) = musk1.load()

# instantiate trainer
trainer = Trainer()

# preparing trainer
metrics = ['acc', AUC]
model = SVC(kernel='linear', C=1, class_weight='balanced')
pipeline = [('scale', StandarizerBagsList()), ('disc_mapping', MILESMapping())]
trainer.prepare(model, preprocess_pipeline=pipeline, metrics=metrics)

# fitting trainer
valid = LeaveOneOut()
history = trainer.fit(bags_train, y_train, sample_weights='balanced', validation_strategy=valid, verbose=1)

# printing validation results for each fold
print(history['metrics_val'])

# predicting metrics for the test set
trainer.predict_metrics(bags_test, y_test)

For more examples, check the examples subdirectory.


Contributing

Pull requests are welcome. Priority items are in the To-do-list. For major changes, please open an issue first to discuss what you would like to change. Also, please make sure to update the tests when appropriate.


To-do-list

Pending tasks to do:

  • Implement other algorithms, such as the SVM-based ones.
  • Write a C/C++ extension of the APR algorithm to make it run faster.
  • Write a C/C++ extension of the MILESMapping algorithm to make it run faster.
  • MILESMapping generates a symmetric matrix of bag-instance similarities; optimize it to compute only half of the matrix, and apply other optimizations to reduce time and space complexity.
  • Implement get_positive_instances for the MILES model.
  • Implement logging and replace print function calls.
  • Add parallelization to the Trainer, for example across the validation folds.
  • Implement a Tuner class for hyperparameter tuning.
  • Implement Callbacks for use with the Trainer.
  • Add one-cycle learning rate scheduling to the optimizers of KerasClassifier models.
  • For the Trainer class, use the epoch with the best validation loss when computing metrics; right now, when evaluating a KerasClassifier model, the metrics are taken from the last epoch.

License

MIT License

mil's People

Contributors

gtholpadiperitusai, rosasalberto


mil's Issues

Libraries Versions

Hey there,

I'm having some trouble with the library and sorting out the dependencies. According to the docs, I need to install these:

pip install numpy
pip install scikit-learn
pip install scipy
pip install tensorflow

I used pip, and the versions were chosen automatically. But now, TensorFlow and Keras are giving me a headache. Any chance someone could share the working versions for these?

Here's what I've figured out so far:

Keras should be version 2.6 or lower.
TensorFlow should be version 2.8 or lower.

The tricky part is, every time I try downgrading to fix one error, a new one pops up. Any tips or shared wisdom on getting the right mix of versions would be a lifesaver.

Thanks a bunch! 🙌

Install From Source

I would like to install this package from source but there is no setup.py file. Is there a simple way to install this package from source for development?

Error loading data when installed with pip

>>> from mil.data.datasets import musk1
>>> (bags_train, y_train), (bags_test, y_test) = musk1.load()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/mil/data/datasets/musk1.py", line 7, in load
    return load_data(current_file + './csv/musk1.csv')
  File "/usr/local/lib/python3.7/site-packages/mil/data/datasets/loader.py", line 10, in load_data
    df = pd.read_csv(filepath, header=None)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 673, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File /usr/local/lib/python3.7/site-packages/mil/data/datasets./csv/musk1.csv does not exist: '/usr/local/lib/python3.7/site-packages/mil/data/datasets./csv/musk1.csv'

The package was installed with pip.

Error Training DeepAttentionMIL

(bags_train, y_train), (bags_test, y_test) = musk1.load()
trainer = Trainer()
metrics = [AUC, BinaryAccuracy]
model = AttentionDeepPoolingMil(gated=False, threshold=0.4)
pipeline = [('scale', StandarizerBagsList()), ('padding', Padding())]
trainer.prepare(model, preprocess_pipeline=pipeline ,metrics=metrics)
valid = KFold(n_splits=10, shuffle=True)
history = trainer.fit(bags_train, y_train, validation_strategy=valid, sample_weights='balanced',
                              verbose=1, model__epochs=10, model__batch_size=2, model__verbose=0)

gives the following error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Users\amcdaniel39\Anaconda3\envs\emade37\lib\site-packages\mil\trainer\trainer.py", line 124, in fit
    self.__eval_training_data(X_train, y_train, **kwargs)
  File "C:\Users\amcdaniel39\Anaconda3\envs\emade37\lib\site-packages\mil\trainer\trainer.py", line 157, in __eval_training_data
    self.pipeline.fit(X_train, y_train, model__sample_weight=sample_weights, **kwargs)
  File "C:\Users\amcdaniel39\Anaconda3\envs\emade37\lib\site-packages\sklearn\pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\Users\amcdaniel39\Anaconda3\envs\emade37\lib\site-packages\tensorflow\python\keras\wrappers\scikit_learn.py", line 223, in fit
    return super(KerasClassifier, self).fit(x, y, **kwargs)
  File "C:\Users\amcdaniel39\Anaconda3\envs\emade37\lib\site-packages\tensorflow\python\keras\wrappers\scikit_learn.py", line 166, in fit
    history = self.model.fit(x, y, **fit_args)
  File "C:\Users\amcdaniel39\Anaconda3\envs\emade37\lib\site-packages\tensorflow\python\keras\engine\training.py", line 709, in fit
    shuffle=shuffle)
  File "C:\Users\amcdaniel39\Anaconda3\envs\emade37\lib\site-packages\tensorflow\python\keras\engine\training.py", line 2673, in _standardize_user_data
    exception_prefix='target')
  File "C:\Users\amcdaniel39\Anaconda3\envs\emade37\lib\site-packages\tensorflow\python\keras\engine\training_utils.py", line 346, in standardize_input_data
    str(len(data)) + ' arrays: ' + str(data)[:200] + '...')
ValueError: Error when checking model target: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1
arrays: ...

The error implies that the model expects 2 arrays for the training targets, while the example for DeepAttentionMIL passes a single one. I am using TensorFlow v1.14.0 and sklearn v0.24.2. Is there a specific version that should be used for either of those dependencies?

multi classification

Hi,

Great job! I am looking for multi-class MIL. Could you please guide me on how to use your package for my case?
thanks!
