skent259 / mildsvm

Multiple Instance Learning with Distributions, SVM

License: Other
Language: R (100%)
Topics: multiple-instance-learning, ordinal, r, svm, weakly-supervised-learning, distributional-data

mildsvm

Weakly supervised (WS), multiple instance (MI) data arises in numerous applications such as drug discovery, object detection, and tumor prediction on whole slide images. The mildsvm package provides an easy way to learn from such data by training Support Vector Machine (SVM)-based classifiers. It also contains helper functions for building and printing multiple instance data frames.

The mildsvm package implements methods that cover a variety of data types, including:

  • ordinal and binary labels
  • weakly supervised and traditional supervised structures
  • vector-based and distributional-instance rows of data

A full table of functions with references is available below. We highlight two methods based on recent research:

  • omisvm() runs a novel OMI-SVM approach for ordinal, multiple instance (weakly supervised) data based on the work of Kent and Yu (2022+)
  • mismm() runs the MI-SMM approach for binary, weakly supervised data where each instance can be thought of as a matrix of draws from a distribution. This non-convex SVM approach is formalized and applied to breast cancer diagnosis based on morphological features of the tumor microenvironment in Kent and Yu (2022).

Usage

A typical MI data frame (a mi_df) with ordinal labels might look like this, with multiple rows of information for each of the bag_names involved and a label that matches each bag:

library(mildsvm)
data("ordmvnorm")

print(ordmvnorm)
#> # An MI data frame: 1,000 × 7 with 200 bags
#> # and instance labels: 1, 1, 2, 1, 1, ...
#>    bag_label bag_name    V1     V2      V3       V4     V5
#>  *     <int>    <int> <dbl>  <dbl>   <dbl>    <dbl>  <dbl>
#>  1         2        1 1.55  -0.977  1.33   -0.659   -0.694
#>  2         2        1 0.980 -2.10  -0.618   2.15    -0.718
#>  3         2        1 6.16  -0.275  2.07   -0.624    0.444
#>  4         2        1 2.90  -2.15  -0.0407 -0.0629   1.38 
#>  5         2        1 2.62  -1.70   1.35   -1.66     1.23 
#>  6         4        2 3.39  -0.927  1.95    0.216   -0.164
#>  7         4        2 3.05  -0.930  1.34   -0.457    0.362
#>  8         4        2 6.63  -4.57   4.66   -0.00729  1.03 
#>  9         4        2 4.38  -0.714  2.32    0.0996   0.379
#> 10         4        2 2.43  -4.28   1.08    0.283   -1.14 
#> # … with 990 more rows
# dplyr::distinct(ordmvnorm, bag_label, bag_name)
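The bag structure (one shared label across all of a bag's rows, as the commented dplyr::distinct() call suggests) can be sketched in base R on a toy data frame; the values below are illustrative only, not taken from ordmvnorm:

```r
# Toy MI data frame: every row of a bag shares the bag's label
df <- data.frame(
  bag_label = c(2, 2, 2, 4, 4),
  bag_name  = c(1, 1, 1, 2, 2),
  V1        = c(1.55, 0.98, 6.16, 3.39, 3.05)
)

# One row per bag, equivalent to dplyr::distinct(df, bag_label, bag_name)
bags <- unique(df[, c("bag_label", "bag_name")])
print(bags)
```

Running the same idea on ordmvnorm recovers the 200 bag-level labels from the 1,000 instance rows.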

The mildsvm package uses the familiar formula and predict() methods that R users will be familiar with. To indicate that MI data is involved, we specify the bag label and bag name with mi(bag_label, bag_name) ~ predictors:

fit <- omisvm(mi(bag_label, bag_name) ~ V1 + V2 + V3,
              data = ordmvnorm, 
              weights = NULL)
print(fit)
#> An misvm object called with omisvm.formula 
#>  
#> Parameters: 
#>   method: qp-heuristic 
#>   kernel: linear  
#>   cost: 1 
#>   h: 1 
#>   s: 4 
#>   scale: TRUE 
#>   weights: FALSE 
#>  
#> Model info: 
#>   Levels of `y`: chr [1:5] "1" "2" "3" "4" "5"
#>   Features: chr [1:3] "V1" "V2" "V3"
#>   Number of iterations: 4
predict(fit, new_data = ordmvnorm)
#> # A tibble: 1,000 × 1
#>    .pred_class
#>    <fct>      
#>  1 2          
#>  2 2          
#>  3 2          
#>  4 2          
#>  5 2          
#>  6 4          
#>  7 4          
#>  8 4          
#>  9 4          
#> 10 4          
#> # … with 990 more rows

Or, if the data frame has the mi_df class, we can directly pass it to the function and all features will be included:

fit2 <- omisvm(ordmvnorm)
#> Warning: Weights are not currently implemented for `omisvm()` when `kernel ==
#> 'linear'`.
print(fit2)
#> An misvm object called with omisvm.mi_df 
#>  
#> Parameters: 
#>   method: qp-heuristic 
#>   kernel: linear  
#>   cost: 1 
#>   h: 1 
#>   s: 4 
#>   scale: TRUE 
#>   weights: FALSE 
#>  
#> Model info: 
#>   Levels of `y`: chr [1:5] "1" "2" "3" "4" "5"
#>   Features: chr [1:5] "V1" "V2" "V3" "V4" "V5"
#>   Number of iterations: 3

Installation

You can install the released version of mildsvm from CRAN with:

install.packages("mildsvm")

Alternatively, you can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("skent259/mildsvm")

Additional Usage

mildsvm also works well on MI data with distributional instances. Here there is a 3-level structure with bags, instances, and samples. As in MIL, instances are contained within bags (and we only observe the bag label). However, in MILD, each instance represents a distribution, and the samples are drawn from that distribution.
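This 3-level layout can be sketched in base R; the bag and instance names below mimic the style generate_mild_df() produces, but the data frame itself is a toy construction:

```r
set.seed(1)
# 2 bags x 2 instances x 3 samples each
grid <- expand.grid(sample = 1:3, instance = 1:2, bag = 1:2)
mild_toy <- data.frame(
  bag_label     = ifelse(grid$bag == 2, 1, 0),  # label observed at bag level only
  bag_name      = paste0("bag", grid$bag),
  instance_name = paste0("bag", grid$bag, "inst", grid$instance),
  # samples drawn from each instance's distribution (positive bags shifted)
  X1 = rnorm(nrow(grid), mean = ifelse(grid$bag == 2, 3, 0))
)
str(mild_toy)
```

Each instance_name groups the samples drawn from one instance's distribution, and each bag_name groups instances under one observed label.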

You can generate MILD data with generate_mild_df():

# Normal(mean=0, sd=1) vs Normal(mean=3, sd=1)
set.seed(4)
mild_df <- generate_mild_df(
  ncov = 1, nimp_pos = 1, nimp_neg = 1, 
  positive_dist = "mvnormal", positive_mean = 3,
  negative_dist = "mvnormal", negative_mean = 0, 
  nbag = 4,
  ninst = 2, 
  nsample = 2
)
print(mild_df)
#> # An MILD data frame: 16 × 4 with 4 bags, 8 instances
#> # and instance labels: 0, 0, 0, 0, 0, ...
#>    bag_label bag_name instance_name      X1
#>        <dbl> <chr>    <chr>           <dbl>
#>  1         0 bag1     bag1inst1      1.51  
#>  2         0 bag1     bag1inst1     -0.463 
#>  3         0 bag1     bag1inst2      1.79  
#>  4         0 bag1     bag1inst2      1.67  
#>  5         0 bag2     bag2inst1      0.299 
#>  6         0 bag2     bag2inst1      0.666 
#>  7         0 bag2     bag2inst2      0.0118
#>  8         0 bag2     bag2inst2      0.146 
#>  9         1 bag3     bag3inst1      0.546 
#> 10         1 bag3     bag3inst1      0.473 
#> 11         1 bag3     bag3inst2      1.94  
#> 12         1 bag3     bag3inst2      1.25  
#> 13         1 bag4     bag4inst1      1.11  
#> 14         1 bag4     bag4inst1      0.768 
#> 15         1 bag4     bag4inst2      0.111 
#> 16         1 bag4     bag4inst2     -0.290

You can train an MI-SMM classifier using mismm() on the MILD data with the mild() formula specification:

fit3 <- mismm(mild(bag_label, bag_name, instance_name) ~ X1, data = mild_df, cost = 100)

# summarize predictions at the bag layer
library(dplyr)
mild_df %>% 
  dplyr::bind_cols(predict(fit3, mild_df, type = "raw")) %>% 
  dplyr::bind_cols(predict(fit3, mild_df, type = "class")) %>% 
  dplyr::distinct(bag_label, bag_name, .pred, .pred_class)
#> # A tibble: 4 × 4
#>   bag_label bag_name  .pred .pred_class
#>       <dbl> <chr>     <dbl> <fct>      
#> 1         0 bag1     -1.18  0          
#> 2         0 bag2      0.482 1          
#> 3         1 bag3      1.00  1          
#> 4         1 bag4      1.00  1

If you summarize a MILD data set (for example, by taking the mean of each covariate), you can recover a MIL data set. Use summarize_samples() for this:

mil_df <- summarize_samples(mild_df, .fns = list(mean = mean)) 
print(mil_df)
#> # A tibble: 8 × 4
#>   bag_label bag_name instance_name    mean
#>       <dbl> <chr>    <chr>           <dbl>
#> 1         0 bag1     bag1inst1      0.522 
#> 2         0 bag1     bag1inst2      1.73  
#> 3         0 bag2     bag2inst1      0.483 
#> 4         0 bag2     bag2inst2      0.0791
#> 5         1 bag3     bag3inst1      0.510 
#> 6         1 bag3     bag3inst2      1.59  
#> 7         1 bag4     bag4inst1      0.941 
#> 8         1 bag4     bag4inst2     -0.0896
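summarize_samples() collapses each instance's samples with the supplied functions; an equivalent in base R, using aggregate() on a hand-made MILD-style data frame (all names and values here are hypothetical), might look like:

```r
# A small MILD-style data frame: 2 instances, 2 samples each
mild_toy <- data.frame(
  bag_label     = c(0, 0, 1, 1),
  bag_name      = c("bag1", "bag1", "bag2", "bag2"),
  instance_name = c("bag1inst1", "bag1inst1", "bag2inst1", "bag2inst1"),
  X1            = c(1.0, 3.0, 2.0, 4.0)
)

# One row per instance, samples collapsed to their mean
mil_toy <- aggregate(X1 ~ bag_label + bag_name + instance_name,
                     data = mild_toy, FUN = mean)
print(mil_toy)
```

The result has the bag/instance structure of a MIL data set, with one summary row per instance.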

You can train an MI-SVM classifier using misvm() on MIL data with the helper function mi():

fit4 <- misvm(mi(bag_label, bag_name) ~ mean, data = mil_df, cost = 100)

print(fit4)
#> An misvm object called with misvm.formula 
#>  
#> Parameters: 
#>   method: heuristic 
#>   kernel: linear  
#>   cost: 100 
#>   scale: TRUE 
#>   weights: ('0' = 0.5, '1' = 1) 
#>  
#> Model info: 
#>   Features: chr "mean"
#>   Number of iterations: 2

Methods implemented

Function       Method          Outcome/label  Data type              Extra libraries  Reference
omisvm()       "qp-heuristic"  ordinal        MI                     gurobi           [1]
mismm()        "heuristic"     binary         distributional MI                       [2]
mismm()        "mip"           binary         distributional MI      gurobi           [2]
mismm()        "qp-heuristic"  binary         distributional MI      gurobi           [2]
misvm()        "heuristic"     binary         MI                                      [3]
misvm()        "mip"           binary         MI                     gurobi           [3], [2]
misvm()        "qp-heuristic"  binary         MI                     gurobi           [3]
mior()         "qp-heuristic"  ordinal        MI                     gurobi           [4]
misvm_orova()  "heuristic"     ordinal        MI                                      [3], [1]
misvm_orova()  "mip"           ordinal        MI                     gurobi           [3], [1]
misvm_orova()  "qp-heuristic"  ordinal        MI                     gurobi           [3], [1]
svor_exc()     "smo"           ordinal        vector                                  [5]
smm()                          binary         distributional vector                   [6]

Table acronyms

  • MI: multiple instance
  • SVM: support vector machine
  • SMM: support measure machine
  • OR: ordinal regression
  • OVA: one-vs-all
  • MIP: mixed integer programming
  • QP: quadratic programming
  • SVOR: support vector ordinal regression
  • EXC: explicit constraints
  • SMO: sequential minimal optimization

References

[1] Kent, S., & Yu, M. (2022+). Ordinal multiple instance support vector machines. In prep.

[2] Kent, S., & Yu, M. (2022). Non-convex SVM for cancer diagnosis based on morphologic features of tumor microenvironment. arXiv preprint arXiv:2206.14704.

[3] Andrews, S., Tsochantaridis, I., & Hofmann, T. (2002). Support vector machines for multiple-instance learning. Advances in neural information processing systems, 15.

[4] Xiao, Y., Liu, B., & Hao, Z. (2017). Multiple-instance ordinal regression. IEEE Transactions on Neural Networks and Learning Systems, 29(9), 4398-4413.

[5] Chu, W., & Keerthi, S. S. (2007). Support vector ordinal regression. Neural computation, 19(3), 792-815.

[6] Muandet, K., Fukumizu, K., Dinuzzo, F., & Schölkopf, B. (2012). Learning from distributions via support measure machines. Advances in neural information processing systems, 25.


mildsvm's Issues

Separate mildsvm into several packages?

Wondering whether mildsvm has become too large and would be better split into a few packages that work together:

  • multipleinstance: Contains the main methods for creating/using mi_df and mild_df objects.
  • misvm: Contains svm-based multiple instance methods, including misvm, omisvm, misvm_orova, mior, etc
  • (possibly) mildsvm: Contains the additional tools for distributional data. Could be wrapped into misvm
  • (possibly) kernelmaps: Contains the tools for kernel feature maps (nystrom mapping, exact mapping, etc)

The idea is still a work in progress and will likely change.

Fix GitHub actions

Want both R CMD check and Coverage to work in GitHub actions.

The main problem is an error in the workflow with the Gurobi package. Details appear in r-lib/actions#582, which is currently unresolved.

Improve argument handling in model functions

There is a lot of repeated code copied and pasted at the start of most functions. As the complexity of this package grows, it may be helpful to have a standard argument-handling function, with some potential sub-functions.

This may not shorten the code (such checking is likely annoying to do in R), but it could make things more manageable in the future.

Dual QP - find a better way to compute `b_`

In the kernel SVM dual formulation, the calculation of the intercept (b_) should use only the alpha values in the interior of the $0 < \alpha_i < C$ constraint. The current implementation calculates b_ from alpha > 0; I need to double-check the other side of this constraint and how the MI-SVM dual relates to it.
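The constraint in question can be illustrated on a generic kernel SVM dual solution; the following is a hedged base-R sketch with synthetic values, not the package's implementation:

```r
# Synthetic dual solution for a kernel SVM with cost C = 1
C     <- 1
alpha <- c(0, 0.4, 1, 0.7)  # only 0 < alpha < C rows are "interior"
y     <- c(-1, 1, -1, 1)
K     <- diag(4)            # stand-in kernel matrix
eps   <- 1e-8

# Interior support vectors satisfy y_i * f(x_i) = 1 exactly, so each one
# gives an estimate of b; averaging them is the usual numerically stable choice
interior <- which(alpha > eps & alpha < C - eps)
b_each   <- y[interior] - (K %*% (alpha * y))[interior]
b        <- mean(b_each)
```

Rows with alpha == 0 or alpha == C (indices 1 and 3 here) are excluded, which is the other side of the constraint the current alpha > 0 implementation misses.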

Camel case and snake case

Many objects in the package have a mix of CamelCase and snake_case. There are also some parameters and variable names with whatever.case.this.is. Stylistically, this is not ideal and should be addressed before the initial release.

Problems in order of priority:

  • function names
  • object classes
  • function file names (see #1)
  • function parameters
  • variables within functions

Warning message "setting levels" occurs consistently

I think this is coming from the call to pROC::auc or pROC::roc, but it would be nice to have this warning message not show up.

Quick-fix: use suppressWarnings
Better-fix: figure out what the input needs to have to remove the warning

Add open-source version of gurobi optimizations

Currently, methods 'mip' and 'qp-heuristic' for misvm() and mildsvm() are only implemented with Gurobi, which is not open source. Consider adding an open-source implementation of these, if possible.

Related to #8

Add `mi_df()` function and class

This would be similar to the mild_df() function and class. Would include the following methods:

  • print.mi_df()
  • as.mi_df()

Want to allow the function misvm() and others to use this method for easy function calling

Potential issue in formulas with spaced colnames

I think that if a data frame has a space in a column name and a formula mi(y, bags) ~ . is used, the function will error on some versions of R. It seems to work fine locally, but it would be good to figure out which versions fail.

Add testing for utils-model.R file

These were all refactored from old code, so I have some confidence that they work. However, best practice would be to add more testing of these directly and then reduce the testing in old model files that used to check for things like this.

Clean up documentation

  • Remove instances of ##' in favor of #'
  • Add @inheritParams where appropriate
  • Make sure all functions have descriptions
  • Make sure all functions have examples
  • Make sure all functions have returns

Comparison methods function improvement

Allow the MI-SVM method that works on MilData to have a function list passed to it as an argument. Currently, the options are limited and require hard-coding to add new summary features.

Fix minor documentation issues

  • omisvm
    • control has extra parameters, need to adjust
    • weight only used in 'radial' kernel
  • predict.omisvm
    • needs ordinal in type
  • mior
    • method incorrect, should only have qp-heuristic
    • weights currently not used

Worth checking other documentation as well for minor errors

Fix `predict.mildsvm()` cutoffs

In practice, the predict method scores for mildsvm may be far from the cutoff of 0. It may be better to use a data-driven cutoff or add a parameter to let the user specify one; otherwise, there is a risk of poor performance.
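One data-driven option is to pick the training-score threshold that maximizes accuracy; this sketch uses synthetic scores and labels, not mildsvm's predict output:

```r
# Synthetic raw scores and true bag labels; the default cutoff of 0 performs
# poorly here because all scores are shifted positive
scores <- c(1.2, 1.5, 0.9, 2.5, 3.1, 2.8)
labels <- c(0, 0, 0, 1, 1, 1)

# Candidate thresholds at midpoints between sorted scores; keep the one
# maximizing accuracy on the (training) data
s      <- sort(scores)
cand   <- (s[-1] + s[-length(s)]) / 2
acc    <- sapply(cand, function(t) mean((scores > t) == (labels == 1)))
cutoff <- cand[which.max(acc)]
```

Here the chosen cutoff (2.0) separates the labels perfectly, while a cutoff of 0 classifies every bag as positive.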

Restructure functions within files and rename

Currently, the function files have poor naming and the functions within them could be better organized.

Thinking to have the following files

  • mildsvm.R -> renamed from mil_distribution.R
  • misvm.R -> largely renamed from comparison_methods.R
  • smm.R -> renamed from SMM.R, include the smm methods from comparison_methods.R
  • generate_mild_data.R -> renamed from GenerateData.R
  • bkfm_functions.R -> renamed from feature_map.R once the bkfm family of functions gets built
  • kme.R, probably fine as it is, think about this one

Fix `print.omisvm()`

Should say 'omisvm' object, not 'misvm' object

Check other print methods also

Predict.mildsvm doesn't need new_data if kernel is passed

new_data is somewhat redundant when the kernel is passed along with new_bags and new_instances. At the very least, new_data shouldn't need these columns. Update the logic in predict.mildsvm() to reflect this and add a few tests to check that it works.

MI-SMM, MI-SVM methods create objects with different components

Need to unify the list components of these objects (as well as possible, ok to have some differences), and add a method parameter into the constructor. The method parameter gets used in the "predict" function to create a unified set of outputs.

Some initial steps to think about

  • Build mildsvm object first with the mild object in mind
  • Write predict.mildsvm with predict.mild in mind
  • Adjust everything that makes a mild object to be a mildsvm object
  • Incorporate the mild predict into mildsvm predict

Add a couple toy data sets

It would be ideal to have a couple of toy data sets that could simplify the examples and testing. Another perk is to standardize the data across the examples and make it easier to have meaningful toy data in them

select_cv_folds2() could be improved

Right now it doesn't provide exactly even folds when it could. This issue stems from evenly splitting the positive and negative bags separately, but could be improved
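The stratified scheme can be sketched in base R; assign_folds() below is a hypothetical re-implementation of the idea, not the package's select_cv_folds2() code:

```r
set.seed(2)
k <- 3
pos_bags <- paste0("p", 1:7)  # 7 positive bags
neg_bags <- paste0("n", 1:5)  # 5 negative bags

# Shuffle a repeated 1..k sequence so, within each stratum, fold sizes
# differ by at most one
assign_folds <- function(x, k) sample(rep_len(seq_len(k), length(x)))
pf <- assign_folds(pos_bags, k)
nf <- assign_folds(neg_bags, k)

# Stratum-wise balance holds, but the combined fold sizes can still be
# uneven -- the imbalance described above
table(c(pf, nf))
```

With 12 bags and 3 folds, exactly even folds of 4 are possible, yet splitting the strata independently can give combined sizes like 5/4/3.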

Improve efficiency of `cv_misvm()`

Currently, cv_misvm() doesn't gain any efficiency from pre-computing the kernel matrix. It would be great to add logic that helps with that; optimal would be only one kernel computation per value of cost. Note that `misvm(..., method = "heuristic")` currently does not support a pre-computed kernel and would need separate logic.

Add some scaling into feature map functions

When the dataset is unscaled and passed to misvm() directly, even with the scale = TRUE parameter, it doesn't get scaled in the kfm function. This causes a poorly scaled eigen-spectrum and makes learning almost impossible for the SVM.

One solution is to add the scaling directly into the feature map functions, which is ideal anyways.
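That solution can be sketched in base R: center and scale with the training statistics before any feature map is applied, and reuse those statistics at predict time. This is a hedged illustration, not the package's kfm code:

```r
set.seed(3)
# Unscaled training features with wildly different column scales
X <- cbind(x1 = rnorm(10, sd = 1000), x2 = rnorm(10, sd = 0.01))

# Center/scale and keep the training statistics so new data can be
# transformed identically before the feature map is applied
Xs     <- scale(X)
center <- attr(Xs, "scaled:center")
sds    <- attr(Xs, "scaled:scale")

X_new  <- cbind(x1 = rnorm(3, sd = 1000), x2 = rnorm(3, sd = 0.01))
X_news <- scale(X_new, center = center, scale = sds)

round(apply(Xs, 2, sd), 8)  # each training column now has unit sd
```

Storing center and sds inside the feature map object would let predict-time data go through the identical transformation.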

Open Source version of MIP problem?

Currently, the MIP solution for MI-SVM and MI-SMM uses the gurobi backend, which is not open source. It would be nice to additionally have an open-source MIQP solver backend to use, even if it is a bit slower.

Problem: I can't find one in R. I checked the ROI package documentation (https://epub.wu.ac.at/5858/1/ROI_StatReport.pdf), but it doesn't seem like there is an R package that solves MIQP. I can only find MILP or QP.

No option to pass kernel parameters to misvm heuristic

Currently, in the dev version there isn't a way to pass kernel parameters to misvm() when it runs in the dual. The best way to do this is unclear, besides adding them to control and ultimately passing them to e1071::svm.

Add linear kernel method in kme

It seems possible that different kernels could be useful in kme(), and the code is set up reasonably to allow for that. The big problem is that this would require a new parameter for many of the different methods, though maybe it could be pushed into the ... argument; sigma gets passed that way anyways.

I need to think about this more. All of the functions that use kme() can take a pre-computed kernel matrix, so in theory this issue isn't urgent. But it would be better to tackle it before the first release if it's going to be tackled at all.
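For intuition on the linear case: with a linear base kernel, the kernel mean embedding inner product between two instances reduces to the inner product of their sample means. A base-R sketch (not kme()'s implementation) on two synthetic instances:

```r
set.seed(4)
# Two instances, each a matrix of samples (rows) over 2 features
A <- matrix(rnorm(10), ncol = 2)  # 5 samples
B <- matrix(rnorm(6),  ncol = 2)  # 3 samples

# KME inner product with linear base kernel k(x, y) = x'y:
# average k over all sample pairs...
kme_linear <- mean(A %*% t(B))

# ...which equals the inner product of the two sample means
via_means <- sum(colMeans(A) * colMeans(B))
all.equal(kme_linear, via_means)
```

This is why a linear-kernel kme() would be cheap: no pairwise loop is needed, only the per-instance means.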
