skent259 / mildsvm

Multiple Instance Learning with Distributions, SVM

License: Other
Language: R (100%)
Topics: multiple-instance-learning, ordinal, r, svm, weakly-supervised-learning, distributional-data

mildsvm

Weakly supervised (WS), multiple instance (MI) data arises in numerous applications such as drug discovery, object detection, and tumor prediction on whole slide images. The mildsvm package provides an easy way to learn from such data by training Support Vector Machine (SVM)-based classifiers. It also contains helper functions for building and printing multiple instance data frames.

The mildsvm package implements methods that cover a variety of data types, including:

  • ordinal and binary labels
  • weakly supervised and traditional supervised structures
  • vector-based and distributional-instance rows of data

A full table of functions with references is available below. We highlight two methods based on recent research:

  • omisvm() runs a novel OMI-SVM approach for ordinal, multiple instance (weakly supervised) data based on the work of Kent and Yu (2022+)
  • mismm() runs the MI-SMM approach for binary, weakly supervised data where each instance can be thought of as a matrix of draws from a distribution. This non-convex SVM approach is formalized and applied to breast cancer diagnosis based on morphological features of the tumor microenvironment in Kent and Yu (2022).

Usage

A typical MI data frame (a mi_df) with ordinal labels might look like this, with multiple rows of information for each of the bag_names involved and a label that matches each bag:

library(mildsvm)
data("ordmvnorm")

print(ordmvnorm)
#> # An MI data frame: 1,000 × 7 with 200 bags
#> # and instance labels: 1, 1, 2, 1, 1, ...
#>    bag_label bag_name    V1     V2      V3       V4     V5
#>  *     <int>    <int> <dbl>  <dbl>   <dbl>    <dbl>  <dbl>
#>  1         2        1 1.55  -0.977  1.33   -0.659   -0.694
#>  2         2        1 0.980 -2.10  -0.618   2.15    -0.718
#>  3         2        1 6.16  -0.275  2.07   -0.624    0.444
#>  4         2        1 2.90  -2.15  -0.0407 -0.0629   1.38 
#>  5         2        1 2.62  -1.70   1.35   -1.66     1.23 
#>  6         4        2 3.39  -0.927  1.95    0.216   -0.164
#>  7         4        2 3.05  -0.930  1.34   -0.457    0.362
#>  8         4        2 6.63  -4.57   4.66   -0.00729  1.03 
#>  9         4        2 4.38  -0.714  2.32    0.0996   0.379
#> 10         4        2 2.43  -4.28   1.08    0.283   -1.14 
#> # … with 990 more rows
# dplyr::distinct(ordmvnorm, bag_label, bag_name)
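The bag structure (one shared label across all of a bag's rows, as the commented dplyr::distinct() call suggests) can be sketched in base R on a toy data frame; the values below are illustrative only, not taken from ordmvnorm:

```r
# Toy MI data frame: every row of a bag shares the bag's label
df <- data.frame(
  bag_label = c(2, 2, 2, 4, 4),
  bag_name  = c(1, 1, 1, 2, 2),
  V1        = c(1.55, 0.98, 6.16, 3.39, 3.05)
)

# One row per bag, equivalent to dplyr::distinct(df, bag_label, bag_name)
bags <- unique(df[, c("bag_label", "bag_name")])
print(bags)
```

Running the same idea on ordmvnorm recovers the 200 bag-level labels from the 1,000 instance rows.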

The mildsvm package uses the familiar formula and predict() methods that R users will be familiar with. To indicate that MI data is involved, we specify the bag label and bag name with mi(bag_label, bag_name) ~ predictors:

fit <- omisvm(mi(bag_label, bag_name) ~ V1 + V2 + V3,
              data = ordmvnorm, 
              weights = NULL)
print(fit)
#> An misvm object called with omisvm.formula 
#>  
#> Parameters: 
#>   method: qp-heuristic 
#>   kernel: linear  
#>   cost: 1 
#>   h: 1 
#>   s: 4 
#>   scale: TRUE 
#>   weights: FALSE 
#>  
#> Model info: 
#>   Levels of `y`: chr [1:5] "1" "2" "3" "4" "5"
#>   Features: chr [1:3] "V1" "V2" "V3"
#>   Number of iterations: 4
predict(fit, new_data = ordmvnorm)
#> # A tibble: 1,000 × 1
#>    .pred_class
#>    <fct>      
#>  1 2          
#>  2 2          
#>  3 2          
#>  4 2          
#>  5 2          
#>  6 4          
#>  7 4          
#>  8 4          
#>  9 4          
#> 10 4          
#> # … with 990 more rows

Or, if the data frame has the mi_df class, we can directly pass it to the function and all features will be included:

fit2 <- omisvm(ordmvnorm)
#> Warning: Weights are not currently implemented for `omisvm()` when `kernel ==
#> 'linear'`.
print(fit2)
#> An misvm object called with omisvm.mi_df 
#>  
#> Parameters: 
#>   method: qp-heuristic 
#>   kernel: linear  
#>   cost: 1 
#>   h: 1 
#>   s: 4 
#>   scale: TRUE 
#>   weights: FALSE 
#>  
#> Model info: 
#>   Levels of `y`: chr [1:5] "1" "2" "3" "4" "5"
#>   Features: chr [1:5] "V1" "V2" "V3" "V4" "V5"
#>   Number of iterations: 3

Installation

You can install the released version of mildsvm from CRAN with:

install.packages("mildsvm")

Alternatively, you can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("skent259/mildsvm")

Additional Usage

mildsvm also works well on MI data with distributional instances. Here there is a 3-level structure with bags, instances, and samples. As in MIL, instances are contained within bags (and we only observe the bag label). However, in MILD, each instance represents a distribution, and the samples are drawn from that distribution.
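This 3-level layout can be sketched in base R; the bag and instance names below mimic the style generate_mild_df() produces, but the data frame itself is a toy construction:

```r
set.seed(1)
# 2 bags x 2 instances x 3 samples each
grid <- expand.grid(sample = 1:3, instance = 1:2, bag = 1:2)
mild_toy <- data.frame(
  bag_label     = ifelse(grid$bag == 2, 1, 0),  # label observed at bag level only
  bag_name      = paste0("bag", grid$bag),
  instance_name = paste0("bag", grid$bag, "inst", grid$instance),
  # samples drawn from each instance's distribution (positive bags shifted)
  X1 = rnorm(nrow(grid), mean = ifelse(grid$bag == 2, 3, 0))
)
str(mild_toy)
```

Each instance_name groups the samples drawn from one instance's distribution, and each bag_name groups instances under one observed label.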

You can generate MILD data with generate_mild_df():

# Normal(mean=0, sd=1) vs Normal(mean=3, sd=1)
set.seed(4)
mild_df <- generate_mild_df(
  ncov = 1, nimp_pos = 1, nimp_neg = 1, 
  positive_dist = "mvnormal", positive_mean = 3,
  negative_dist = "mvnormal", negative_mean = 0, 
  nbag = 4,
  ninst = 2, 
  nsample = 2
)
print(mild_df)
#> # An MILD data frame: 16 × 4 with 4 bags, 8 instances
#> # and instance labels: 0, 0, 0, 0, 0, ...
#>    bag_label bag_name instance_name      X1
#>        <dbl> <chr>    <chr>           <dbl>
#>  1         0 bag1     bag1inst1      1.51  
#>  2         0 bag1     bag1inst1     -0.463 
#>  3         0 bag1     bag1inst2      1.79  
#>  4         0 bag1     bag1inst2      1.67  
#>  5         0 bag2     bag2inst1      0.299 
#>  6         0 bag2     bag2inst1      0.666 
#>  7         0 bag2     bag2inst2      0.0118
#>  8         0 bag2     bag2inst2      0.146 
#>  9         1 bag3     bag3inst1      0.546 
#> 10         1 bag3     bag3inst1      0.473 
#> 11         1 bag3     bag3inst2      1.94  
#> 12         1 bag3     bag3inst2      1.25  
#> 13         1 bag4     bag4inst1      1.11  
#> 14         1 bag4     bag4inst1      0.768 
#> 15         1 bag4     bag4inst2      0.111 
#> 16         1 bag4     bag4inst2     -0.290

You can train an MI-SMM classifier using mismm() on the MILD data with the mild() formula specification:

fit3 <- mismm(mild(bag_label, bag_name, instance_name) ~ X1, data = mild_df, cost = 100)

# summarize predictions at the bag layer
library(dplyr)
mild_df %>% 
  dplyr::bind_cols(predict(fit3, mild_df, type = "raw")) %>% 
  dplyr::bind_cols(predict(fit3, mild_df, type = "class")) %>% 
  dplyr::distinct(bag_label, bag_name, .pred, .pred_class)
#> # A tibble: 4 × 4
#>   bag_label bag_name  .pred .pred_class
#>       <dbl> <chr>     <dbl> <fct>      
#> 1         0 bag1     -1.18  0          
#> 2         0 bag2      0.482 1          
#> 3         1 bag3      1.00  1          
#> 4         1 bag4      1.00  1

If you summarize a MILD data set (for example, by taking the mean of each covariate), you can recover a MIL data set. Use summarize_samples() for this:

mil_df <- summarize_samples(mild_df, .fns = list(mean = mean)) 
print(mil_df)
#> # A tibble: 8 × 4
#>   bag_label bag_name instance_name    mean
#>       <dbl> <chr>    <chr>           <dbl>
#> 1         0 bag1     bag1inst1      0.522 
#> 2         0 bag1     bag1inst2      1.73  
#> 3         0 bag2     bag2inst1      0.483 
#> 4         0 bag2     bag2inst2      0.0791
#> 5         1 bag3     bag3inst1      0.510 
#> 6         1 bag3     bag3inst2      1.59  
#> 7         1 bag4     bag4inst1      0.941 
#> 8         1 bag4     bag4inst2     -0.0896
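summarize_samples() collapses each instance's samples with the supplied functions; an equivalent in base R, using aggregate() on a hand-made MILD-style data frame (all names and values here are hypothetical), might look like:

```r
# A small MILD-style data frame: 2 instances, 2 samples each
mild_toy <- data.frame(
  bag_label     = c(0, 0, 1, 1),
  bag_name      = c("bag1", "bag1", "bag2", "bag2"),
  instance_name = c("bag1inst1", "bag1inst1", "bag2inst1", "bag2inst1"),
  X1            = c(1.0, 3.0, 2.0, 4.0)
)

# One row per instance, samples collapsed to their mean
mil_toy <- aggregate(X1 ~ bag_label + bag_name + instance_name,
                     data = mild_toy, FUN = mean)
print(mil_toy)
```

The result has the bag/instance structure of a MIL data set, with one summary row per instance.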

You can train an MI-SVM classifier using misvm() on MIL data with the helper function mi():

fit4 <- misvm(mi(bag_label, bag_name) ~ mean, data = mil_df, cost = 100)

print(fit4)
#> An misvm object called with misvm.formula 
#>  
#> Parameters: 
#>   method: heuristic 
#>   kernel: linear  
#>   cost: 100 
#>   scale: TRUE 
#>   weights: ('0' = 0.5, '1' = 1) 
#>  
#> Model info: 
#>   Features: chr "mean"
#>   Number of iterations: 2

Methods implemented

Function       Method          Outcome/label  Data type              Extra libraries  Reference
omisvm()       "qp-heuristic"  ordinal        MI                     gurobi           [1]
mismm()        "heuristic"     binary         distributional MI                       [2]
mismm()        "mip"           binary         distributional MI      gurobi           [2]
mismm()        "qp-heuristic"  binary         distributional MI      gurobi           [2]
misvm()        "heuristic"     binary         MI                                      [3]
misvm()        "mip"           binary         MI                     gurobi           [3], [2]
misvm()        "qp-heuristic"  binary         MI                     gurobi           [3]
mior()         "qp-heuristic"  ordinal        MI                     gurobi           [4]
misvm_orova()  "heuristic"     ordinal        MI                                      [3], [1]
misvm_orova()  "mip"           ordinal        MI                     gurobi           [3], [1]
misvm_orova()  "qp-heuristic"  ordinal        MI                     gurobi           [3], [1]
svor_exc()     "smo"           ordinal        vector                                  [5]
smm()                          binary         distributional vector                   [6]

Table acronyms

  • MI: multiple instance
  • SVM: support vector machine
  • SMM: support measure machine
  • OR: ordinal regression
  • OVA: one-vs-all
  • MIP: mixed integer programming
  • QP: quadratic programming
  • SVOR: support vector ordinal regression
  • EXC: explicit constraints
  • SMO: sequential minimal optimization

References

[1] Kent, S., & Yu, M. (2022+). Ordinal multiple instance support vector machines. In prep.

[2] Kent, S., & Yu, M. (2022). Non-convex SVM for cancer diagnosis based on morphologic features of tumor microenvironment. arXiv preprint arXiv:2206.14704.

[3] Andrews, S., Tsochantaridis, I., & Hofmann, T. (2002). Support vector machines for multiple-instance learning. Advances in neural information processing systems, 15.

[4] Xiao, Y., Liu, B., & Hao, Z. (2017). Multiple-instance ordinal regression. IEEE Transactions on Neural Networks and Learning Systems, 29(9), 4398-4413.

[5] Chu, W., & Keerthi, S. S. (2007). Support vector ordinal regression. Neural computation, 19(3), 792-815.

[6] Muandet, K., Fukumizu, K., Dinuzzo, F., & Schölkopf, B. (2012). Learning from distributions via support measure machines. Advances in neural information processing systems, 25.


mildsvm's Issues

Separate mildsvm into several packages?

Wondering whether mildsvm has become too large and would be better split into a few packages that work together:

  • multipleinstance: Contains the main methods for creating/using mi_df and mild_df objects.
  • misvm: Contains svm-based multiple instance methods, including misvm, omisvm, misvm_orova, mior, etc
  • (possibly) mildsvm: Contains the additional tools for distributional data. Could be wrapped into misvm
  • (possibly) kernelmaps: Contains the tools for kernel feature maps (nystrom mapping, exact mapping, etc)

The idea is still a work in progress and will likely change.

Fix GitHub actions

Want both R CMD check and Coverage to work in GitHub actions.

The main problem is an error in the workflow with the Gurobi package. Details appear in r-lib/actions#582, which is currently unresolved.

Improve argument handling in model functions

There is a lot of repeated code copied and pasted at the start of most functions. As the complexity of this package grows, it may be helpful to have a standard argument-handling function, with some potential sub-functions.

This may not shorten the code (such checking is likely annoying to do in R), but it could make things more manageable in the future.

Dual QP - find a better way to compute `b_`

In the kernel SVM dual formulation, the calculation of the intercept (b_) should use only the alpha values in the interior of the $0 < \alpha_i < C$ constraint. The current implementation calculates b_ from alpha > 0; I need to double-check the other side of this constraint and how the MI-SVM dual relates to it.
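The constraint in question can be illustrated on a generic kernel SVM dual solution; the following is a hedged base-R sketch with synthetic values, not the package's implementation:

```r
# Synthetic dual solution for a kernel SVM with cost C = 1
C     <- 1
alpha <- c(0, 0.4, 1, 0.7)  # only 0 < alpha < C rows are "interior"
y     <- c(-1, 1, -1, 1)
K     <- diag(4)            # stand-in kernel matrix
eps   <- 1e-8

# Interior support vectors satisfy y_i * f(x_i) = 1 exactly, so each one
# gives an estimate of b; averaging them is the usual numerically stable choice
interior <- which(alpha > eps & alpha < C - eps)
b_each   <- y[interior] - (K %*% (alpha * y))[interior]
b        <- mean(b_each)
```

Rows with alpha == 0 or alpha == C (indices 1 and 3 here) are excluded, which is the other side of the constraint the current alpha > 0 implementation misses.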

Camel case and snake case

Many objects in the package have a mix of CamelCase and snake_case. There are also some parameters and variable names with whatever.case.this.is. Stylistically, this is not ideal and should be addressed before the initial release.

Problems in order of priority:

  • function names
  • object classes
  • function file names (see #1)
  • function parameters
  • variables within functions

Warning message "setting levels" occurs consistently

I think this is coming from the call to pROC::auc or pROC::roc, but it would be nice to have this warning message not show up.

Quick-fix: use suppressWarnings
Better-fix: figure out what the input needs to have to remove the warning

Add open-source version of gurobi optimizations

Currently, methods 'mip' and 'qp-heuristic' for misvm() and mildsvm() are only implemented with Gurobi, which is not open source. Consider adding an open-source implementation of these, if possible.

Related to #8

Add `mi_df()` function and class

This would be similar to the mild_df() function and class. Would include the following methods:

  • print.mi_df()
  • as.mi_df()

Want to allow the function misvm() and others to use this method for easy function calling

Potential issue in formulas with spaced colnames

I think that if a data frame has a space in a column name and a formula mi(y, bags) ~ . is used, the function will error on some versions of R. It seems to work fine locally, but it would be good to figure out which versions fail.

Add testing for utils-model.R file

These were all refactored from old code, so I have some confidence that they work. However, best practice would be to add more testing of these directly and then reduce the testing in old model files that used to check for things like this.

Clean up documentation

  • Remove instances of ##' in favor of #'
  • Add @inheritParams where appropriate
  • Make sure all functions have descriptions
  • Make sure all functions have examples
  • Make sure all functions have returns

Comparison methods function improvement

Allow the MI-SVM method that works on MilData to have a function list passed to it as an argument. Currently, the options are limited and require hard-coding to add new summary features.

Fix minor documentation issues

  • omisvm
    • control has extra parameters, need to adjust
    • weight only used in 'radial' kernel
  • predict.omisvm
    • needs ordinal in type
  • mior
    • method incorrect, should only have qp-heuristic
    • weights currently not used

Worth checking other documentation as well for minor errors

Fix `predict.mildsvm()` cutoffs

In practice, the predict method scores for mildsvm may be far from the cutoff of 0. It may be better to use a data-driven cutoff or add a parameter to let the user specify one; otherwise, there is a risk of poor performance.
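One data-driven option is to pick the training-score threshold that maximizes accuracy; this sketch uses synthetic scores and labels, not mildsvm's predict output:

```r
# Synthetic raw scores and true bag labels; the default cutoff of 0 performs
# poorly here because all scores are shifted positive
scores <- c(1.2, 1.5, 0.9, 2.5, 3.1, 2.8)
labels <- c(0, 0, 0, 1, 1, 1)

# Candidate thresholds at midpoints between sorted scores; keep the one
# maximizing accuracy on the (training) data
s      <- sort(scores)
cand   <- (s[-1] + s[-length(s)]) / 2
acc    <- sapply(cand, function(t) mean((scores > t) == (labels == 1)))
cutoff <- cand[which.max(acc)]
```

Here the chosen cutoff (2.0) separates the labels perfectly, while a cutoff of 0 classifies every bag as positive.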

Restructure functions within files and rename

Currently, the function files have poor naming and the functions within them could be better organized.

Thinking to have the following files

  • mildsvm.R -> renamed from mil_distribution.R
  • misvm.R -> largely renamed from comparison_methods.R
  • smm.R -> renamed from SMM.R, include the smm methods from comparison_methods.R
  • generate_mild_data.R -> renamed from GenerateData.R
  • bkfm_functions.R -> renamed from feature_map.R once the bkfm family of functions gets built
  • kme.R, probably fine as it is, think about this one

Fix `print.omisvm()`

Should say 'omisvm' object, not 'misvm' object

Check other print methods also

Predict.mildsvm doesn't need new_data if kernel is passed

new_data is somewhat redundant when the kernel is passed along with new_bags and new_instances. At the very least, new_data shouldn't need these columns. Update the logic in predict.mildsvm() to reflect this and add a few tests to check that it works.

MI-SMM, MI-SVM methods create objects with different components

Need to unify the list components of these objects (as well as possible, ok to have some differences), and add a method parameter into the constructor. The method parameter gets used in the "predict" function to create a unified set of outputs.

Some initial steps to think about

  • Build mildsvm object first with the mild object in mind
  • Write predict.mildsvm with predict.mild in mind
  • Adjust everything that makes a mild object to be a mildsvm object
  • Incorporate the mild predict into mildsvm predict

Add a couple toy data sets

It would be ideal to have a couple of toy data sets that could simplify the examples and testing. Another perk is to standardize the data across the examples and make it easier to have meaningful toy data in them

select_cv_folds2() could be improved

Right now it doesn't provide exactly even folds when it could. This issue stems from evenly splitting the positive and negative bags separately, but could be improved
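The stratified scheme can be sketched in base R; assign_folds() below is a hypothetical re-implementation of the idea, not the package's select_cv_folds2() code:

```r
set.seed(2)
k <- 3
pos_bags <- paste0("p", 1:7)  # 7 positive bags
neg_bags <- paste0("n", 1:5)  # 5 negative bags

# Shuffle a repeated 1..k sequence so, within each stratum, fold sizes
# differ by at most one
assign_folds <- function(x, k) sample(rep_len(seq_len(k), length(x)))
pf <- assign_folds(pos_bags, k)
nf <- assign_folds(neg_bags, k)

# Stratum-wise balance holds, but the combined fold sizes can still be
# uneven -- the imbalance described above
table(c(pf, nf))
```

With 12 bags and 3 folds, exactly even folds of 4 are possible, yet splitting the strata independently can give combined sizes like 5/4/3.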

Improve efficiency of `cv_misvm()`

Currently, cv_misvm() doesn't gain any efficiency from pre-computing the kernel matrix. It would be great to add logic that helps with that; optimal would be only one kernel computation per value of cost. Note that `misvm(..., method = "heuristic")` currently does not support a pre-computed kernel and would need separate logic.

Add some scaling into feature map functions

When the dataset is unscaled and passed to misvm() directly, even with the scale = TRUE parameter, it doesn't get scaled in the kfm function. This causes a poorly scaled eigen-spectrum and makes learning almost impossible for the SVM.

One solution is to add the scaling directly into the feature map functions, which is ideal anyways.
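That solution can be sketched in base R: center and scale with the training statistics before any feature map is applied, and reuse those statistics at predict time. This is a hedged illustration, not the package's kfm code:

```r
set.seed(3)
# Unscaled training features with wildly different column scales
X <- cbind(x1 = rnorm(10, sd = 1000), x2 = rnorm(10, sd = 0.01))

# Center/scale and keep the training statistics so new data can be
# transformed identically before the feature map is applied
Xs     <- scale(X)
center <- attr(Xs, "scaled:center")
sds    <- attr(Xs, "scaled:scale")

X_new  <- cbind(x1 = rnorm(3, sd = 1000), x2 = rnorm(3, sd = 0.01))
X_news <- scale(X_new, center = center, scale = sds)

round(apply(Xs, 2, sd), 8)  # each training column now has unit sd
```

Storing center and sds inside the feature map object would let predict-time data go through the identical transformation.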

Open Source version of MIP problem?

Currently, the MIP solution for MI-SVM and MI-SMM uses the gurobi backend, which is not open source. It would be nice to additionally have an open-source MIQP solver backend to use, even if it is a bit slower.

Problem: I can't find one in R. I checked the ROI package documentation (https://epub.wu.ac.at/5858/1/ROI_StatReport.pdf), but it doesn't seem like there is an R package that solves MIQP. I can only find MILP or QP.

No option to pass kernel parameters to misvm heuristic

Currently, in the dev version there isn't a way to pass kernel parameters to misvm() when it runs in the dual. The best way to do this is unclear, besides adding them to control and ultimately passing them to e1071::svm.

Add linear kernel method in kme

It seems possible that different kernels could be useful in kme(), and the code is set up reasonably to allow for that. The big problem is that this would require a new parameter for many of the different methods, though maybe it could be pushed into the ... argument; sigma gets passed that way anyways.

I need to think about this more. All of the functions that use kme() can take a pre-computed kernel matrix, so in theory this issue isn't urgent. But it would be better to tackle it before the first release if it's going to be tackled at all.
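For intuition on the linear case: with a linear base kernel, the kernel mean embedding inner product between two instances reduces to the inner product of their sample means. A base-R sketch (not kme()'s implementation) on two synthetic instances:

```r
set.seed(4)
# Two instances, each a matrix of samples (rows) over 2 features
A <- matrix(rnorm(10), ncol = 2)  # 5 samples
B <- matrix(rnorm(6),  ncol = 2)  # 3 samples

# KME inner product with linear base kernel k(x, y) = x'y:
# average k over all sample pairs...
kme_linear <- mean(A %*% t(B))

# ...which equals the inner product of the two sample means
via_means <- sum(colMeans(A) * colMeans(B))
all.equal(kme_linear, via_means)
```

This is why a linear-kernel kme() would be cheap: no pairwise loop is needed, only the per-instance means.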
