
Sequential Minimal Optimization

The main purpose of this project is to analyse the optimization methods currently employed to train support vector machines (SVMs).

The scope of the project

The first part of the project consisted in studying the following three papers:

  1. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Platt 1998;
  2. Making Large-Scale SVM Learning Practical, Joachims 1998;
  3. Working Set Selection Using Second Order Information for Training Support Vector Machines, Fan, Chen, Lin 2005;

The first article analyses the Sequential Minimal Optimization (SMO) method proposed by Platt.
In the second and third, the authors propose two different improvements of SMO that employ, respectively, a 1st order and a 2nd order method to select the Lagrange multipliers (LMs) that compose the working set.
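
To make the step that these methods optimize concrete, here is a minimal sketch of the analytic two-multiplier update at the heart of Platt's SMO; the variable names are illustrative (not taken from this repository) and only the usual positive-curvature case (eta > 0) is handled.

```matlab
% Sketch of the analytic optimization of two Lagrange multipliers (Platt, 1998).
% Inputs: current multipliers a1, a2; labels y1, y2; prediction errors E1, E2;
% kernel values k11, k12, k22; box constraint C. Assumes eta > 0.
function [a1new, a2new] = smoStep(a1, a2, y1, y2, E1, E2, k11, k12, k22, C)
    if y1 ~= y2                              % feasible segment for a2
        L = max(0, a2 - a1);   H = min(C, C + a2 - a1);
    else
        L = max(0, a1 + a2 - C);   H = min(C, a1 + a2);
    end
    eta = k11 + k22 - 2*k12;                 % curvature along the constraint line
    a2new = a2 + y2*(E1 - E2)/eta;           % unconstrained minimizer
    a2new = min(max(a2new, L), H);           % clip to [L, H]
    a1new = a1 + y1*y2*(a2 - a2new);         % keep y1*a1 + y2*a2 constant
end
```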

Implementation

In the second stage of the project, the algorithms were implemented in Matlab.
For the sake of completeness, in addition to the Sequential Minimal Optimization methods proposed in the papers above, two more versions of SMO have been implemented, in particular:

  1. "Improvements to Platt's SMO algorithm for SVM Classifier Design", Keerthi 2001;

  2. The original version proposed by Platt, without an auxiliary cache to store the prediction errors;

As in the implementation proposed by Joachims, Keerthi employs a 1st order method to select the LMs; however, the latter avoids using an external quadratic solver for the optimization steps.
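
As an illustration of Keerthi's modification, the following sketch shows the dual-threshold optimality check that replaces Platt's single bias during training; the variable names are hypothetical, with F(i) = sum_j alpha(j)*y(j)*K(i,j) - y(i).

```matlab
% Sketch of the dual-threshold stopping criterion of Keerthi et al. (2001).
% alpha, y, F, C, and tol are illustrative names, not taken from the repository.
Iup  = (alpha < C & y ==  1) | (alpha > 0 & y == -1);   % indices defining b_up
Ilow = (alpha < C & y == -1) | (alpha > 0 & y ==  1);   % indices defining b_low
b_up  = min(F(Iup));
b_low = max(F(Ilow));
converged = (b_low <= b_up + 2*tol);   % stop when no pair violates the KKT conditions
```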

Although it might be interesting to further analyse both:

  • The impact of the error cache in the SMO version proposed by Platt;
  • The performance of Keerthi's SMO with respect to Joachims' SMO with a working set of size 2;

these aspects are not investigated in the continuation of the project.

The folder Algorithms collects all the implementations cited above.

Testing and analysis

The third part of the project is the test and analysis stage. This phase is divided into two sub-steps:

  1. A first set of tests is focused on three different (bi-dimensional) datasets that were artificially generated. The main goal of this first test phase is to assess the correctness of the implementations. Indeed, testing the algorithms against these small, known datasets made it easy to detect errors and imprecisions in the implementations.

  2. A second series of tests is focused on five datasets of more substantial size and with more than two features. The aim of this second set of tests is to compare how a 1st order SMO and a 2nd order SMO perform in the training phase.

To develop the 2nd point, the training phase is repeated using 3 different Sequential Minimal Optimization implementations (the second order selection rule used by the last one is sketched after the list):

  • Joachims' version with Working Set Size (WSS) equal to 4;
  • Joachims' version with WSS equal to 6;
  • Fan, Chen, and Lin's implementation.
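
The following is a minimal sketch of that second order working set selection (WSS 2); the names G (gradient of the dual objective), K (kernel matrix), and tau (a small positive constant) are illustrative, not taken from the repository.

```matlab
% Sketch of second order working set selection (Fan, Chen, Lin 2005).
function [i, j] = wss2(alpha, y, G, K, C, tau)
    Iup  = find((alpha < C & y ==  1) | (alpha > 0 & y == -1));
    Ilow = find((alpha < C & y == -1) | (alpha > 0 & y ==  1));

    [~, p] = max(-y(Iup) .* G(Iup));    % i: maximal first order KKT violation
    i = Iup(p);

    best = inf;  j = -1;
    for t = Ilow(:)'                    % j: maximal estimated objective decrease
        b = -y(i)*G(i) + y(t)*G(t);
        if b > 0                        % t must violate optimality together with i
            a = K(i,i) + K(t,t) - 2*K(i,t);
            if a <= 0, a = tau; end
            if -b^2/a < best
                best = -b^2/a;  j = t;
            end
        end
    end
end
```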

The statistics collected in the second series of tests were used to gain a better understanding of how these different training techniques behave.

First series of tests

In this phase the algorithms were tested on both perfectly separable and non-perfectly separable data.
A sample of the artificial datasets used is shown in the images below:

[Perfectly separable data]  [Non-perfectly separable data]

The results obtained by training the algorithms on these artificial datasets confirm the correctness of the implementations.
To check the results, another model has been trained using the Matlab function fitcsvm. To be coherent, all the relevant parameters (training algorithm, tolerance, maxIter, BoxConstraint, sigma) were set to be consistent with those used by the implementations under testing. The model generated is then used as a baseline to check whether the results of the other models are consistent with it.
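
A hedged sketch of that baseline check, assuming a Gaussian kernel and standard fitcsvm name-value pairs; Xtrain, ytrain, C, sigma, tol, and maxIter are placeholders for the values actually used in the tests.

```matlab
% Train a reference model with Matlab's fitcsvm using parameters consistent
% with those passed to the hand-written SMO implementations (illustrative values).
baseline = fitcsvm(Xtrain, ytrain, ...
    'KernelFunction', 'rbf', ...      % Gaussian kernel
    'KernelScale',    sigma, ...
    'BoxConstraint',  C, ...
    'Solver',         'SMO', ...
    'KKTTolerance',   tol, ...
    'IterationLimit', maxIter);

% Compare predictions and support vectors against those of the tested implementations.
yBaseline = predict(baseline, Xtrain);
```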

Furthermore, different statistics are collected during the training phase. Among these:

  • Number of iterations performed;
  • Training and prediction time;
  • Average number of kernel evaluations per iteration;
  • Number of support vectors generated;

Since these tests focus on the overall correctness of the algorithms, the results obtained are not shown here. The interested reader can find all the results and statistics concerning this collection of tests in the folder TEST RESULT.

Second series of tests

The following statistics briefly describe the datasets used during this phase:

  • diabetes (8 features, 576 instances)
  • ringnorm (20 features, 6500 instances)
  • magic (10 features, 17118 instances)
  • a9a (124 features, 12682 instances)
  • codrna (8 features, 59535 instances)

The complete datasets used are available in the Dataset folder.

To analyse the performance of the 1st and 2nd order Sequential Minimal Optimization techniques, the methods under investigation have been tested under a large number of training settings (different datasets as well as different training parameters) in order to gain a better overview of their behavior. To do this, I repeated a grid search over the parameters C and sigma (the Gaussian kernel width) for each implementation and dataset.
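
A rough sketch of that loop, with hypothetical grids and hypothetical trainSMO / evaluateModel helpers standing in for the actual implementations:

```matlab
% Grid search over C and sigma, repeated for each SMO variant and each dataset
% (illustrative grids and helper functions).
Cgrid     = 2.^(-5:2:15);
sigmaGrid = 2.^(-4:2:4);
results   = cell(numel(Cgrid), numel(sigmaGrid));

for ic = 1:numel(Cgrid)
    for is = 1:numel(sigmaGrid)
        model = trainSMO(Xtrain, ytrain, Cgrid(ic), sigmaGrid(is));   % one SMO variant
        results{ic, is} = evaluateModel(model, Xtest, ytest);         % iterations, time, accuracy, ...
    end
end
```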

As for the previous series of tests, I collected several statistics that are available in the folder TEST RESULT. Some of these are:

  • Total number of iterations
  • Training time
  • Total number of kernel evaluations
  • Number of support vectors generated
  • Several metrics about the performance of the models generated (computed as sketched after this list)
    • Accuracy
    • Sensitivity
    • Specificity
    • Precision
    • Recall
    • F_measure
    • G_mean
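
As a reference for how these metrics relate to each other, here is an illustrative computation from binary (+1/-1) predictions; yTrue and yPred are hypothetical names, not taken from the repository.

```matlab
% Confusion-matrix counts and derived metrics for a binary classifier.
TP = sum(yPred ==  1 & yTrue ==  1);
TN = sum(yPred == -1 & yTrue == -1);
FP = sum(yPred ==  1 & yTrue == -1);
FN = sum(yPred == -1 & yTrue ==  1);

accuracy    = (TP + TN) / (TP + TN + FP + FN);
sensitivity = TP / (TP + FN);                        % equal to the recall
specificity = TN / (TN + FP);
precision   = TP / (TP + FP);
recall      = sensitivity;
F_measure   = 2 * precision * recall / (precision + recall);
G_mean      = sqrt(sensitivity * specificity);
```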

These statistics are then analysed to show how the optimization methods respond to changes in the specified parameters.

Before showing the results of the analysis, two considerations have to be taken into account:

  1. Based on the size of the dataset under consideration, a different number of parameter combinations is analysed during grid search. Note that for the last dataset (given its high number of instances) only the C parameter is investigated.

  2. To obtain statistics on the performance of the models generated, these have been tested against the test set. Validating a model on the test set should normally be considered bad practice; however, the main purpose of this project is not to generate the best possible model for a given set of data, but to analyse the performance of the different implementations, and these are not affected by the way the models are validated.
    On the other hand, this choice gave an unbiased way to evaluate the models generated without the burden of repeating the training procedure to perform cross-validation, which significantly shortened the overall training time.


The following pictures collect the results obtained during this series of tests.
For each dataset two different kinds of graphs are presented:

  • A 3D plot that for each method shows:

    • Number of iterations;
    • Accuracy;
    • Number of support vectors generated;
  • A set of plots that compare how the implementations perform as C and sigma change.

DIABETES

[3D plot: Fan, Chen, and Lin]

[3D plot: Joachims with working set size 4]

[3D plot: Joachims with working set size 6]

[Comparison plots for C = 2^-5, 2^-3, 2, 2^7, 2^9, 2^15]

RINGNORM

[3D plot: Fan, Chen, and Lin]

[3D plot: Joachims with working set size 4]

[3D plot: Joachims with working set size 6]

[Comparison plots for C = 2^-5, 2^-3, 2, 2^7, 2^9]

MAGIC

[3D plot: Fan, Chen, and Lin]

[3D plot: Joachims with working set size 4]

[3D plot: Joachims with working set size 6]

[Comparison plots for C = 2^-3, 2, 2^9]

A9A

[Comparison plots for C = 2^-3, 2, 2^7]

CODRNA

Due to the high number of instances that compose this dataset, it would be too expensive to perform a complete grid search (over both C and sigma) as done for the previous datasets. As a consequence, a fixed sigma is heuristically selected (using the Jaakkola heuristic) and the grid search is performed only over the C parameter.
The selected sigma is equal to 3.2497.
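
A sketch of the Jaakkola heuristic as used here: for every training point take the distance to the closest point of the opposite class, then use the median of those distances as sigma. Variable names are illustrative, and pdist2 requires the Statistics and Machine Learning Toolbox.

```matlab
% Jaakkola heuristic for the Gaussian kernel width (illustrative sketch).
pos  = X(y ==  1, :);
neg  = X(y == -1, :);
dPos = min(pdist2(pos, neg), [], 2);   % nearest opposite-class distance for each positive
dNeg = min(pdist2(neg, pos), [], 2);   % nearest opposite-class distance for each negative
sigma = median([dPos; dNeg]);
```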

[Comparison plot for sigma = 3.2497]

Conclusion

Thanks to the data collected during the training phases, we can infer interesting characteristics of the behavior of the Sequential Minimal Optimization algorithm when combined with first or second order working set selection methods.

First of all, note that the datasets analysed have different characteristics in terms of number of instances and features. They are reported below in increasing order of complexity (number of features and instances):

  • diabetes (8 features, 576 instances)
  • ringnorm (20 features, 6500 instances)
  • magic (10 features, 17118 instances)
  • a9a (124 features, 12682 instances)
  • codrna (8 features, 59535 instances)

The data collected highlight how the type of dataset on which the training is carried out has a strong impact on the performance of the Sequential Minimal Optimization algorithm.

Indeed, we can see that, for datasets composed of a relatively low number of instances or features, the first order method achieves better training times than the second order method. This is visible for the following three datasets:

  • diabetes (8 features, 576 instances)
  • ringnorm (20 features, 6500 instances)
  • magic (10 features, 17118 instances)

It is important to underline that the Matlab implementations of the algorithms derive exclusively from the pseudocode presented in the papers. Because of this, the implementations do not include any optimization to speed up the process, and the training times obtained may be affected by this.

Analysing the performance obtained in the remaining two datasets:

  • a9a (124 features, 12682 instances)
  • codrna (8 features, 59535 instances)

we can see that, for datasets characterized by a greater number of instances and features, the second order method gives better performance, as expected.


To conclude, the data collected throughout this project partly confirm the thesis that the Sequential Minimal Optimization algorithm, equipped with a second order working set selection, generally performs better than with a first order approach.

At the same time, we must note that the performance gap between the two methods widens for datasets characterized by a high number of instances and features.

Bibliography

  1. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Platt, 1998
  2. Making Large-Scale SVM Learning Practical, Joachims, 1998
  3. Working Set Selection Using Second Order Information for Training Support Vector Machines, Fan, Chen, Lin, 2005
  4. Improvements to Platt's SMO Algorithm for SVM Classifier Design, Keerthi et al., 2001
  5. On the Convergence of the Decomposition Method for Support Vector Machines, Chih-Jen Lin
  6. SVMlight library documentation
  7. Data Mining and Analysis: Fundamental Concepts and Algorithms, Mohammed J. Zaki, Wagner Meira
  8. A Practical Guide to Support Vector Classification, Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin

Author

Giovanni Vignocchi

This project was carried out in the context of the Nonlinear Optimization course I attended at Politecnico di Milano.
