This project is forked from blindreviewaaai18/fsbasedmlstream.


FSbasedMultiLabelStreamClassification

Max-Relevance and Min-Redundancy based Multi-label Data Stream Classification with Concept Drifting Detection

Multi-label data stream classification is a challenging and significant task, especially when handling high-dimensional data streams with concept drifts, yet this challenge has received little attention from the research community. Therefore, we propose a max-relevance and min-redundancy based algorithm adaptation approach for the efficient and effective classification of multi-label data streams with high-dimensional attributes and concept drifts. To reduce the impact of high-dimensional and noisy attributes, we first refine the minimal-redundancy-maximal-relevance criterion based on mutual information to select qualified features. Secondly, we propose a concept drifting detection approach based on the label and feature distributions to identify concept drifts hidden in multi-label data streams. Finally, we build an incremental ensemble classification model for efficiently classifying multi-label data streams. Extensive studies show that our approach obtains optimal subsets of features while maintaining good multi-label classification performance, compared to several state-of-the-art multi-label feature selection algorithms using two efficient multi-label classification methods as base classifiers.

Our Approach

Contrary to the above approaches, filter approaches are independent of any classification algorithm; they usually evaluate the usefulness of a feature, or a set of features, through measures of distance (Reyes, Morell, and Ventura 2015), dependency, information, or correlation on the data (Lin et al. 2016). Thus, the biases of learning algorithms do not influence feature selection, and such approaches have the advantage of being fast and simple to implement. However, all the aforementioned approaches are batch ones and mainly focus on improving multi-label learning accuracy; they are therefore unsuitable for handling multi-label data streams directly, due to their lower efficiency, not to mention the handling of hidden concept drifts. Therefore, in this paper we aim to design an efficient and effective feature-selection-based classification approach for multi-label data streams with concept drifts. To the best of our knowledge, this is the first feature selection based classification approach for multi-label data streams with high-dimensional features and concept drifts.

The main contributions of this paper are as follows.

First, our approach can produce higher feature selection accuracy. Building on the advantages of the AA (algorithm adaptation) multi-label learning approach and the filter approach, we aim at designing and implementing a novel extension-type filter FS approach for multi-label data stream classification. Unlike existing multi-label filter FS approaches (Lin et al. 2016), we use a sliding window to build an ensemble model incrementally to adapt to multi-label data streams, and we then analyze the generalization error of the ensemble model. Meanwhile, we extend the minimal-redundancy-maximal-relevance criterion based on mutual information for single-label classification (Peng, Long, and Ding 2005) to multi-label data classification. This is because mutual information is a submodular function, which provides a theoretical guarantee on the quality of the subset selected in the feature selection.
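As an illustration of the criterion above, the mRMR score of a candidate feature is its mutual information with the label minus its average mutual information with the already selected features. The sketch below shows this for discrete variables; the class and method names are ours, not the project's, and the project's multi-label extension in the feasel sources differs in detail.

```java
import java.util.*;

// Minimal sketch of the mRMR criterion for discrete features.
// Hypothetical helper names; not the project's actual implementation.
class MrmrSketch {

    // Empirical mutual information I(X;Y) between two discrete variables.
    static double mutualInformation(int[] x, int[] y) {
        int n = x.length;
        Map<Integer, Integer> px = new HashMap<>(), py = new HashMap<>();
        Map<Long, Integer> pxy = new HashMap<>();
        for (int i = 0; i < n; i++) {
            px.merge(x[i], 1, Integer::sum);
            py.merge(y[i], 1, Integer::sum);
            pxy.merge(((long) x[i] << 32) | (y[i] & 0xffffffffL), 1, Integer::sum);
        }
        double mi = 0.0;
        for (Map.Entry<Long, Integer> e : pxy.entrySet()) {
            int xi = (int) (e.getKey() >> 32), yi = e.getKey().intValue();
            double pJoint = e.getValue() / (double) n;
            double pX = px.get(xi) / (double) n, pY = py.get(yi) / (double) n;
            mi += pJoint * Math.log(pJoint / (pX * pY));
        }
        return mi;
    }

    // mRMR score: relevance to the label minus the average redundancy
    // with the already selected features. Greedy selection picks the
    // candidate with the highest score at each step.
    static double mrmrScore(int[] candidate, int[] label, List<int[]> selected) {
        double relevance = mutualInformation(candidate, label);
        double redundancy = 0.0;
        for (int[] s : selected) redundancy += mutualInformation(candidate, s);
        if (!selected.isEmpty()) redundancy /= selected.size();
        return relevance - redundancy;
    }
}
```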

Second, our approach can detect concept drifts hidden in multi-label data streams. To track such drifts, we propose a concept drifting detection method based on the label distribution and the feature distribution, which is capable of capturing concept drifts in multi-label data streams effectively. Contrary to classification-error based concept drifting detection methods in data stream classification, such as (Gama et al. 2014; Frias-Blanco et al. 2015), we define the difference of data distributions between two adjoining data chunks, and then detect whether a concept drift has occurred due to a change in the label distribution or the feature distribution.
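The chunk-wise detection described above can be sketched as follows. The total-variation style distance and the helper names are illustrative assumptions on our part; the project's actual similarity measure is configured by the "-simElvType" parameter (default "Jaccard"), and the threshold corresponds to "-blta" for the label distribution and "-gamma" for the feature distribution.

```java
// Hedged sketch of distribution-based drift detection between two
// adjoining data chunks; names and distance measure are illustrative.
class DriftSketch {

    // Per-label frequency vector of a chunk: labels[i][j] == 1 when
    // instance i carries label j.
    static double[] labelDistribution(int[][] labels, int labelNum) {
        double[] dist = new double[labelNum];
        for (int[] row : labels)
            for (int j = 0; j < labelNum; j++) dist[j] += row[j];
        for (int j = 0; j < labelNum; j++) dist[j] /= labels.length;
        return dist;
    }

    // Total-variation style difference between two chunk distributions.
    static double distributionDistance(double[] p, double[] q) {
        double d = 0.0;
        for (int j = 0; j < p.length; j++) d += Math.abs(p[j] - q[j]);
        return d / 2.0;
    }

    // Drift is signalled when the difference between the previous and
    // current chunk exceeds the threshold (the "-blta" parameter).
    static boolean driftDetected(double[] prev, double[] cur, double beta) {
        return distributionDistance(prev, cur) > beta;
    }
}
```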

Finally, our approach performs efficiently in the handling of multi-label data streams. The model used here is incremental, and its time cost depends on the size of a data chunk, while the time costs of the aforementioned multi-label FS approaches depend on the size of the whole multi-label data set, or even on its square. Thus, our approach is more efficient and scalable.
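A minimal sketch of the incremental ensemble idea, assuming one model is trained per data chunk and the oldest model is evicted once the ensemble reaches "-modelSize" models; the class and method names here are ours, not the project's.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding-window ensemble sketch: per-chunk updates keep the time cost
// proportional to the chunk size, not the whole stream.
class EnsembleSketch<M> {
    private final Deque<M> models = new ArrayDeque<>();
    private final int modelSize; // the "-modelSize" parameter

    EnsembleSketch(int modelSize) { this.modelSize = modelSize; }

    // Add a model trained on the latest chunk, evicting the oldest
    // model once the ensemble is full.
    void addModel(M trainedOnChunk) {
        if (models.size() == modelSize) models.removeFirst();
        models.addLast(trainedOnChunk);
    }

    int size() { return models.size(); }
}
```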

Data Set

Benchmark data sets: In our experiments, we select six large-scale benchmark multi-label databases from different application domains to simulate multi-label data streams. Details of these data sets are listed in Table 1, where Label Cardinality is the average number of labels per instance in a database, and Label Density is that average divided by the label count L.
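The two statistics in Table 1 can be computed as follows; this is a small illustrative sketch, not project code.

```java
// Label cardinality: average number of labels per instance.
// Label density: cardinality divided by the label count L.
class LabelStats {

    // labels[i][j] == 1 when instance i carries label j.
    static double cardinality(int[][] labels) {
        double sum = 0.0;
        for (int[] row : labels)
            for (int v : row) sum += v;
        return sum / labels.length;
    }

    static double density(int[][] labels, int labelCount) {
        return cardinality(labels) / labelCount;
    }
}
```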

Experiment Results

Table 1 lists the benchmark data sets used in our experiments; they can be downloaded from the file list.

Table 1: DATA SETS USED IN THE EXPERIMENTS

| Dataset | Domain | Train Instances | Test Instances | Discrete Attrs. | Numerical Attrs. | Labels | Label Cardinality | Label Density |
|---|---|---|---|---|---|---|---|---|
| Mediamill | video | 30993 | 12914 | 0 | 120 | 101 | 4.376 | 0.043 |
| IMDB-ECC-F | movie | 76143 | 19281 | 1001 | 0 | 28 | 1.920 | 0.036 |
| Corel16k010 | images | 13618 | 6660 | 500 | 0 | 144 | 2.834 | 0.017 |
| NUS-WIDE | images | 161789 | 107859 | 0 | 500 | 81 | 1.869 | 0.023 |
| EUR-Lex (subject matters) | text | 17414 | 1935 | 0 | 5000 | 412 | 2.213 | 0.011 |
| bookmarks | text | 70045 | 17811 | 2150 | 0 | 208 | 2.028 | 0.010 |

You can download the source code of the whole project here: Project Download

The base classifiers used in our approach are MLKNN (a KNN based multi-label classification method) and MLRDT (a Random Decision Tree based multi-label classification method). The source code of MLKNN comes from Mulan, an open-source Java library for learning from multi-label data. The source code of MLRDT comes from the open-source Dice project.

Our project is built on the mulan project. The source code of our approach includes the feature selection for multi-label data streams in the feasel zip file, and the ML_MRMR_FSClassification java file.

Parameter Description

/****** Parameter Description ***********/
"-alph": the threshold used to select an optimal subset in the mRMR based feature selection; default alph = 0.2;
"-blta": the threshold used in the drift detection based on the label distribution; default blta = 0.2;
"-gamma": the threshold used in the drift detection based on the feature distribution; default gamma = 0.2;
"-dataBlock": the size of a data chunk; default dataBlock = 200;
"-modelSize": the number of models in the ensemble; default modelSize = 100;
"-path": the file directory;
"-arff": the source (training) file;
"-test": the testing file;
"-attrSize": the size of the feature space, namely the attribute count plus the label count;
"-labelNum": the label count;
"-simElvType": the type of similarity evaluation; default value "Jaccard";
"-algType": the type of algorithm; default value "MLKNN"; it is ignored if MLRDT is selected as the base classifier;
"-bDiscretized": the flag for discretization; default "false";
"-bAvgVoting": the flag for voting; default "true";
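For illustration, the "-flag value" pairs above can be read into a parameter map with defaults; this is only a sketch of the convention, and the project's actual InitComParms implementation may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative parser for "-name value" command parameter pairs.
class ParamSketch {

    // Start from the defaults and overwrite with any flags supplied.
    static Map<String, String> parse(String[] args, Map<String, String> defaults) {
        Map<String, String> params = new HashMap<>(defaults);
        for (int i = 0; i + 1 < args.length; i += 2) {
            params.put(args[i], args[i + 1]); // flags come in "-name value" pairs
        }
        return params;
    }
}
```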

How to install our approach

Please decompress the feasel zip file and put the resulting folder under the "src" directory of the mulan project. In our project, we put the file ML_MRMR_FSClassification.java in the folder "/src/mulan/examples"; it contains the main function. You can use the following demos to run our approach.

Demo: how to run our approach using MLRDT as the base classifier; in this case, we select the Corel16k010 data set as the demo data set.

public static void main(String[] args) throws Exception {
	/*********Classify by MLRDT after ML-MRMR-Feature selection**************/
	String[] comParms = {"-alph", "0.2", "-blta", "0.2", "-gamma", "0.2", "-dataBlock", "200", "-modelSize", "100"};
	ML_MRMR_FSClassification mcf = new ML_MRMR_FSClassification();
	mcf.InitComParms(comParms);
	String[] options = {"-path","H:/data/Corel16k010","-train","Corel16k010-train.arff-sort.arff","-test", "Corel16k010-test.arff","-xml","Corel16k010.xml", "-attrSize","644", "-labelNum","144", "-minS", "4", "-treeNum", "10", "-simElvType", "Jaccard", "-bDiscretized", "false", "-bAvgVoting", "true"};
	mcf.ML_MRMR_FS_ClassifyByMLRDT(options);
}

Demo: how to run our approach using MLKNN as the base classifier.

public static void main(String[] args) throws Exception {
	/********* Classify by MLKNN (via mulan) after ML-MRMR feature selection **********/
	String[] comParms = {"-alph", "0.2", "-blta", "0.2", "-gamma", "0.2", "-dataBlock", "200", "-modelSize", "100"};
	ML_MRMR_FSClassification mcf = new ML_MRMR_FSClassification();
	mcf.InitComParms(comParms);
	String[] options = {"-path", "H:/data/Corel16k010", "-train", "Corel16k010-train.arff-sort.arff", "-test", "Corel16k010-test.arff", "-xml", "Corel16k010.xml", "-attrSize", "644", "-labelNum", "144", "-simElvType", "Jaccard", "-algType", "MLKNN", "-bDiscretized", "false", "-bAvgVoting", "true"};
	mcf.ML_MRMR_FS_ClassifyByMulan(options);
}
