This project is forked from blindreviewaaai18/fsbasedmlstream.


FSbasedMultiLabelStreamClassification

Max-Relevance and Min-Redundancy based Multi-label Data Stream Classification with Concept Drifting Detection

Multi-label data stream classification is a challenging and significant task, especially when handling high-dimensional data streams with concept drifts, yet this challenge has received little attention from the research community. Therefore, we propose a max-relevance and min-redundancy based algorithm adaptation approach for the efficient and effective classification of multi-label data streams with high-dimensional attributes and concept drifts. To reduce the impact of high-dimensional and noisy attributes, we first refine the minimal-redundancy-maximal-relevance criterion based on mutual information to select qualified features. Secondly, we propose a concept drifting detection approach based on the label and feature distributions to identify concept drifts hidden in multi-label data streams. Finally, we build an incremental ensemble classification model for efficiently classifying multi-label data streams. Extensive studies show that our approach obtains optimal subsets of features while maintaining good multi-label classification performance, compared to several state-of-the-art multi-label feature selection algorithms using two efficient multi-label classification methods as base classifiers.

Our Approach

Contrary to the above approaches, filter approaches are independent of any classification algorithm; they usually evaluate the usefulness of a feature, or a set of features, through measures of distance (Reyes, Morell, and Ventura 2015), dependency, information, or correlation on the data (Lin et al. 2016). Thus, the biases of learning algorithms do not influence feature selection, and such approaches have the advantage of being fast and simple to implement. However, all the aforementioned approaches are batch ones and mainly focus on improving multi-label learning accuracy; they are therefore unsuitable for handling multi-label data streams directly, due to their lower efficiency, not to mention the handling of hidden concept drifts. Therefore, in this paper we aim to design an efficient and effective feature-selection-based classification approach for multi-label data streams with concept drifts. To the best of our knowledge, this is the first feature selection based classification approach for multi-label data streams with high-dimensional features and concept drifts.

The main contributions of this paper are as follows.

First, our approach can produce higher feature selection accuracy. Building on the advantages of the AA (algorithm adaptation) multi-label learning approach and the filter approach, we aim at designing and implementing a novel extension-type filter FS approach for multi-label data stream classification. Unlike existing multi-label filter FS approaches (Lin et al. 2016), we use a sliding window to build an ensemble model incrementally to adapt to multi-label data streams, and we then analyze the generalization error of the ensemble model. Meanwhile, we extend the minimal-redundancy-maximal-relevance criterion based on mutual information for single-label classification (Peng, Long, and Ding 2005) to multi-label data classification. This is because mutual information is a submodular function, which provides a theoretical guarantee on the quality of the subset selected in the feature selection.
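As an illustration of the criterion above, the mRMR score of a candidate feature is its mutual information with the label minus its average mutual information with the already selected features. The sketch below shows this for discrete variables; the class and method names are ours, not the project's, and the project's multi-label extension in the feasel sources differs in detail.

```java
import java.util.*;

// Minimal sketch of the mRMR criterion for discrete features.
// Hypothetical helper names; not the project's actual implementation.
class MrmrSketch {

    // Empirical mutual information I(X;Y) between two discrete variables.
    static double mutualInformation(int[] x, int[] y) {
        int n = x.length;
        Map<Integer, Integer> px = new HashMap<>(), py = new HashMap<>();
        Map<Long, Integer> pxy = new HashMap<>();
        for (int i = 0; i < n; i++) {
            px.merge(x[i], 1, Integer::sum);
            py.merge(y[i], 1, Integer::sum);
            pxy.merge(((long) x[i] << 32) | (y[i] & 0xffffffffL), 1, Integer::sum);
        }
        double mi = 0.0;
        for (Map.Entry<Long, Integer> e : pxy.entrySet()) {
            int xi = (int) (e.getKey() >> 32), yi = e.getKey().intValue();
            double pJoint = e.getValue() / (double) n;
            double pX = px.get(xi) / (double) n, pY = py.get(yi) / (double) n;
            mi += pJoint * Math.log(pJoint / (pX * pY));
        }
        return mi;
    }

    // mRMR score: relevance to the label minus the average redundancy
    // with the already selected features. Greedy selection picks the
    // candidate with the highest score at each step.
    static double mrmrScore(int[] candidate, int[] label, List<int[]> selected) {
        double relevance = mutualInformation(candidate, label);
        double redundancy = 0.0;
        for (int[] s : selected) redundancy += mutualInformation(candidate, s);
        if (!selected.isEmpty()) redundancy /= selected.size();
        return relevance - redundancy;
    }
}
```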

Second, our approach can detect concept drifts hidden in multi-label data streams. To track such drifts, we propose a concept drifting detection method based on the label distribution and the feature distribution, which is capable of capturing concept drifts in multi-label data streams effectively. Contrary to classification-error based concept drifting detection methods in data stream classification, such as (Gama et al. 2014; Frias-Blanco et al. 2015), we define the difference of data distributions between two adjoining data chunks, and then detect whether a concept drift has occurred due to a change in the label distribution or the feature distribution.
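The chunk-wise detection described above can be sketched as follows. The total-variation style distance and the helper names are illustrative assumptions on our part; the project's actual similarity measure is configured by the "-simElvType" parameter (default "Jaccard"), and the threshold corresponds to "-blta" for the label distribution and "-gamma" for the feature distribution.

```java
// Hedged sketch of distribution-based drift detection between two
// adjoining data chunks; names and distance measure are illustrative.
class DriftSketch {

    // Per-label frequency vector of a chunk: labels[i][j] == 1 when
    // instance i carries label j.
    static double[] labelDistribution(int[][] labels, int labelNum) {
        double[] dist = new double[labelNum];
        for (int[] row : labels)
            for (int j = 0; j < labelNum; j++) dist[j] += row[j];
        for (int j = 0; j < labelNum; j++) dist[j] /= labels.length;
        return dist;
    }

    // Total-variation style difference between two chunk distributions.
    static double distributionDistance(double[] p, double[] q) {
        double d = 0.0;
        for (int j = 0; j < p.length; j++) d += Math.abs(p[j] - q[j]);
        return d / 2.0;
    }

    // Drift is signalled when the difference between the previous and
    // current chunk exceeds the threshold (the "-blta" parameter).
    static boolean driftDetected(double[] prev, double[] cur, double beta) {
        return distributionDistance(prev, cur) > beta;
    }
}
```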

Finally, our approach performs efficiently in the handling of multi-label data streams. The model used here is incremental, and its time cost depends on the size of a data chunk, while the time costs of the aforementioned multi-label FS approaches depend on the size of the whole multi-label data set, or even on its square. Thus, our approach is more efficient and scalable.
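A minimal sketch of the incremental ensemble idea, assuming one model is trained per data chunk and the oldest model is evicted once the ensemble reaches "-modelSize" models; the class and method names here are ours, not the project's.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding-window ensemble sketch: per-chunk updates keep the time cost
// proportional to the chunk size, not the whole stream.
class EnsembleSketch<M> {
    private final Deque<M> models = new ArrayDeque<>();
    private final int modelSize; // the "-modelSize" parameter

    EnsembleSketch(int modelSize) { this.modelSize = modelSize; }

    // Add a model trained on the latest chunk, evicting the oldest
    // model once the ensemble is full.
    void addModel(M trainedOnChunk) {
        if (models.size() == modelSize) models.removeFirst();
        models.addLast(trainedOnChunk);
    }

    int size() { return models.size(); }
}
```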

Data Set

Benchmark data sets: In our experiments, we select six large-scale benchmark multi-label databases from different application domains to simulate multi-label data streams. Details of these data sets are listed in Table 1, where Label Cardinality is the average number of labels per instance in a database, and Label Density is that average divided by the label count L.
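The two statistics in Table 1 can be computed as follows; this is a small illustrative sketch, not project code.

```java
// Label cardinality: average number of labels per instance.
// Label density: cardinality divided by the label count L.
class LabelStats {

    // labels[i][j] == 1 when instance i carries label j.
    static double cardinality(int[][] labels) {
        double sum = 0.0;
        for (int[] row : labels)
            for (int v : row) sum += v;
        return sum / labels.length;
    }

    static double density(int[][] labels, int labelCount) {
        return cardinality(labels) / labelCount;
    }
}
```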

Experiment Results

Table 1 lists the benchmark data sets used in our experiments; they can be downloaded from the file list.

Table 1: DATA SETS USED IN THE EXPERIMENTS

| Dataset | Domain | Train Instances | Test Instances | Discrete Attrs. | Numerical Attrs. | Labels | Label Cardinality | Label Density |
|---|---|---|---|---|---|---|---|---|
| Mediamill | video | 30993 | 12914 | 0 | 120 | 101 | 4.376 | 0.043 |
| IMDB-ECC-F | movie | 76143 | 19281 | 1001 | 0 | 28 | 1.920 | 0.036 |
| Corel16k010 | images | 13618 | 6660 | 500 | 0 | 144 | 2.834 | 0.017 |
| NUS-WIDE | images | 161789 | 107859 | 0 | 500 | 81 | 1.869 | 0.023 |
| EUR-Lex (subject matters) | text | 17414 | 1935 | 0 | 5000 | 412 | 2.213 | 0.011 |
| bookmarks | text | 70045 | 17811 | 2150 | 0 | 208 | 2.028 | 0.010 |

You can download the source code of the whole project here: Project Download

The base classifiers used in our approach are MLKNN (a KNN based multi-label classification method) and MLRDT (a Random Decision Tree based multi-label classification method). The source code of MLKNN comes from Mulan, an open-source Java library for learning from multi-label data. The source code of MLRDT comes from the open-source Dice project.

Our project is built on the mulan project. The source code of our approach includes the feature selection for multi-label data streams in the feasel zip file, and the ML_MRMR_FSClassification java file.

Parameter Description

/****** Parameter Description ***********/
"-alph": the threshold used to select an optimal subset in the mRMR based feature selection; default alph = 0.2;
"-blta": the threshold used in the drift detection based on the label distribution; default blta = 0.2;
"-gamma": the threshold used in the drift detection based on the feature distribution; default gamma = 0.2;
"-dataBlock": the size of a data chunk; default dataBlock = 200;
"-modelSize": the number of models in the ensemble; default modelSize = 100;
"-path": the file directory;
"-arff": the source (training) file;
"-test": the testing file;
"-attrSize": the size of the feature space, namely the attribute count plus the label count;
"-labelNum": the label count;
"-simElvType": the type of similarity evaluation; default value "Jaccard";
"-algType": the type of algorithm; default value "MLKNN"; it is ignored if MLRDT is selected as the base classifier;
"-bDiscretized": the flag for discretization; default "false";
"-bAvgVoting": the flag for voting; default "true";
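For illustration, the "-flag value" pairs above can be read into a parameter map with defaults; this is only a sketch of the convention, and the project's actual InitComParms implementation may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative parser for "-name value" command parameter pairs.
class ParamSketch {

    // Start from the defaults and overwrite with any flags supplied.
    static Map<String, String> parse(String[] args, Map<String, String> defaults) {
        Map<String, String> params = new HashMap<>(defaults);
        for (int i = 0; i + 1 < args.length; i += 2) {
            params.put(args[i], args[i + 1]); // flags come in "-name value" pairs
        }
        return params;
    }
}
```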

How to install our approach

Please decompress the feasel zip file and put the resulting folder under the "src" directory of the mulan project. In our project, we put the file ML_MRMR_FSClassification.java in the folder "/src/mulan/examples"; it contains the main function. You can use the following demos to run our approach.

Demo: how to run our approach using MLRDT as the base classifier; in this case, we select the Corel16k010 data set as the demo data set.

public static void main(String[] args) throws Exception {
	/*********Classify by MLRDT after ML-MRMR-Feature selection**************/
	String[] comParms = {"-alph", "0.2", "-blta", "0.2", "-gamma", "0.2", "-dataBlock", "200", "-modelSize", "100"};
	ML_MRMR_FSClassification mcf = new ML_MRMR_FSClassification();
	mcf.InitComParms(comParms);
	String[] options = {"-path","H:/data/Corel16k010","-train","Corel16k010-train.arff-sort.arff","-test", "Corel16k010-test.arff","-xml","Corel16k010.xml", "-attrSize","644", "-labelNum","144", "-minS", "4", "-treeNum", "10", "-simElvType", "Jaccard", "-bDiscretized", "false", "-bAvgVoting", "true"};
	mcf.ML_MRMR_FS_ClassifyByMLRDT(options);
}

Demo: how to run our approach using MLKNN as the base classifier.

public static void main(String[] args) throws Exception {
	/********* Classify by MLKNN (via mulan) after ML-MRMR feature selection **********/
	String[] comParms = {"-alph", "0.2", "-blta", "0.2", "-gamma", "0.2", "-dataBlock", "200", "-modelSize", "100"};
	ML_MRMR_FSClassification mcf = new ML_MRMR_FSClassification();
	mcf.InitComParms(comParms);
	String[] options = {"-path", "H:/data/Corel16k010", "-train", "Corel16k010-train.arff-sort.arff", "-test", "Corel16k010-test.arff", "-xml", "Corel16k010.xml", "-attrSize", "644", "-labelNum", "144", "-simElvType", "Jaccard", "-algType", "MLKNN", "-bDiscretized", "false", "-bAvgVoting", "true"};
	mcf.ML_MRMR_FS_ClassifyByMulan(options);
}
