MALIMG2022: DATA AUGMENTATION AND TRANSFER LEARNING TO SOLVE IMBALANCED TRAINING DATA FOR MALWARE CLASSIFICATION
The datasets are available at: https://www.kaggle.com/ikrambenabd/datasets
Data augmentation creates new images by transforming existing ones. It is used to address imbalanced image classification problems in many domains, typically when no more data can be collected for underrepresented classes. Augmentation techniques therefore increase the size of the training data so that the classifier is not biased toward the majority classes. This paper’s main contribution is a balancing tool for any imbalanced multiclass database. We then apply this approach to the Malimg database, aiming to solve its imbalanced data problem effectively and quickly. As a result, we generated two versions of the Malimg database, together named Malimg2022 (Large and XXLarge). These new versions are balanced: every class has the same number of samples, obtained through data augmentation with different transformation techniques. From a technical point of view, zero-day malware is nothing other than older malware with a few modifications, so data augmentation can be seen as a simulation of new malware variants that should be detected effectively. Finally, the new balanced data were evaluated using transfer learning.
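The balancing idea described above can be illustrated with a minimal sketch. This is not the paper's tool: the function names (`balance_by_augmentation`, `flip_horizontal`, `flip_vertical`, `rotate_180`), the class labels `family_a` and `family_b`, and the choice of flips and rotation as transformations are all assumptions made for illustration, and nested lists stand in for grayscale malware images. Each underrepresented class is topped up with transformed copies of its own samples until it matches the largest class.

```python
import random


def flip_horizontal(img):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Reverse the row order (top-bottom flip)."""
    return [row[:] for row in reversed(img)]


def rotate_180(img):
    """Rotate by 180 degrees (both flips combined)."""
    return [row[::-1] for row in reversed(img)]


def balance_by_augmentation(dataset, transforms, target=None, seed=0):
    """Return a copy of `dataset` in which every class has at least `target` samples.

    `dataset` maps a class label to a list of images. `target` defaults to the
    size of the largest class, so that class is left unchanged. Classes already
    at or above `target` are not reduced.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    if target is None:
        target = max(len(samples) for samples in dataset.values())

    balanced = {}
    for label, samples in dataset.items():
        if not samples:
            raise ValueError(f"class {label!r} has no samples to augment")
        augmented = list(samples)
        while len(augmented) < target:
            source = rng.choice(samples)        # pick an original image
            transform = rng.choice(transforms)  # pick a transformation
            augmented.append(transform(source))
        balanced[label] = augmented
    return balanced


# Toy imbalanced dataset: hypothetical labels, 2x2 "images".
toy = {
    "family_a": [[[i, i + 1], [i + 2, i + 3]] for i in range(5)],
    "family_b": [[[9, 8], [7, 6]]],
}
balanced = balance_by_augmentation(toy, [flip_horizontal, flip_vertical, rotate_180])
print({label: len(samples) for label, samples in balanced.items()})
# → {'family_a': 5, 'family_b': 5}
```

With real image files, the same loop would typically read each image with an image library and write the transformed copies into the class folder. Only the training split should be augmented, so that evaluation still reflects unmodified samples.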
Full paper details are available at: http://www.jatit.org/volumes/Vol101No4/1Vol101No4.pdf