GithubHelp home page GithubHelp logo

dasarijayanth / malware-detection-in-pe-files-using-machine-learning Goto Github PK

View Code? Open in Web Editor NEW
18.0 1.0 8.0 137.03 MB

Detecting Malware in PE files

Jupyter Notebook 71.86% Assembly 26.97% Python 1.17%
machine-learning python-3 pefile malware-analysis malware-detection hybridstaticanalysis staticanalysis ngrams opcodengrams bytengrams

malware-detection-in-pe-files-using-machine-learning's Introduction

Malware-Detection-in-PE-files-using-ML

Every thing about this project is explained in detail in FER(Final Evaluation report).

Project Description

This project aims to detect malware in PE (Portable Executable) files using Machine Learning techniques. We have developed a model that analyzes PE files and predicts whether they contain malware or not using hybrid static malware analysis(combination of PE Headers, byte-n-grams and opcode-n-grams features). This project can be a valuable tool in enhancing cybersecurity measures and protecting systems against malicious software/files.

Dataset(s):

  1. PE files csv, containing metadata, header information Dataset.
  2. byte and asm raw files, from kaggle microsoft malware classification challenge (BIG 2015) Dataset.

Preprocessing/Feature Extraction:

For Dataset-1:

  1. Dataset already has values for the features.
  2. Created a script to extract those features and header information from the given PE files like .exe, .dll file types.
  3. Used Extra-Trees classifier for the feature selection, important feature set from all the available information.

For Dataset-2:

  1. Dataset has raw byte and asm files. Created seperate directories for each type and extracted file size as a feature for each file.
  2. Extracted N-grams from byte(byte-n-grams, where n= 1,2) and asm files(prefixes/keywords/registers/opcode-n-grams, where n= 1,2,3,4) as the features from each file.
  3. Converted asm files to image and extracted top performing 200 image pixels as features from that image.
  4. Used Random Forest for important features selection from all the above features separately for each feature set and merged them.

Final dataset contains the following features.

  1. PE Header dataset
  2. Byte unigrams
  3. Opcode unigrams
  4. Top 300 Byte bigrams
  5. Top 200 Opcode bigrams
  6. Top 200 Opcode trigrams
  7. Top 200 Opcode tetragrams
  8. Top 200 Image Pixels

Training (build/train/evaluate)

Trained various ML models on the above final dataset for the classification of files into malware/benign.
Evaluation metrics used are accuracy, f1 score, confusion matrix.
Random Forest model performed best among others like Gradient Boost, SVM.

you can download the trained Random Forest model here.

Setup

Clone the repository to your local machine:

git clone https://github.com/DasariJayanth/Malware-Detection-in-PE-files-using-Machine-Learning.git  

Once you cloned the repository create a virtual environment using

python3 -m venv .venv

you might be required to set the policies to authorize the acivation of env

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser  

Activate the environment:

source .venv/bin/activate

Next install the required libraries using:

pip install -r requirements.txt

Perform Feature extraction on your data as done in the PE_Header(exe, dll files)/malware_test.py and Ngrams(byte, asm files)/N-grams.ipynb. Also refer Malware Detection Model.ipynb for merging both feature sets before predicting with the model.

Load the models/RF_model.pkl and run the loaded model on the extracted features for prediction.

After you are done, Deactivate the virtual environment:

deactivate  

malware-detection-in-pe-files-using-machine-learning's People

Contributors

dasarijayanth avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.