====================
Antivirus 
====================

An antivirus program written in Java that can scan a file and detect if it is
a virus using Bayesian analysis.


USAGE
--------------------
There are 5 buttons:

1) Open Directory - Choose a directory. If "virusdb.ser" exists in the 
directory, the previous save state will automatically be loaded. Otherwise,
a new database will be created at runtime.

2/3) Learn Benign Files/Viruses - Choose a directory containing the known
benign files/viruses in order to train the program.

4) Clear Database - Clears the current working database and chosen directory. 
No files will be deleted.

5) Scan File - Choose a file. The program will then scan the file and calculate
the ratio of virus/benign based on the PROBABILITY CALCULATION method below.
Then, the program predicts whether the file is a virus or not based on the
ratio.


In order to use the program, you have to train it. Start by clicking "Learn
Benign Files" and "Learn Viruses." These buttons will prompt you to choose the
directory in which the known benign files/viruses are stored. Then, the
program will scan the files and count the n-grams for each file (my program
uses 4-character sequences). While the program is learning, there will be no
output: for some reason, it waits until the end of the learning to print
anything to the console. Learning may take up to 5 seconds, and the program
will prompt you when it is done.
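
The counting step described above can be sketched roughly as follows. This is
only an illustration, not the project's actual code: the class and method
names are hypothetical, and it assumes that the 4-character sequences are
overlapping windows over the raw bytes of each file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; names are illustrative and not taken from the project.
class NGramCounter {
    static final int N = 4; // the README states 4-character sequences

    // Adds every overlapping N-byte window of data to counts (assumed windowing).
    static void countNGrams(byte[] data, Map<String, Integer> counts) {
        for (int i = 0; i + N <= data.length; i++) {
            // ISO-8859-1 maps each byte to exactly one char, so no byte is lost.
            String gram = new String(data, i, N, StandardCharsets.ISO_8859_1);
            counts.merge(gram, 1, Integer::sum);
        }
    }

    // Counts the n-grams of every regular file directly inside dir.
    static Map<String, Integer> learnDirectory(Path dir) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    countNGrams(Files.readAllBytes(p), counts);
                }
            }
        }
        return counts;
    }
}
```

In this sketch, learnDirectory would be called once for the benign directory
and once for the virus directory, giving one count table per class.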

On exit, the program will ask you if you want to save. If you want to save, 
you must first choose a directory by clicking "Open Directory." The serialized
data will be saved as "virusdb.ser" in the chosen directory.

The top-right panel contains the current directory as well as the number of
files that have been used to train the program in the current session.


PROBABILITY CALCULATION
-----------------------
I calculated probabilities using this method:

http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Document_Classification

When a file is scanned, I compute the natural log of the ratios, treating each
n-gram in the file as a "word". The formula is as follows:

ln[p(virus|file)/p(not virus|file)] = sum ln[p(word|virus)/p(word|not virus)]

If the sum of the logs is greater than 0, then the file is classified as a
virus. If the sum is less than 0, then the file is classified as benign.

N-grams that have not been seen in the training phase are skipped.
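
As a rough sketch of this scoring step (not the project's actual code), the
sum of log ratios could be computed as below. The names are hypothetical, and
two points are assumptions: p(word|class) is estimated as the n-gram's count
divided by the total count in that class's table, and an n-gram is skipped
when it is missing from either table, since the ratio is undefined otherwise.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical scorer; names and the probability estimate are assumptions.
class LogRatioScorer {
    static final int N = 4; // 4-character sequences, as in training

    // Total number of n-gram occurrences recorded in a table.
    static long total(Map<String, Integer> counts) {
        long sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    // Returns sum ln[p(gram|virus)/p(gram|benign)] over the file's n-grams.
    // A positive result means "virus", a negative result means "benign".
    static double score(byte[] file,
                        Map<String, Integer> virus,
                        Map<String, Integer> benign) {
        double virusTotal = total(virus);
        double benignTotal = total(benign);
        double sum = 0.0;
        for (int i = 0; i + N <= file.length; i++) {
            String gram = new String(file, i, N, StandardCharsets.ISO_8859_1);
            Integer v = virus.get(gram);
            Integer b = benign.get(gram);
            if (v == null || b == null) {
                continue; // assumption: skip n-grams unseen in either table
            }
            sum += Math.log((v / virusTotal) / (b / benignTotal));
        }
        return sum;
    }
}
```

Summing logs instead of multiplying the raw ratios avoids floating-point
underflow or overflow when a file contains many n-grams.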

Overall, this method is okay at categorizing files. There are quite a few false
negatives, meaning that virus files are classified as benign. I believe that
this is caused by the unevenness of the two training directories. Although
there are more virus files, there are more n-grams in the benign directory.
Therefore, the counts are generally higher in the benign hash table, skewing
the results a bit towards the benign side in cases where viruses

OTHER INFO
----------------------
When the state is saved as "virusdb.ser", the VirusDB object is serialized.
VirusDB contains two hash tables (one for virus files and one for benign
files), a list of the files used for training, the number of files used for
training, and the directory the file was saved in.
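
For illustration only, a class with the contents listed above could be saved
and restored with standard Java serialization roughly like this. The class
name VirusDbSketch and all field and method names are hypothetical; the real
VirusDB class may be laid out differently.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

// Hypothetical stand-in for VirusDB; field names are assumptions.
class VirusDbSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    static final String FILE_NAME = "virusdb.ser"; // file name from the README

    final Hashtable<String, Integer> virusCounts = new Hashtable<>();
    final Hashtable<String, Integer> benignCounts = new Hashtable<>();
    final List<String> trainingFiles = new ArrayList<>();
    int trainingFileCount;
    String directory;

    // Writes the whole object to virusdb.ser inside dir.
    static void save(VirusDbSketch db, File dir) throws IOException {
        File target = new File(dir, FILE_NAME);
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(target))) {
            out.writeObject(db);
        }
    }

    // Loads virusdb.ser from dir, or returns a new empty database if absent.
    static VirusDbSketch load(File dir)
            throws IOException, ClassNotFoundException {
        File source = new File(dir, FILE_NAME);
        if (!source.exists()) {
            return new VirusDbSketch();
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(source))) {
            return (VirusDbSketch) in.readObject();
        }
    }
}
```

The load method mirrors the "Open Directory" behavior described under USAGE:
an existing "virusdb.ser" is loaded, otherwise a new database is created.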
