ml4se-offside's Introduction

Learning Mistakes in Boundary Conditions: A First Exploration

Description: 📰

Mistakes in boundary conditions are the cause of many bugs in software. These mistakes happen when, e.g., developers use '<' or '>' in cases where they should have used '<=' or '>='. Mistakes in boundary conditions are often hard to find, and detecting them manually can be very time-consuming for developers. While researchers have been proposing techniques to cope with mistakes in boundaries for a long time, the automated detection of such bugs remains a challenge. We conjecture that, for a tool to precisely identify mistakes in boundary conditions, it should be able to capture the overall context of the source code under analysis. In this work, we propose a deep learning model that learns mistakes in boundary conditions and is later able to identify them in unseen code snippets. We train and test a model on over 1.5 million code snippets, with and without mistakes in different boundary conditions. Our model shows an accuracy from 55% up to 87%. The model is also able to detect 24 out of 41 real-world bugs, albeit with a high false positive rate. The existing state-of-the-practice linter tools are not able to detect any of these bugs. We hope this paper can pave the road towards deep learning models that will be able to support developers in detecting mistakes in boundary conditions.
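To make this class of bug concrete, here is a small illustrative sketch. It is written in Python for brevity, although the project itself analyses Java code. The function names are hypothetical and are not taken from the repository. The two functions differ only in the comparison operator.

```python
def fits_in_buffer_buggy(n_items: int, capacity: int) -> bool:
    """Off-by-one: wrongly rejects the case where the buffer is filled exactly."""
    return n_items < capacity


def fits_in_buffer_fixed(n_items: int, capacity: int) -> bool:
    """Intended check: a buffer of size `capacity` holds up to `capacity` items."""
    return n_items <= capacity


if __name__ == "__main__":
    # The two versions only disagree at the boundary value itself (n == 10).
    for n in (9, 10, 11):
        print(n, fits_in_buffer_buggy(n, 10), fits_in_buffer_fixed(n, 10))
```

Because the two versions agree on every input except the boundary value, a test suite that never exercises that exact value will not reveal the mistake.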

Data Collection: 💾

Setting Up the Environment: 📋

  • Install conda
  • Create a new environment:

    conda env create -f environment.yml

  • Update environment:

    conda env update --file environment.yml --prune

  • No GPU support? Then replace tensorflow-gpu==2.0.0-rc1 with tensorflow==2.0.0-rc1.
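The replacement can also be scripted. The following is a minimal sketch, assuming the requirement appears literally as tensorflow-gpu==<version> inside environment.yml. That location is an assumption, and the helper names are hypothetical.

```python
from pathlib import Path


def use_cpu_tensorflow(env_text: str) -> str:
    """Swap the GPU package for the CPU-only package, keeping the version pin."""
    # ASSUMPTION: the dependency is written as "tensorflow-gpu==<version>".
    return env_text.replace("tensorflow-gpu==", "tensorflow==")


def patch_environment_file(path: str = "environment.yml") -> None:
    """Rewrite the environment file in place (the default path is an assumption)."""
    env_file = Path(path)
    env_file.write_text(use_cpu_tensorflow(env_file.read_text()))
```

After patching the file, the environment can be created or updated with the conda commands above.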

Extracting Custom Code2Vec Model From the Original Weights:

  • Download the pre-trained model.
  • Make sure this is the trainable model.
  • Unzip this folder into the resources folder in the project root.
  • Download and unzip custom_model.zip into resources/models.
  • Run main.py. The custom model can now be used in any TensorFlow graph.

Running the model on a single Java file

Assuming you have set up the Anaconda environment, downloaded the pre-trained model and stored it at resources/models/custom3/model, and have a Java file named Input.java, you can run the model using the following command:

    python main.py --weights resources/models/custom3/model --input Input.java

Running the model on a Java project

  • Create a JAR as instructed in the JavaExtractor folder.
  • If it does not exist yet, create a folder called data in the project root.
  • Adjust the paths in the command in the next step.
  • Run the following command from the data folder:

    java -cp ~/path/to/project/JavaExtractor/JPredict/target/JavaExtractor-0.0.2-SNAPSHOT.jar JavaExtractor.App --max_path_length 8 --max_path_width 2 --max_contexts 200 --dir path/to/java/project/to/test/ --evaluate > evaluate.txt

  • If you extracted the model properly in the previous steps, you can now run evaluate.py.
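The extraction step can also be driven from Python, which avoids retyping the long command. This is a minimal sketch: the flags and values are copied from the command above, while the helper names and the idea of wrapping the call in a script are assumptions and not part of the repository.

```python
import subprocess
from typing import List


def build_extractor_command(jar_path: str, project_dir: str) -> List[str]:
    """Assemble the JavaExtractor invocation as an argument list."""
    return [
        "java", "-cp", jar_path, "JavaExtractor.App",
        "--max_path_length", "8",
        "--max_path_width", "2",
        "--max_contexts", "200",
        "--dir", project_dir,
        "--evaluate",
    ]


def extract_to_file(jar_path: str, project_dir: str,
                    output_file: str = "evaluate.txt") -> None:
    """Run the extractor and redirect its standard output to output_file."""
    with open(output_file, "w") as out:
        # check=True raises CalledProcessError if the extractor exits with an error.
        subprocess.run(build_extractor_command(jar_path, project_dir),
                       stdout=out, check=True)
```

Passing the arguments as a list rather than a single shell string sidesteps quoting problems when the project path contains spaces.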

Training the model

  • Before training can start, the dataset must be encoded into NumPy format. This can be done using the following command:

    python encode_data_set.py --dataset <path_to>/java-large.txt --output <path_to_data_folder>/ --prefix <train, val, or test>

    This encodes the dataset by creating the following files: path_source_token_idxs.npy, path_idxs.npy, path_target_token_idxs.npy, context_valid_masks.npy and Y.npy. Do this for both the training set and the validation set, since both are needed to train a model.
  • Once the training and validation sets are encoded, train the model using the following command:

    python train.py --trainset <path_to_data_folder>/<train_prefix> --valset <path_to_data_folder>/<val_prefix> --batch_size 1024 --pre_trained_weights <optional path_to_code2vec model> -freeze <True or False> --output <path_to_weight_output_folder>

    Note that <path_to_data_folder>/<prefix> must be the same for all 5 files created in the encoding step, as this script loads them all automatically.
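Since train.py expects all five encoded files to share one prefix, it can help to check that they exist before starting a long training run. The sketch below is only illustrative. The five file names come from the encoding step above, but how the prefix is joined to each name (here "<prefix>_<name>") is an assumption, so adjust the sep argument to match what encode_data_set.py actually writes.

```python
from pathlib import Path
from typing import List

# File names listed in the encoding step.
ENCODED_FILE_NAMES = (
    "path_source_token_idxs.npy",
    "path_idxs.npy",
    "path_target_token_idxs.npy",
    "context_valid_masks.npy",
    "Y.npy",
)


def expected_files(data_folder: str, prefix: str, sep: str = "_") -> List[Path]:
    """List the files one encoded split should consist of."""
    # ASSUMPTION: files are named "<prefix><sep><name>"; the real scheme may differ.
    return [Path(data_folder) / f"{prefix}{sep}{name}" for name in ENCODED_FILE_NAMES]


def missing_files(data_folder: str, prefix: str, sep: str = "_") -> List[Path]:
    """Return the expected files that are not present on disk."""
    return [p for p in expected_files(data_folder, prefix, sep) if not p.is_file()]
```

An empty result from missing_files for both the training and the validation prefix suggests the encoding step completed for both splits.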

Testing the model

The model can be tested in two ways.

  1. Using the command:

     python validation_on_testset.py --weights <path_to_model_weights_folder> --dataset <path_to_data_folder>/<test_prefix> --threshold 0.5 --batch_size 1024

     This tests the model against an unseen test set. The script outputs the confusion matrix, test_loss, accuracy, f1_score, precision_score and recall_score.
  2. Using the command:

     python calc_prediction_stats.py --weights <path_to_model_weights_folder> --dataset <path_to_data_folder>/<test_prefix> --threshold 0.5 --batch_size 1024 --output <path>/stats.csv

     This calculates the TP, TN, FP and FN counts per off-by-one type. The result is stored in the stats.csv file.
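The relationship between the threshold, the TP/TN/FP/FN counts and the reported scores can be sketched as follows. This is not the code of the two scripts above. It uses the standard metric definitions and assumes that label 1 means "contains a boundary mistake" and that a probability equal to the threshold counts as a positive prediction. All function names are hypothetical.

```python
from typing import Dict, Sequence


def confusion_counts(probabilities: Sequence[float], labels: Sequence[int],
                     threshold: float = 0.5) -> Dict[str, int]:
    """Threshold the predicted probabilities and count TP, TN, FP and FN."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for probability, label in zip(probabilities, labels):
        # ASSUMPTION: ">=" is used at the threshold; the real scripts may use ">".
        predicted_positive = probability >= threshold
        if predicted_positive and label == 1:
            counts["TP"] += 1
        elif predicted_positive:
            counts["FP"] += 1
        elif label == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def summary_metrics(counts: Dict[str, int]) -> Dict[str, float]:
    """Derive accuracy, precision, recall and F1 from the four counts."""
    tp, tn, fp, fn = counts["TP"], counts["TN"], counts["FP"], counts["FN"]
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Raising the threshold generally trades recall for precision, which is relevant given the high false positive rate mentioned in the description.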

Contributors

hsellik, j0rd1smit, jonbriem, pdrapoport, mauricioaniche, lizhmq