
Companion page for the JTeC dataset (http://doi.org/10.5281/zenodo.3711509). Screencast demonstrating the use of the dataset (https://youtu.be/V5mQPkOiJEo).

License: GNU General Public License v3.0


JTeC: A Large Collection of Java Test Classes for Test Code Analysis and Processing

This repository is the companion for the dataset:

F. Corò, R. Verdecchia, E. Cruciani, B. Miranda, and A. Bertolino, "JTeC: A Large Collection of Java Test Classes for Test Code Analysis and Processing". DOI: 10.5281/zenodo.3711509

It contains the implementation of all the steps required in order to generate our dataset, including: (i) filtering of GitHub repositories, (ii) Java repository selection, (iii) test class identification, (iv) repository selection, (v) local storage of test classes, and (vi) quality filtering.

It also contains the quality filter script quality_filter.py, which can be used to explore and trim the dataset according to user-defined quality criteria.

You can cite the dataset in BibTeX via

@misc{JTeC2019,
  author       = {Corò, Federico and
                  Verdecchia, Roberto and
                  Cruciani, Emilio and
                  Miranda, Breno and
                  Bertolino, Antonia},
  title        = {{JTeC: A Large Collection of Java Test Classes 
                   for Test Code Analysis and Processing}},
  month        = may,
  year         = 2019,
  note         = {{Companion page for the JTeC dataset at 
                   https://github.com/JTeCDataset/JTeC}},
  doi          = {10.5281/zenodo.3711509},
  url          = {https://doi.org/10.5281/zenodo.3711509}
}

Dataset replication

To replicate the dataset, follow these steps:

  1. Clone the repository

    • git clone https://github.com/JTeCDataset/JTeC
  2. Make sure to satisfy the following requirements:

    • Have Python 3.0+ installed
    • Possess a valid GitHub username and personal GitHub access token
  3. Modify the file tokens.txt by replacing its fields with your personal GitHub username and access token

  4. Execute the script which launches in sequential order the JTeC generation steps (see Section "JTeC generation steps")

    • sh JTeC_generator.sh
  5. Run the quality filter script

    • python3 quality_filter.py

JTeC generation steps

The steps required to generate the dataset are implemented in the following scripts, which must be executed sequentially in the order given below:

  1. repository_filtering.py - Script generating an index of GitHub public repositories (Step 1).
    The final output of this script consists of a local .csv file containing, for each indexed public repository, the following fields: repositoryID, username of the repository creator, name of the repository, and the programming languages associated with the repository.

  2. selection_test_count.py - Script selecting Java repositories (Step 2) and identifying the test classes of the selected repositories (Step 3). This script takes as a parameter the programming language to be considered for the generation of the dataset, e.g. python3 selection_test_count.py Java.
    The final output of this script consists of a local .csv file containing the following information: user, repository, id, hash, date, n_tests, fork_id.

  3. select.py - Script selecting, for each forked project, either the original or the fork, according to which one contains more test classes (Step 4).
    The final output of this script consists of a local .csv file containing the following information: user, repository, id, hash, date, n_tests, fork_id.

  4. download_tests.py - Script downloading the test classes of the repositories selected by select.py (Step 5).
    This script takes as input the list of repositories for which the test classes should be downloaded.
    The final output of this script is: (i) the source code of all identified test classes, and (ii) a .csv file containing the following fields: user, repository, id, fork_id, hash, date, n_tests, SLOC, size.

  5. quality_filter.py - Script cleaning the raw dataset obtained in the previous step. The next section describes this script in more detail.
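The .csv indexes produced by these scripts can be consumed with Python's standard csv module. A minimal sketch, using invented sample rows with the fields listed for download_tests.py:

```python
import csv
import io

# Hypothetical two-row sample mirroring the fields produced by
# download_tests.py (field names taken from the description above;
# the values are invented for illustration).
sample = """user,repository,id,fork_id,hash,date,n_tests,SLOC,size
alice,projA,1,,abc123,2019-01-02,12,3400,51200
bob,projB,2,1,def456,2019-03-04,3,800,9100
"""

def load_index(fp):
    """Parse the dataset index, converting the numeric columns."""
    rows = []
    for row in csv.DictReader(fp):
        row["n_tests"] = int(row["n_tests"])
        row["SLOC"] = int(row["SLOC"])
        row["size"] = int(row["size"])
        rows.append(row)
    return rows

rows = load_index(io.StringIO(sample))
total_tests = sum(r["n_tests"] for r in rows)
print(total_tests)  # 15
```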

JTeC Quality Filter

JTeC provides a simple method to trim the dataset so that it satisfies some quality criteria, e.g., test suite size measured as the number of test classes in each test suite. The criteria can be customized by simply editing the configuration file config.json.

Configuration file parameters

Customizable variables in the configuration file config.json:

  • BOOL_TS_Clone: Copy Output Dataset In New Folder (Values: true, false)
  • BOOL_TS_Index: Create Test Suite Index (Values: true, false)
  • BOOL_TS_Original: Select Original Projects (Values: true, false)
  • BOOL_TS_Fork: Select Fork Projects (Values: true, false)
  • MIN_TS_Year: Lower Bound on Project's Test Suite Years Range (Values: 0,1,2,...)
  • MAX_TS_Year: Upper Bound on Project's Test Suite Years Range (Values: 0,1,2,...; Unbounded: -1)
  • MIN_TS_Size: Lower Bound on Total Number of Project's Test Cases - Test Suite Size (Values: 0,1,2,...)
  • MAX_TS_Size: Upper Bound on Total Number of Project's Test Cases - Test Suite Size (Values: 0,1,2,...; Unbounded: -1)
  • MIN_TS_SLOCs: Lower Bound on Total Number of SLOCs of Project's Test Suite (Values: 0,1,2,...)
  • MAX_TS_SLOCs: Upper Bound on Total Number of SLOCs of Project's Test Suite (Values: 0,1,2,...; Unbounded: -1)
  • MIN_TS_Bytes: Lower Bound on Total Number of Bytes of Project's Test Suite (Values: 0,1,2,...)
  • MAX_TS_Bytes: Upper Bound on Total Number of Bytes of Project's Test Suite (Values: 0,1,2,...; Unbounded: -1)
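For illustration, a config.json might look as follows; the keys are taken from the list above, while the values are examples only (here: keep original projects only, with at least 10 test classes and 100 SLOCs, no upper bounds):

```json
{
  "BOOL_TS_Clone": true,
  "BOOL_TS_Index": true,
  "BOOL_TS_Original": true,
  "BOOL_TS_Fork": false,
  "MIN_TS_Year": 0,
  "MAX_TS_Year": -1,
  "MIN_TS_Size": 10,
  "MAX_TS_Size": -1,
  "MIN_TS_SLOCs": 100,
  "MAX_TS_SLOCs": -1,
  "MIN_TS_Bytes": 0,
  "MAX_TS_Bytes": -1
}
```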

Quality Filter Script

After customizing the configuration file, run the quality filter script quality_filter.py via python3 quality_filter.py
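The actual logic of quality_filter.py is in the repository; as a minimal sketch of how such MIN/MAX bounds could be applied (with -1 treated as unbounded on the upper side, as in the parameter list above), assuming hypothetical suite records with n_tests and SLOC fields:

```python
import io
import json

# Illustrative config; keys follow the parameter list above, values are examples.
config = json.load(io.StringIO("""
{"MIN_TS_Size": 10, "MAX_TS_Size": -1,
 "MIN_TS_SLOCs": 100, "MAX_TS_SLOCs": 5000}
"""))

def within(value, lower, upper):
    """True if value is in [lower, upper]; an upper bound of -1 means unbounded."""
    return value >= lower and (upper == -1 or value <= upper)

def keep(suite, cfg):
    """True if a test suite satisfies the size and SLOC criteria."""
    return (within(suite["n_tests"], cfg["MIN_TS_Size"], cfg["MAX_TS_Size"])
            and within(suite["SLOC"], cfg["MIN_TS_SLOCs"], cfg["MAX_TS_SLOCs"]))

print(keep({"n_tests": 12, "SLOC": 3400}, config))  # True
print(keep({"n_tests": 3, "SLOC": 800}, config))    # False (too few test classes)
```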

Utility files

In addition to the scripts described in Section "JTeC generation steps", the dataset generation process makes use of two utility scripts and one utility file, namely:

  • request_manager.py - Script managing all GitHub requests and handling possible errors arising at request time, eventually returning a specific error number to the script that issued the request.
  • credentials.py - Script loading from the file tokens.txt the username and access tokens required to query the GitHub API.
  • tokens.txt - Text file containing the GitHub username and personal GitHub access token.
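A loader along the lines of credentials.py might look like the sketch below. Note that the "one username:token pair per line" layout is an assumption for illustration; check credentials.py for the layout tokens.txt actually uses.

```python
import os
import tempfile

def load_credentials(path):
    """Read (username, token) pairs from a tokens file.

    ASSUMPTION: one 'username:token' pair per line; the real format
    expected by credentials.py may differ.
    """
    pairs = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            user, token = line.split(":", 1)
            pairs.append((user, token))
    return pairs

# Demonstration with a throwaway file (the token value is a placeholder)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("octocat:ghp_exampletoken\n")
    path = fh.name
creds = load_credentials(path)
os.remove(path)
print(creds)  # [('octocat', 'ghp_exampletoken')]
```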
