
collab-uniba / emse_best-answer-prediction

Dataset, scripts, and additional material for the EMSE submission "Best-Answer Prediction in Technical Q&A Sites"

License: MIT License

Shell 2.47% Python 13.18% R 84.35%
stackoverflow tuning best-answer-prediction learners dataset r question-answering questions-and-answers


Dataset, scripts, and additional material for the paper "Best-Answer Prediction in Technical Q&A Sites"

Citation

F. Calefato, F. Lanubile, and N. Novielli (2018) “An Empirical Assessment of Best-Answer Prediction Models in Technical Q&A Sites.” Empirical Software Engineering Journal, DOI: 10.1007/s10664-018-9642-5

@Article{Calefato2018,
  author="Calefato, Fabio
  and Lanubile, Filippo
  and Novielli, Nicole",
  title="An empirical assessment of best-answer prediction models in technical Q{\&}A sites",
  journal="Empirical Software Engineering",
  year="2018",
  month="Aug",
  day="07",
  issn="1573-7616",
  doi="10.1007/s10664-018-9642-5",
  url="https://doi.org/10.1007/s10664-018-9642-5"
}

Original data dumps

Original dumps refer to the data extracted "as is" from the following technical Q&A sites:

Download

The data dumps and the description of their file formats are available from here.

Experimental datasets

The datasets containing the features extracted from the data dump of each Q&A site are available for download here. A description of each feature is also available.

Python and R scripts

Setup

To ensure proper execution, first run the following commands, which check for the required R and Python packages and install any that are missing.

$ Rscript requirements.R
$ pip install -r requirements.txt

Automated parameter tuning

To start the automated parameter tuning via caret, run the run-tuning.sh script as described below.

$ run-tuning.sh models_file data_file
  • The models_file param indicates the file containing (one per line) a list of models (learners) to be tuned. See the file models/models.txt for an example.
  • The data_file param indicates the file containing the data to be used for the tuning stage.
  • As output, a TXT file will be created under the output/tuning/ subfolder for each tuned model, containing the best param configuration and execution times.

Note. The tuning step is very time consuming and will take several hours for each model; the more models in the input file, the longer the script will take to finish.

Default (untuned) model performance

To compute the AUC performance of the models with their default parameter settings, run the script below.

$ sh run-default-predictions.sh path/to/input/so-dataset.csv path/to/models/models.txt
  • As output, the file output/untuned/AUC-all-models.txt will be created with the AUC values.

Note. The prediction step is very time consuming and will take several hours to complete; the more models in the input file and the larger the dataset chosen, the longer the script will take to finish.
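As a reminder of what the AUC values mean, the sketch below computes AUC as the fraction of (positive, negative) pairs that a model ranks in the right order. It is a minimal illustration in plain Python, not the code used by the repository scripts; the function name auc_score and the example labels and scores are hypothetical.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = accepted (best) answer, 0 = any other answer.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
# 5 of the 6 positive/negative pairs are ordered correctly.
print(auc_score(labels, scores))
```

The pairwise loop is quadratic in the number of cases, so it is only meant for small examples such as this one.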

Scott-Knott ESD model clustering

To cluster the models by AUC performance into non-overlapping groups, run the following scripts:

$ python collect-metrics.py --in path/to/metrics/folder.txt --out outfile --ext file_extension --sep field_sep --runs N
  • path/to/metrics/folder.txt - where the tuning script stored the execution log per model for each run
  • outfile - the name of the file in which to store the following main metrics per model per run:
    • AUC
    • F1
    • G-mean
    • Balance
    • Time taken
  • file_extension - the extension of the output file, chosen in {txt, csv, xls}
  • field_sep - the character used to separate fields in the output file, either , or ;
  • N - the number of runs used in the tuning step (e.g., 10, 100)
$ Rscript skesd-test metrics_outfiles runsN 
  • metrics_outfiles - the file with metrics generated by the Python script at the previous step
  • runsN - the number of runs, must match the same param from the previous step
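For readers unfamiliar with the metrics collected above, the following sketch derives F1, G-mean, and Balance from the four cells of a binary confusion matrix. It assumes the definitions conventionally used in the prediction-model literature (G-mean as the geometric mean of recall and specificity, Balance as the normalized distance from the ideal point pf = 0, pd = 1); the repository scripts may compute them differently. The function name binary_metrics and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Derive common metrics from confusion-matrix counts.

    A rate whose denominator is zero falls back to 0.0.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # pd, true positive rate
    false_alarm = fp / (fp + tn) if (fp + tn) else 0.0   # pf, false positive rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = 1.0 - false_alarm
    if precision + recall:
        f1 = 2.0 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    g_mean = math.sqrt(recall * specificity)
    # Euclidean distance from the ideal point (pf=0, pd=1), scaled to [0, 1].
    distance = math.sqrt(false_alarm ** 2 + (1.0 - recall) ** 2)
    balance = 1.0 - distance / math.sqrt(2.0)
    return {"F1": f1, "G-mean": g_mean, "Balance": balance}


# Hypothetical counts for one model on one run.
metrics = binary_metrics(tp=40, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```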

Feature selection

The following script performs wrapper-based feature selection using the R package Boruta; for the sake of completeness, it also performs Correlation-based Feature Selection (CFS).

$ Rscript feature-selection.R dataset_file dataset_name featN
  • dataset_file - the dataset used for feature selection
  • dataset_name - the name of the dataset, chosen in {so, docusign, dwolla, scn, yahoo}; so by default
  • featN - the number of features to select, 10 by default
  • As output, the script will generate the file output/feature-selection/feature-subset.txt containing:
    • The output of Boruta
    • The output of CFS, with both Spearman and Pearson correlation values
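The output above reports both Pearson and Spearman correlation values. The sketch below shows, in plain Python, how the two coefficients differ: Pearson measures linear association, while Spearman is Pearson applied to ranks and so captures any monotonic relationship. It only illustrates the two coefficients; it is not an implementation of CFS or of the repository's R script, and the function names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical feature values with a monotonic but non-linear relationship.
feature = [1, 2, 3, 4, 5]
target = [1, 4, 9, 16, 25]
print("Pearson: ", pearson(feature, target))
print("Spearman:", spearman(feature, target))
```

On this data Spearman is higher than Pearson because the relationship is perfectly monotonic but not linear.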

Prediction experiment

Once the models have been tuned, you can execute the best-answer prediction experiment. Run the run-predictions.sh script as described below.

$ run-predictions.sh training_file models_file data_file
  • The training_file param indicates the file containing the dataset for training the learners.
  • The models_file param indicates the file containing (one per line) a list of models (learners) to be used in the prediction experiment.
  • The data_file param indicates the file containing the test dataset.
  • As output, the following folder and files will be created:
    • output/cm - containing a TXT file for each test set and model with the confusion matrix
    • output/misclassifications - containing a TXT file for each test set and model, listing the cases where wrong predictions (errors) occurred
    • output/plots - containing a ROC plot image file for each test set and model specified as input

Note. Before running the prediction experiment, the file test.R must be manually edited in order to customize the tuneGrid var (dataframe) containing the best param configuration for each learner model. As of now, the script contains the grids for the 4 models in the file models/top-cluster.txt.
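To make the confusion-matrix and misclassification outputs described above more concrete, here is a minimal sketch that tallies the four confusion-matrix cells and collects the wrongly predicted cases. It is an illustration only and does not reproduce the format of the files written by the repository scripts; the function name, the answer ids, and the labels are hypothetical.

```python
def confusion_and_errors(ids, actual, predicted):
    """Count TP/FP/TN/FN and list the misclassified cases.

    Labels are 1 for a best (accepted) answer and 0 otherwise.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    errors = []
    for case_id, truth, guess in zip(ids, actual, predicted):
        if truth == 1 and guess == 1:
            counts["TP"] += 1
        elif truth == 0 and guess == 0:
            counts["TN"] += 1
        elif truth == 0 and guess == 1:
            counts["FP"] += 1
            errors.append((case_id, "false positive"))
        else:
            counts["FN"] += 1
            errors.append((case_id, "false negative"))
    return counts, errors


# Hypothetical answer ids with true and predicted labels.
ids = ["a1", "a2", "a3", "a4", "a5"]
actual = [1, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0]
counts, errors = confusion_and_errors(ids, actual, predicted)
print(counts)
for case_id, kind in errors:
    print(case_id, kind)
```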

Testing the scripts

When executed without running the .sh files (e.g., via RStudio or Rscript), these scripts by default open the test file input/example.csv, which contains a few hundred lines from the Stack Overflow dataset. This test file is intended to show how the scripts work in general and the output they produce. Expect much longer execution times when running the scripts with the other input files.
