greg4cr / sbse-sigsoft-standard

Proposed ACM SIGSOFT Standard for Optimization Studies in SE (including SBSE).

Home Page: https://arxiv.org/abs/2010.03525

sbse-sigsoft-standard's Introduction

Optimization Studies in SE (including Search-Based Software Engineering)

Research studies that focus on the formulation of software engineering problems as search problems, and apply optimization techniques to solve such problems[1].

Application

This standard applies to empirical studies that meet the following criteria:

  • Formulates a software engineering task[2] as an optimization problem, with one or more specified fitness functions[3] used to judge success in this task.
  • Applies one or more approaches that generate solutions to the problem in an attempt to maximize or minimize the specified fitness functions.

Specific Attributes

We stress that the use of optimization in SE is still a rapidly evolving field. Hence, the following criteria are approximate and there may exist many exceptions to them.

Essential

  • Describe the search space (e.g., constraints, independent variable choices).
  • Explain why the problem cannot be optimized manually or by brute force within a reasonable timeframe[4].
  • Use realistic and limited simplifications and constraints for the optimization problem. Simplifications and constraints must not reduce the search to one where all solutions could be enumerated through brute force.
  • EITHER include a description of the prior state of the art in this area, OR carefully motivate and define the problem tackled and the solution proposed.
  • Justify the choice of algorithm[5] underlying an approach[6].
  • Compare approaches to a justified and appropriate baseline[7].
  • Explicitly define the solution formulation, including a description of what a solution represents[8], how it is represented[9], and how it is manipulated.
  • Explicitly define all fitness functions, including the type of goals that are optimized and the equations for calculating fitness values.
  • Explicitly define the evaluated approaches, including the techniques, specific heuristics, and the parameters and their values[10].
  • EITHER follow and clearly describe a sound process for collecting and preparing the datasets used to run and evaluate the optimization approach, making the data publicly available or explaining why this is not possible[11]; OR, if the subjects are taken from previous work, fully reference the original source and explain whether any transformation or cleaning was applied to the datasets.
  • Identify and explain all possible sources of stochasticity[12].
  • EITHER execute stochastic approaches or elements multiple times OR explain why this is not possible[13] (a minimal illustrative sketch follows this list).
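
As a concrete (and deliberately minimal) illustration of several of these attributes, the hypothetical Python sketch below declares a solution representation, an explicit fitness function, the source of stochasticity, and repeated independent runs. The representation, fitness definition, search budget, and number of repetitions are illustrative assumptions, not prescriptions of the standard.

```python
import random

SOLUTION_LENGTH = 10     # a solution is a fixed-length vector of integers
VALUE_RANGE = (0, 100)   # declared search space: each element lies in [0, 100]

def random_solution(rng):
    """Solution representation: a flat integer vector (e.g., a set of test inputs)."""
    return [rng.randint(*VALUE_RANGE) for _ in range(SOLUTION_LENGTH)]

def fitness(solution):
    """Hypothetical fitness (to be minimized): distance of the element sum from a target.
    A real study would define and justify this function explicitly."""
    return abs(sum(solution) - 500)

def random_search(rng, budget=1000):
    """Baseline optimizer: random search over the declared search space."""
    best = min((random_solution(rng) for _ in range(budget)), key=fitness)
    return fitness(best)

# Source of stochasticity: the seeded random number generator. The stochastic
# approach is executed multiple times with distinct seeds, and the full
# distribution of results is kept for later statistical comparison.
results = [random_search(random.Random(seed)) for seed in range(30)]
print(sorted(results))
```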

Desirable

  • Motivate the novelty and soundness of the proposed approach[14].
  • Explain whether the study explores a new problem type (or a new area within an existing problem space), or how it reproduces, replicates, or improves upon prior work.
  • Explain in detail how subjects or datasets were collected/chosen to mitigate selection bias and improve the generalization of findings.
  • Describe the main features of the subjects used to run and evaluate the optimization approach(es) and discuss what characterizes the different instances in terms of "hardness".
  • Justify the use of synthetic data (if any); explain why real-world data cannot be used; and discuss the extent to which the proposed approach and the findings can apply to the real world.
  • Make available a replication package that conforms to SIGSOFT standards for artifacts[15].
  • If data cannot be shared, create a sample dataset that can be shared to illustrate the approach.
  • Select a realistic option space for formulating a solution. Any values set for attributes should reflect those that might be chosen in a "real-world" solution, not values generated from an arbitrary distribution.
  • Justify the parameter values used when executing the evaluated approaches (and note that experiments trying a wide range of different parameter values would be extraordinary; see below).
  • Sample from data multiple times in a controlled manner (where appropriate and possible).
  • Perform multiple trials either as a cross-validation (multiple independent executions) or temporally (multiple applications as part of a timed sequence), depending on the problem at hand.
  • Make available random data splits (e.g., those used in data-driven approaches) or, at least, ensure splits are reproducible (a seeded-split sketch follows this list).
  • Compare distributions (rather than means) of results using appropriate statistics.
  • Compare solutions using appropriate meta-evaluation criteria[16], and justify the chosen criteria.
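
As one way to satisfy the reproducible-split attribute (a sketch only, assuming scikit-learn is available; the data and labels are synthetic stand-ins for real subjects):

```python
from sklearn.model_selection import train_test_split

data = list(range(100))         # stand-in for real subjects or problem instances
labels = [x % 2 for x in data]  # stand-in labels

# Fixing random_state makes the split reproducible; publishing the seed (or the
# resulting indices) lets others recreate exactly the same train/test partition.
train, test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, random_state=42, stratify=labels)
print(len(train), len(test))
```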

Extraordinary

  • Analyze different parameter choices to the algorithm, indicating how the final parameters were selected[17] (a simple parameter-sweep sketch follows this list).
  • Analyze the fitness landscape for one or more of the chosen fitness functions.
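
A simple sketch of such a parameter analysis (the parameter grid, the run_optimizer placeholder, and the mock scores are all hypothetical; a real study would run the actual optimizer and justify the final choice):

```python
import itertools
import random
import statistics

crossover_rates = [0.6, 0.75, 0.9]
mutation_rates = [0.01, 0.05, 0.1]

def run_optimizer(crossover, mutation, seed):
    """Placeholder for a real optimizer run; returns a mock quality score."""
    rng = random.Random(seed)
    return rng.gauss(crossover - mutation, 0.1)

# Evaluate every parameter combination over repeated seeded runs and report
# the median score, so the final configuration can be selected and justified.
for cx, mu in itertools.product(crossover_rates, mutation_rates):
    scores = [run_optimizer(cx, mu, seed) for seed in range(10)]
    print(f"crossover={cx} mutation={mu} median={statistics.median(scores):.3f}")
```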

General Quality Criteria

The most valuable quality criteria for optimization studies in SE include reliability, replicability, reproducibility, rigor, and usefulness (see Glossary).

Examples of Acceptable Deviations

  • The number of trials can be constrained by available time or experimental resources (e.g. where experiments are time-consuming to repeat or have human elements). In such cases, multiple trials are still ideal, but a limited number of trials can be justified as long as the limitations are disclosed and the possible effects of stochasticity are discussed.
  • The use of industrial case studies is important in demonstrating the real-world application of a proposed technique, but industrial data generally cannot be shared. In such cases, it is recommended that a small open-source example be prepared and distributed as part of a replication package to demonstrate how the approach can be applied.

Antipatterns

  • Reporting significance tests (e.g., the Mann-Whitney-Wilcoxon test) without effect size tests (see Notes).
  • Conducting multiple trials but failing to disclose or discuss the variation between trials; for instance, reporting a measure of central tendency (e.g., the median) without any indication of variance (e.g., a boxplot). A brief reporting sketch follows this list.
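
A minimal sketch of the kind of reporting that avoids the second antipattern (the trial values are hypothetical):

```python
import statistics

# Hypothetical results from repeated trials of one approach.
trials = [0.71, 0.74, 0.69, 0.80, 0.73, 0.77, 0.70, 0.75, 0.72, 0.78]

q1, _, q3 = statistics.quantiles(trials, n=4)  # quartiles as a simple spread measure
print(f"median={statistics.median(trials):.2f}, IQR=[{q1:.2f}, {q3:.2f}]")
```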

Invalid Criticisms

  • The paper is unimportant. Be cautious of rejecting papers that seem “unimportant” (in the eyes of a reviewer). Research is exploratory and is about taking risks. Clearly motivated research and speculative exploration are both important and should be rewarded.
  • The paper just uses older algorithms with no reference to recent work. Using older (and widely understood) algorithms may be valid, e.g., (1) as part of a larger set that compares many approaches; (2) to offer a “straw man” method that defines the performance “floor” (that everything else needs to beat); or (3) as a workbench within which one thing is changed (e.g., the fitness function) but everything else remains constant.
  • That an approach is not benchmarked against an inappropriate or unavailable baseline. If a state-of-the-art approach lacks an available and functional implementation, it is not reasonable to expect the author to recreate that approach for benchmarking purposes.
  • That a multi-objective approach is not compared to a single-objective approach by evaluating each objective separately. This is not a meaningful comparison because, in a multi-objective problem, the trade-off between the objectives is a major factor in result quality. It is more important to consider the Pareto frontiers and quality indicators.
  • That one or very few subjects are used, as long as the paper offers a reasonable justification for why this was the case.

Suggested Readings

  • Shaukat Ali, Lionel C. Briand, Hadi Hemmati, and Rajwinder Kaur Panesar-Walawege. 2010. A Systematic Review of the Application and Empirical Investigation of Search-Based Test Case Generation. IEEE Transactions on Software Engineering, vol. 36, no. 6, pp. 742–762. DOI: https://doi.org/10.1109/TSE.2009.52
  • Andrea Arcuri and Lionel Briand. 2014. A Hitchhiker's Guide to Statistical Tests for Assessing Randomized Algorithms in Software Engineering. Software Testing, Verification and Reliability, vol. 24, no. 3, pp. 219–250. DOI: https://doi.org/10.1002/stvr.1486
  • Amritanshu Agrawal, Tim Menzies, Leandro L. Minku, Markus Wagner, and Zhe Yu. 2020. Better Software Analytics via DUO: Data Mining Algorithms Using/Used-by Optimizers. Empirical Software Engineering, vol. 25, no. 3, pp. 2099–2136. DOI: https://doi.org/10.1007/s10664-020-09808-9
  • Bradley Efron and Robert J. Tibshirani. 1994. An Introduction to the Bootstrap. CRC Press.
  • Mark Harman, Phil McMinn, Jerffeson Teixeira Souza, and Shin Yoo. 2011. Search-Based Software Engineering: Techniques, Taxonomy, Tutorial. In Empirical Software Engineering and Verification, Lecture Notes in Computer Science, vol. 7007, pp. 1–59. DOI: https://doi.org/10.1007/978-3-642-25231-0_1
  • Vigdis By Kampenes, Tore Dybå, Jo E. Hannay, and Dag I. K. Sjøberg. 2007. A Systematic Review of Effect Size in Software Engineering Experiments. Information and Software Technology, vol. 49, no. 11–12, pp. 1073–1086. DOI: https://doi.org/10.1016/j.infsof.2007.02.015
  • M. Li, T. Chen, and X. Yao. 2020. How to Evaluate Solutions in Pareto-based Search-Based Software Engineering? A Critical Review and Methodological Guidance. IEEE Transactions on Software Engineering. DOI: https://doi.org/10.1109/TSE.2020.3036108
  • Nikolaos Mittas and Lefteris Angelis. 2013. Ranking and Clustering Software Cost Estimation Models through a Multiple Comparisons Algorithm. IEEE Transactions on Software Engineering, vol. 39, no. 4, pp. 537–551.
  • Guenther Ruhe. 2020. Optimization in Software Engineering - A Pragmatic Approach. In Felderer, M. and Travassos, G.H. (eds.), Contemporary Empirical Methods in Software Engineering, Springer. DOI: https://doi.org/10.1007/978-3-030-32489-6_9

Exemplars

  • Hussein Almulla, Gregory Gay. 2020. Learning How to Search: Generating Exception-Triggering Tests Through Adaptive Fitness Function Selection. In Proceedings of 13th IEEE International Conference on Software Testing (ICST’20). IEEE, 63-73. DOI: https://doi.org/10.1109/ICST46399.2020.00017
  • Jianfeng Chen, Vivek Nair, Rahul Krishna, and Tim Menzies. 2019. “Sampling” as a Baseline Optimizer for Search-Based Software Engineering. IEEE Transactions on Software Engineering, vol. 45, no. 6. DOI: https://doi.org/10.1109/TSE.2018.279092
  • José Campos, Yan Ge, Nasser Albunian, Gordon Fraser, Marcelo Eler and Andrea Arcuri. 2018. An empirical evaluation of evolutionary algorithms for unit test suite generation. Information and Software Technology. vol. 104, pp. 207–235. DOI: https://doi.org/10.1016/j.infsof.2018.08.010
  • Martin S. Feather and Tim Menzies. 2002. Converging on the Optimal Attainment of Requirements. In Proceedings of the IEEE Joint International Conference on Requirements Engineering. IEEE.
  • G. Mathew, T. Menzies, N. Ernst, and J. Klein. 2017. "SHORT"er Reasoning About Larger Requirements Models. In Proceedings of the 2017 IEEE 25th International Requirements Engineering Conference (RE), Lisbon, Portugal, pp. 154–163. DOI: https://doi.org/10.1109/RE.2017.3
  • Annibale Panichella, Fitsum Meshesha Kifetew and Paolo Tonella. 2018. Automated Test Case Generation as a Many-Objective Optimisation Problem with Dynamic Selection of the Targets. IEEE Transactions on Software Engineering. vol. 44, no. 2, pp. 122–158. DOI: https://doi.org/10.1109/TSE.2017.2663435
  • Federica Sarro, Filomena Ferrucci, Mark Harman, Alessandra Manna and Jen Ren. 2017. Adaptive Multi-Objective Evolutionary Algorithms for Overtime Planning in Software Projects. IEEE Transactions on Software Engineering, vol. 43, no. 10, pp. 898-917. DOI: https://doi.org/10.1109/TSE.2017.2650914
  • Federica Sarro, Alessio Petrozziello, and Mark Harman. 2016. Multi-objective software effort estimation. In Proceedings of the 38th International Conference on Software Engineering (ICSE'16). Association for Computing Machinery, New York, NY, USA, 619–630. DOI: https://doi.org/10.1145/2884781.2884830
  • Norbert Siegmund, Stefan Sobernig, and Sven Apel. 2017. Attributed variability models: outside the comfort zone. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 268–278. DOI: https://doi.org/10.1145/3106237.3106251

Notes

Regarding the difference between "significance" and "effect size" tests: "significance" checks whether two distributions can be distinguished from each other, while "effect size" tests check whether the difference between distributions is "interesting" (and not just a trivially "small" effect). These tests can be parametric or non-parametric. For example, code for the parametric t-test/Hedges significance/effect tests endorsed by Kampenes et al. can be found at https://tinyurl.com/y4o7ucnx. Code for a parametric Scott-Knott/Cohen test of the kind endorsed by Mittas et al. is available at https://tinyurl.com/y5tg37fp. Code for the non-parametric bootstrap/Cliff's Delta significance/effect tests of the kind endorsed by Efron et al. and Arcuri et al. can be found at https://tinyurl.com/y2ufofgu.
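
As a sketch of pairing a non-parametric significance test with an effect size measure (this is not the code behind the links above; it assumes SciPy is installed, and the two samples are hypothetical results from repeated runs of two approaches):

```python
from scipy.stats import mannwhitneyu

approach_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
approach_b = [0.74, 0.76, 0.73, 0.77, 0.75, 0.72, 0.78, 0.74, 0.76, 0.75]

u_stat, p_value = mannwhitneyu(approach_a, approach_b, alternative="two-sided")

# Vargha-Delaney A12 effect size derived from the U statistic:
# A12 = U / (n * m); 0.5 means no effect, values near 0 or 1 mean large effects.
a12 = u_stat / (len(approach_a) * len(approach_b))

print(f"p-value={p_value:.4f}, A12={a12:.2f}")
```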

Footnotes

1: Note that there are many such optimization techniques (metaheuristics; numerical optimizers; constraint-solving and theorem-proving approaches such as SAT, SMT, and CSP; and others), some of which are stochastic.

2: E.g., test input creation, design refactoring, effort prediction.

3: A "fitness function", or "objective function", is a numerical scoring function used to indicate the quality of a solution to a defined problem. Optimization approaches attempt to maximize or minimize such functions, depending on whether lower or higher scores indicate success.

4: E.g., if the cross-product of the space of options is very large, or if performing the task manually would be prohibitively slow.

5: E.g., the numerical optimizer, the specific metaheuristic, the constraint solving method, etc.

6: For example, do not use an algorithm such as Simulated Annealing, or even a specific approach such as NSGA-II, to solve an optimization problem unless it is actually appropriate for that problem. While one rarely knows the best approach for a new problem, one should at least consider the algorithms applied to address similar problems and make an informed judgement.

7: If the approach addresses a problem never tackled before, then it should be compared, at least, to random search. Otherwise, compare the proposed approach to the existing state of the art.

8: E.g., a test suite or test case in test generation.

9: E.g., a tree or vector structure.

10: Example techniques: Simulated Annealing, Genetic Algorithms. Example heuristic: single-point crossover. Example parameters: crossover and mutation rates.

11: E.g., proprietary data, ethics issues, or a Non-Disclosure Agreement.

12: For example, stochasticity may arise from the use of randomized algorithms, from the use of a fitness function that measures a random variable from the environment (e.g., a fitness function based on execution time may return different results across different executions), or from the use of data sampling or cross-validation approaches.

13: E.g., the approach is too slow to run repeatedly, or it requires a human in the loop.

14: Reviewers should reward sound and novel work and, where possible, support a diverse range of studies.

15: Including, for example, source code (of approach, solution representation, and fitness calculations), datasets used as experiment input, and collected experiment data (e.g., output logs, generated solutions).

16: For example, if applying a multi-objective optimization approach, then use a criterion that can analyze the Pareto frontier of solutions (e.g., generational distance and inverse generational distance).
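
A simplified sketch of one such indicator, generational distance, computed here as the mean Euclidean distance from each obtained solution to its nearest reference point (several variants of the indicator exist; the two-objective fronts below are hypothetical):

```python
import math

obtained_front = [(0.2, 0.9), (0.5, 0.5), (0.8, 0.2)]
reference_front = [(0.1, 0.95), (0.4, 0.55), (0.7, 0.25), (0.9, 0.1)]

def generational_distance(front, reference):
    """Mean distance from each solution in `front` to its nearest reference point."""
    def nearest(point):
        return min(math.dist(point, ref) for ref in reference)
    return sum(nearest(p) for p in front) / len(front)

print(f"GD = {generational_distance(obtained_front, reference_front):.3f}")
```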

17: E.g., applying hyperparameter optimization.

sbse-sigsoft-standard's People

Contributors

drpaulralph, fedsar, greg4cr, rebecca-moussa, timm, timmenzies, xdevroey


sbse-sigsoft-standard's Issues

Add two papers

From Guenther Ruhe:

Guenther Ruhe, Optimization in Software Engineering - A Pragmatic Approach. In Felderer, M. and Travassos, G.H. eds., Contemporary Empirical Methods in Software Engineering, Springer, 2020.

From Mark Harman:

Mark Harman, Phil McMinn, Jerffeson Teixeira Souza, and Shin Yoo. Search-Based Software Engineering: Techniques, Taxonomy, Tutorial. Empirical Software Engineering and Verification, Lecture Notes in Computer Science, vol. 7007, pp. 1–59, 2011

too many examples just from SBSE

I have one point of critique, though. All examples seem to come from the field of SBSE. However, there are lots of other optimisation techniques. In particular, there's constraint solving (SAT, SMT, CSP, and others). These all deal with optimisation problems, and not all are stochastic solutions. Nevertheless, I would imagine most points apply (define search space, fitness/evaluation criteria, etc.).

I would thus suggest removing "(e.g., metaheuristics and evolutionary algorithms)" from the introduction and adding additional examples: e.g., change
"The algorithm underlying an approach (e.g., the metaheuristic)"
to "The algorithm underlying an approach (e.g., the metaheuristic, CDCL)"
and change:
"One should sample from data multiple times in a controlled manner."
to "One should sample from data multiple times in a controlled manner, where appropriate." (or something similar, as it's not always necessary)

I tried to find an exemplary paper that would compare SBSE and constraint-based approaches. There have been quite a few constraint-based approaches proposed for test input generation, but I wasn't sure which one to pick. Perhaps others might have suggestions.

Thanks again for all the efforts.

From Paul: Suggested contents of replication package

"To enable open science, a replication package should be made available that conforms to SIGSOFT standards for artifacts."

What are some of the possible components of this replication package? We should provide a list of suggested elements.

From Paul: Reformulate "support diverse range of studies" as specific attributes.

"We stress that the use of optimization in SE is still a rapidly evolving field. Hence, the following criteria are approximate and many exceptions exist to these criteria. Reviewers should reward sound and novel work and, where possible, support a diverse range of studies."

Paul states: "Take this paragraph out and try to build this flexibility into the essential attributes. Most readers are only going to read the essential, desirable, and extraordinary attributes so all the critical stuff has to go in there somehow".

From Paul: Clarify the following

  • Provide a detailed explanation of how subjects or datasets were collected and chosen in order to mitigate selection bias and improve the generalization of the findings. Describe the main features of the subjects used to run and evaluate the optimization approach(es) and discuss what characterizes the different instances in terms of "hardness". In the case of data-driven approaches, if random data splits are used, these should be made publicly available or at least reproducible. In the case of synthetic data, clearly explain why real-world data cannot be used, and to what extent the proposed approach and the findings are applicable to a real-world setting.

This point is (a) quite long, and (b) not immediately clear. Would suggest an editing pass.

add paper

From Lionel Briand: "We tackled many of these issues in our IEEE TSE 2010 paper, in an SBST context: Ali et al., 'A Systematic Review of the Application and Empirical Investigation of Search-Based Test-Case Generation'."

(add to recommended reading, see if any other advice from it that we should apply)

option space

From Twitter:

Daniel Struber - "Looks great, thanks for this! I have a remark: "one should at least consider the option space" I think that this criterion needs to be more specific to allow a fair application, given that the option space (any available optimization technique) is huge."

struggled with “The effects of stochasticity must be understood"

I also struggled with “The effects of stochasticity must be understood and accounted for at all levels (e.g., in the use of randomized algorithms, in fitness functions that measure a random variable from the environment, and in data sampling)” which seems a bit broad for a new problem where you don’t understand half of the stochasticity yet.

It may be good to clarify what we mean by "understood and accounted for" and "at all levels" and how this applies to well-studied problems vs new problems or new approaches.

comment on relevance

From Lionel Briand: "The only thing that I would contend with is the discussion about “importance” or what I would call relevance. Research, by definition, is exploratory and about taking risks. But in an engineering discipline the problem should be well defined, with clearly justified assumptions."

siegmund says

From Norbert Siegmund:

There are some important aspects missing, especially criteria for the input space (generation), a description of the search space and why it is truly an exponential problem, and a clear discussion of threats to validity. Please refer to our paper: https://t.co/sNQK4DaaTQ?amp=1

In our paper, together with Sven Apel and Stefan Sobernig, we analyzed papers optimizing software configurations with variability models. Here, features (or options) of a SW system are modeled together with attributes, such as performance. Goal: try to find the optimal config.

There are three validity issues common in most papers:

  • Non-realistic inputs: Attribute values have not been measured, but mostly generated from an arbitrary distribution that has no relation to value distributions in the wild. Optimizations should work with realistic data.
  • Exponential search space: Finding an optimal configuration among an exponential number of combinations of SW features is an NP-hard problem. However, if there are only a linear number of effects, we can simply compute every configuration's value with a function having a linear number of terms. Hence, the optimization task becomes trivial, because ignoring combination (or interaction) effects reduces the problem to a linear one (see the sketch after this list). We saw this problem simplification, through omitted interaction effects, in all analyzed optimization papers.
  • Threats to validity: We saw that early papers defining experimental setups have been reused by others. But the threats to validity of these early setups have not been addressed, and not even mentioned, by the papers reusing the setup. Hence, make your limitations explicit.
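
To illustrate the point about interaction effects above (this sketch is not from the issue; the feature names, base cost, and coefficients are hypothetical): if a configuration's value is modeled purely by per-feature main effects, the optimum can be read off feature by feature, so no search is required.

```python
# Hypothetical main-effect model of configuration cost (no interaction terms).
main_effects = {"featureA": 12.0, "featureB": -3.5, "featureC": 0.8}
base_cost = 100.0

def predicted_cost(config):
    """Cost under a purely linear model: base cost plus the effects of enabled features."""
    return base_cost + sum(e for f, e in main_effects.items() if config[f])

# With only linear effects, minimizing cost is trivial: enable a feature
# exactly when its effect is negative. No metaheuristic search is needed.
optimal = {f: e < 0 for f, e in main_effects.items()}
print(optimal, predicted_cost(optimal))
```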

From Paul: Should this attribute be moved to "desirable"?

  • Multiple trials can either be performed as a cross-validation (multiple independent executions) or temporally (multiple applications as part of a timed sequence), depending on the problem at hand.

Paul believes this should shift to desirable as it "may not always apply". Do we agree?

adding a point under the "essential" category

This is a great initiative, and I would like to thank the authors and everyone contributing.

I would suggest adding a point under the "essential" category related to datasets used for evaluation. It is important that there is an appropriate justification of the dataset used for evaluation, and a description of the main features of the dataset that characterise the different problem instances in terms of "hardness". For example, if the size of the problem instances is an important feature that affects the performance of the optimisation approach, the dataset should be described in terms of this feature.

Under the "desirable" category, I would suggest including something along the lines of "efforts should be made to avoid any bias in the selection of the dataset".

Under the "invalid criticism" I would include "the paper uses only one dataset". Reviewers should provide valid criticism as to why that single dataset is not sufficient, or ask for more clarification from the authors. Papers should not be rejected because the authors use a single dataset.
