GithubHelp home page GithubHelp logo

mh-malekpour / ml-satd-quantification Goto Github PK

View Code? Open in Web Editor NEW
2.0 1.0 2.0 971 KB

A research project analyzing and quantifying the resolution efforts of Self-Admitted Technical Debt (SATD) in Machine Learning (ML) projects.

Python 8.21% Jupyter Notebook 91.79%

ml-satd-quantification's Introduction

Quantifying Interest of Self-Admitted Technical Debt in Machine Learning Projects

This is the final project for the SOEN-691: Engineering AI-based Software Systems course.

Abstract

Technical debt (TD) in software development represents the long-term costs incurred from short-term solutions and workarounds. Self-Admitted Technical Debt (SATD) refers to situations where developers acknowledge the creation of TD through source code comments. This research focuses on identifying and quantifying the types of TD in Machine Learning (ML) applications that present greater resolution challenges. We utilize software engineering metrics to evaluate the effort required to resolve SATDs in ML projects, shedding light on prioritization needs in ML development.

Introduction

The concept of "technical debt" (TD) highlights the trade-offs between expedited completion and software quality. As ML models become integral to various domains, understanding the specific types of ML SATD that require more effort to resolve is crucial for effective ML development.

Methodology

Our study analyzes a dataset of SATD instances in ML projects, focusing on resolved SATDs to calculate the effort required for their removal. We employ metrics such as commit size, cyclomatic complexity, and fan-in to quantify this effort. Additionally, we use Principal Component Analysis (PCA) and weighted averages to rank the ML SATD types and pipeline stages based on their resolution effort.

Results

Our findings indicate that Modularity and Data Acquisition are the ML SATD types and pipeline stages, respectively, that demand more effort to resolve. Poor modularity often leads to challenges like circular dependencies, while TDs in the Data Acquisition stage, though quickly addressed, typically involve fewer modifications in terms of lines of code (LOC).

Discussion and Implications

The results suggest that ML developers should prioritize addressing technical debt related to Modularity and Data Acquisition to enhance the efficiency of ML development. Researchers are encouraged to explore the impact of SATD on ML model metrics and develop guidelines for managing SATD in ML projects.

Related Work

  • OBrien et al. (2022) developed a taxonomy of ML SATDs and analyzed their characteristics and removal efforts.
  • Kamei et al. (2016) proposed a method using software product metrics to quantify the interest on SATD.
  • Zampetti et al. (2018) conducted an in-depth study on the removal of SATD in Java open-source projects.

Conclusion

This study highlights the importance of addressing technical debt related to Modularity and Data Acquisition in ML projects. By quantifying the effort required to resolve SATDs, we provide actionable insights for ML developers and researchers to prioritize and manage technical debt effectively.

Authors

ml-satd-quantification's People

Contributors

mh-malekpour avatar

Stargazers

Mayra Ruiz avatar Adekunle A. Ajibode avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.