mrshoenel / metrics-as-scores

Contains the data and scripts needed for the application Metrics as Scores

Home Page: https://mrshoenel.github.io/metrics-as-scores/

License: Other

Python 18.05% HTML 0.95% JavaScript 0.01% CSS 0.06% Batchfile 0.01% TeX 1.70% Jupyter Notebook 79.22%


metrics-as-scores's Issues

Functionality

One of the items in the JOSS review criteria #1 is:

Functionality: Have the functional claims of the software been confirmed?

Many have, but I still don't understand some things.

  1. I haven't seen an interface to the goodness-of-fit tests.

[screenshot]

[screenshot]

Do I understand correctly that MAS does not report the results of these tests directly? Instead, after fitting the distributions to the data, several goodness-of-fit tests are run to determine which distribution is best? If so, how are the two-sample tests used? (e.g. generate random data from the fitted distribution and compare that against the real data?) How are the results of multiple tests combined to choose the best distribution? I just need a high-level overview, since I maintain implementations of these functions and wrote scipy.stats.goodness_of_fit. (A sketch of the procedure I am imagining follows this list.)

  2. Exactly how do the data transforms work? For example, does the "expectation" transform simply subtract the sample mean from the sample? (This guess is also covered by the sketch below.)

[screenshot]

  3. I haven't run across these when working with MAS.

[screenshot]

How are they used/accessed?
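
For concreteness, here is a minimal sketch of the mechanics I am imagining for items 1 and 2 (these are my assumptions about the procedure, not confirmed MAS behavior):

```python
# A sketch of what I *think* might be happening: fit a candidate distribution
# by MLE, then either (a) run a one-sample test against the fitted CDF, or
# (b) draw a synthetic sample from the fit and run a two-sample test against
# the real data. None of this is confirmed MAS behavior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)
data = rng.gamma(shape=2.0, scale=3.0, size=500)  # stand-in for a metric sample

params = stats.gamma.fit(data)  # MLE fit

# (a) deterministic: one-sample KS against the fitted CDF
ks1 = stats.kstest(data, stats.gamma(*params).cdf)

# (b) stochastic: two-sample KS against a draw from the fitted distribution
synthetic = stats.gamma.rvs(*params, size=data.size, random_state=rng)
ks2 = stats.ks_2samp(data, synthetic)
print(ks1.statistic, ks2.statistic)

# Item 2 guess: does the "expectation" transform simply center the sample?
centered = data - data.mean()
```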

Citations/References

One of the items in the JOSS review criteria #8 is:

References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Most of this is satisfied, but there are a few places that are questionable - mostly the citations that are included within parentheses already.

For example:
[screenshot]

should probably not be an "author-in-text citation".

I'm not sure, but it looks like you want to do this by adding "ANOVA; " as a "prefix" to the citation:
[screenshot]

rather than explicitly adding parentheses.

Here are some other cases that are questionable.
[screenshot]
[screenshot]
[screenshot]

Please do your best to satisfy the criterion, and if you're not sure what to do, we'll have to ask the editor.

Automated Tests

One of the items in the JOSS review criteria #1 is:

Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?

I see that there are pytest tests, but there are no instructions for executing them.

  • Can they be executed if the package is installed using pip? For instance, import scipy; scipy.test() executes SciPy tests.
  • How should the tests be run if the package is installed for development? For instance, running SciPy's tests is not as simple as executing pytest at the command line; there are developer tools (runtests.py, and now dev.py) that should be used to run the test suite.

Please add documentation so that I can run the test suite. After #2 is complete, I will be able to assess whether the test suite is adequate. Thanks!
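
For example, even a tiny documented entry point like the following would suffice (hypothetical; the tests directory name is my assumption about the repository layout):

```python
# Hypothetical test runner for a development install; the "tests" path is my
# guess at the repository layout, not something I verified.
import sys
import pytest

sys.exit(pytest.main(["-v", "tests"]))
```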

General Editing

  • The adjacent sentences that describe what ANOVA does should probably be combined
    [screenshot]

  • Capitalization after a colon is not appropriate unless the colon introduces a complete sentence. Here, the colon introduces a phrase / incomplete sentence.
    [screenshot]

  • Does "feature" refer to the quantity being measured or the numerical values of the measurement? (I would have thought the former, but then I would ask whether it is really the quantity being transformed; I would think it is the numerical values.) Also, although (unfortunately) "sample" is used to refer to both a single measurement and a collection of measurements, I've never seen "observation" used to refer to a collection of measurements. Consider whether "sample" is more appropriate than "observation" here.
    [screenshot]

  • From a certain perspective, a "unit change" of score would result in a value beyond the [0-1] range. Consider "fixed increment", "given increment", or similar to avoid misunderstanding.
    [screenshot]

  • Is the genetic algorithm here not Pymoo? If so, we need to consider whether the software that implements the genetic algorithm (and all the tests, distributions, etc.) should be cited. We should check with the editor as to whether it is appropriate for me to suggest this since I am biased, but it would be surprising to me if it is appropriate to cite Pymoo and not the other software.
    [screenshot]

  • Typo in "of"
    [screenshot]

  • Please research the use of "suspension hyphenation" to determine whether it is appropriate in these situations.
    [screenshot]
    [screenshot]
    [screenshot]

Meta-issue for mdhaber review of JOSS submission

This is a detailed list of notes corresponding with openjournals/joss-reviews#4913. The checklist below may be modified as the review progresses. I'll create a separate issue for any items that require substantial discussion.

  • Repository: Is the source code for this software available at https://github.com/MrShoenel/metrics-as-scores?
    • Yes, but the paper title in readme.md should match the title of the JOSS submission. Update: I don't remember exactly where I was looking before, but the readme doesn't seem to refer to this paper anymore.
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
    • Yes, GPLv3.
  • Contribution and authorship: Has the submitting author (@MrShoenel) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
    • Yes, @MrShoenel is the only contributor. But in that case, how did the other authors contribute to the project?
  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines?
    • See "Details" for specific considerations. Question for the editor: most of the commits took place over a period of two months, but at least one commit toward the beginning (266d010) suggests that there was substantial work before this. Does this meet the standard that "As a rule of thumb, JOSS’ minimum allowable contribution should represent not less than three months of work for an individual"?
  • Age of software (is this a well-established software project) / length of commit history.
    • Most of the commits occurred in August and September
  • Number of commits.
    • 160
  • Number of authors
    • 1
  • Total lines of code (LOC). Submissions under 1000 LOC will usually be flagged, those under 300 LOC will be desk rejected.
    • Nearly 2000 Python lines in src plus web app. Much of the Python code wraps existing code, though. Are there key algorithmic parts I should look at?
  • Whether the software has already been cited in academic papers.
    • I don't see any
  • Whether the software is sufficiently useful that it is likely to be cited by your peer group.
    • TBD
  • In addition, JOSS requires that software should be feature-complete (i.e., no half-baked solutions)
    • This doesn't look half-baked
  • packaged appropriately according to common community standards for the programming language being used (e.g., Python, R),
    • Yes, it's pip-installable
  • designed for maintainable extension (not one-off modifications of existing tools).
    • yes
  • “Minor utility” packages, including “thin” API clients, and single-function packages are not acceptable.
    • This is not a single-function package
  • Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
    • I don't think there is any original data. The Qualitas.class corpus itself is not claimed as part of this paper. Is this accurate? Update: there are three datasets mentioned in the Readme, and one is mentioned in the paper. All are available and documented by accompanying PDFs.
  • Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
    • I don't think any original results are claimed. Is this accurate? Update: yes. "There are no new results claimed in the JOSS paper, just the dataset as its own publication. (#1 (comment))"
  • Human and animal research: If the paper contains original data or research on human subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
    • pip install metrics-as-scores seems to have completed successfully. I don't see any instructions for testing the installation or running the software locally, though.
    • I did not attempt the "Stand-alone Usage / Development Setup" at first. Update: the development installation worked.
  • Functionality: Have the functional claims of the software been confirmed?
    • Maybe. I would like the authors to list the functional claims of the software concisely before I judge this. Since I am not finding instructions for interacting with the software locally, I am relying on https://metrics-as-scores.ml/webapp. There, I see lots of probability density functions overlaid on the same graph. IIUC, each of them was generated by fitting ~120 distributions to the data and keeping only the best fit. But I have many questions.
      • What does the data represent? I'm still not sure what the Qualitas.class corpus data is. (I'd suggest showing less by default. This is a lot to be confronted with.)
      • There are a few fitting metrics that are listed - which is considered when selecting the best fit? (Looks like KS 2-sample. Why? This will be stochastic, and there are deterministic statistics available; see the sketch after this list.)
      • Are there any claims about the statistical interpretation of the results, or are statistical methods being used for convenience? As an obvious example of what I'm looking for: I don't think the software claims that the observed data were drawn from the fitted distribution, but if it did, I would say that this has not been confirmed because it would be an abuse of goodness-of-fit tests to make such a claim.
      • The paper mentions lots of functionality that does not seem to be demonstrated by the web interface - ANOVA, TukeyHSD, etc.
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)
    • Authors - I don't see anything that I would consider to be a performance claim ("fast", "memory-efficient", "accurate"). Is that correct?
    • Addressed by #1 (comment)
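
To illustrate the deterministic alternative I have in mind, here is a sketch (the candidate list is illustrative, and this is not a claim about how MAS actually selects distributions):

```python
# Sketch of deterministic model selection: fit several candidate
# distributions by MLE and rank them by the one-sample KS statistic, which
# is a deterministic function of the data (unlike a two-sample test against
# a random draw from the fit). The candidates here are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.lognormal(mean=1.0, sigma=0.5, size=300)

candidates = [stats.norm, stats.lognorm, stats.gamma, stats.expon]
results = []
for dist in candidates:
    params = dist.fit(data)
    ks = stats.kstest(data, dist(*params).cdf)
    results.append((ks.statistic, dist.name))

best_statistic, best_name = min(results)
print(sorted(results), "best:", best_name)
```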

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
    • It is stated, but IMO the language used is too abstract to be easily interpreted by a general audience or even a computational statistics developer (myself). I would suggest adding a very concrete example to demonstrate what is meant by the key terms "raw data", "metrics", "scores", "distance", and "context". I think that in a domain-independent statistics context, I would call "raw data" -> "sample(s)", "metric" -> "statistic" (because I do not think it satisfies the mathematical definition of a "metric"), and "scores" -> something related to a CDF fitted either parametrically or nonparametrically. I'm confident that I understand "distance" and "context" correctly, though.
    • Is there a way to distinguish between "metric" as a mathematical function and "metric" as the numerical value that function assumes given particular "raw data"? Similar question for "score".
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
    • Maybe? The example website seems to show how the software works on real-world data, but I don't understand what that data is. The "Use your own data" section might satisfy this criterion, but IMO it should include a much simpler example with a minimum number of "raw data" samples, and it should show each of the claimed features individually (e.g. empirical distribution, KDE, MLE, ANOVA, TukeyHSD); a sketch of the scale I have in mind follows this list. See #5.
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
    • I don't see any rendered API documentation, and AFAICT, none of the Python code has docstrings. See #2.
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
    • There are some tests. Pytest is not listed as a dependency, and there are no instructions for running the tests. I am not sure whether the tests are adequate because there is no documentation for public functions. See #4.
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support.
    • I don't see any. See #3. Update: this has been resolved.
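
To make the "Example usage" suggestion above concrete, this is the scale of example I have in mind, using synthetic data and plain SciPy calls rather than MAS's actual API:

```python
# Sketch of a minimal synthetic example touching each claimed feature
# separately. Everything here is made-up data; it is not a demonstration of
# MAS's actual interface.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
groups = [rng.normal(loc=mu, size=50) for mu in (0.0, 0.3, 1.0)]

ecdf_at_0 = np.mean(groups[0] <= 0.0)    # empirical distribution (ECDF) at a point
kde = stats.gaussian_kde(groups[0])      # kernel density estimate
loc, scale = stats.norm.fit(groups[0])   # parametric fit by MLE
anova = stats.f_oneway(*groups)          # one-way ANOVA across groups
tukey = stats.tukey_hsd(*groups)         # Tukey's HSD post hoc comparison

print(ecdf_at_0, kde(0.0), (loc, scale), anova.pvalue)
print(tukey)
```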

Software paper

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
    • There is a summary, but I don't think it's clear to non-specialists. A simple, concrete example would help. I'll link to a separate issue about this.
    • This has been improved.
  • A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
    • I think that once the introduction is more accessible, the statement of need will satisfy this criterion with minor adjustments.
    • This has been improved.
  • State of the field: Do the authors describe how this software compares to other commonly-used packages?
    • No, I don't think so. See #9
  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
    • Yes, mostly. I can make an issue with copyediting suggestions when the paper is closer to its final form.
  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?
    • It is extensive. I'll have to check for citation syntax.
    • See #8

Community Guidelines

One of the items in the JOSS review criteria #1 is:

Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

I don't see any of this information in the repository.
Please add these items to satisfy this criterion (or point me to the items I am missing). Thanks!

State of the Field

One of the items in the JOSS review criteria #1 is:

State of the field: Do the authors describe how this software compares to other commonly-used packages?

I didn't find this information on my first read of the paper, nor in a recent re-read. If this description is present, it may need to be emphasized; if it is not present, it needs to be added.

API Documentation

One of the items in the JOSS review criteria #1 is:

Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?

I don't see any rendered API documentation or even docstrings in the code.
Please add detailed documentation to satisfy this criterion (or point me to the documentation I am missing). Thanks!
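
To illustrate the level of detail I am asking for, here is a numpydoc-style docstring on a hypothetical function (the function and its behavior are made up for illustration, not taken from MAS):

```python
def score(value: float, cdf) -> float:
    """Map a metric value to a score in [0, 1] using a fitted CDF.

    Parameters
    ----------
    value : float
        The observed metric value.
    cdf : callable
        Cumulative distribution function of the fitted distribution.

    Returns
    -------
    float
        The score, i.e. ``cdf(value)``.
    """
    return cdf(value)
```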

Example Usage

One of the items in the JOSS review criteria #1 is:

Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).

The software appears to have been developed primarily to analyze "Qualitas.class". I don't believe that satisfies this criterion because:

  • This analysis appears to be too complex to describe in a step-by-step form
  • The criterion suggests that there should be "examples" (plural), and it appears that the software has only been tested on the "Qualitas.class" dataset.

Can you develop a simpler example usage tutorial?

If I understand the key terms correctly, perhaps the following scenario is applicable. Note that words (e.g. metric) are used here in the colloquial sense; it is not necessarily correct to interpret these words according to the definitions used in the context of this software.

I teach a senior design class in which several teams of students designed and built small cranes for lifting and moving heavy loads (e.g. 2000 lbf to a height of 10 ft and within a ~10 ft radius). We wanted to compare the achievement of different cranes based on their performance in a series of standardized tests: maximum load lifted, maximum height, ease of use, etc. After the cranes completed the tests, we ranked the teams' performance in each test separately. For instance, suppose three teams competed in the "maximum load lifted" test, and the max loads they lifted were:

  • Team 1: 1600 lb
  • Team 2: 2100 lb
  • Team 3: 1850 lb

We ranked the teams' performance in each test (e.g. in the maximum load test, Team 2 ranked 1st, Team 3 2nd, and Team 1 3rd). After ranking all tests in this way, I observed that one team (say Team 2) ranked 1st in four tests, 2nd in one test, and 3rd in one test. Subjectively, I determined that Team 2 was the overall winner based on these rankings.

Such a procedure is probably fine for such a small, informal competition (in which there were only 6 teams and 6 tests). However, I imagined that techniques from statistics might be used to improve the fairness of my competition. For instance, suppose that teams from every mechanical engineering department in the country were to design cranes and perform the same set of tests. In this case, I would have a fairly good estimate of the distribution of values that might be achieved in a maximum load test (comparing against a relevant population). The next year, if teams from my department were to design cranes and compete in the same test, I would be able to use the percentile of the maximum load lifted by the crane as a score. Effectively, I'd be ranking them against a large population of relevant teams rather than merely ranking them w.r.t. the five other teams in my department. The scores in all six tests would now be on a standardized scale (0 - 1), so perhaps then I could come up with an objective criterion to aggregate the scores from multiple tests in order to choose an overall winner of the competition.
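
In code, the scoring step I am imagining might look like the following (all numbers are synthetic, and this is only my mental model, not necessarily how the software defines scores):

```python
# Score each team's lift against a hypothetical national population of
# max-load results: the score is simply the empirical percentile, so every
# test lands on a common [0, 1] scale. All numbers here are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
national_loads = rng.normal(loc=1900, scale=250, size=5000)

my_teams = {"Team 1": 1600, "Team 2": 2100, "Team 3": 1850}
scores = {team: stats.percentileofscore(national_loads, load) / 100
          for team, load in my_teams.items()}
print(scores)  # Team 2 should score highest on the common scale
```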

  1. What relationships, if any, can you see between your software and this scenario? (If none, perhaps we need a different scenario for the next two questions.)
  2. Can you help me interpret the capabilities of this software in terms of this scenario or a similarly simple, concrete scenario?
  3. Can you add documentation demonstrating the usage of key features of this software in such a context (even with synthetic raw data)?

Thanks!

TUI fails to open web app with external process

I tried out the TUI. When I ran the web app as an external process, the GUI did not automatically show up in my browser. I tried browsing to http://localhost:5678/webapp, but the site couldn't be reached. When I quit, I saw an error:

An error (FileNotFoundError) has occurred: [WinError 2] The system cannot find the file specified

Restarting main menu.

What other information would you need to fix this?
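
For what it's worth, a FileNotFoundError with WinError 2 from subprocess usually means the executable could not be found on PATH. The sketch below shows the kind of defensive launch I mean; this is only my guess at the failure mode, and the executable name is a placeholder, not MAS's actual code:

```python
# My guess at the failure mode (not a diagnosis of MAS's actual code): on
# Windows, WinError 2 from subprocess usually means the executable was not
# found on PATH. Resolving it explicitly makes the error actionable.
# "bokeh" is a placeholder for whatever binary MAS spawns.
import shutil
import subprocess

exe = shutil.which("bokeh")
if exe is None:
    raise RuntimeError("'bokeh' not found on PATH; is the environment activated?")
subprocess.Popen([exe, "serve", "webapp", "--port", "5678"])
```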

Meta-issue for kostiantyn-kucher review of JOSS submission

This is an issue created for my review of the JOSS manuscript, as discussed at openjournals/joss-reviews#4913

I am glad to state that I can cross off all the items from the review checklist (please take my previous comments regarding the potential conflicts of interest into account, though), but I have some additional comments as part of my review that I would like to mention next.

In my opinion, the proposed tool indeed supports the functionality claimed in the documentation and the JOSS manuscript, while the underlying methodology and its validity (especially in the context of software quality metrics) are discussed in detail in the prior publication by the authors (Hönel et al., 2022). The statement of need, the general description of the tool, and information regarding installation, license, and contribution procedures are provided within the manuscript and the repository, respectively. I also tried downloading, building, running the tests, and using the tool locally with the Iris data set, with my experience corresponding to the expectations set by the documentation (the web application currently deployed online at https://metrics-as-scores.ml/ is also functioning as expected with the Qualitas.class data set).

Still, I have several further comments and suggestions—mainly for the JOSS manuscript, but also for the implementation:

  1. Given my background in information visualization and visual analytics, I am currently missing any sort of motivation for the design choices regarding the interactive interface (see the section “MAS – The Interactive Application”). The basic question here is: why are line plots used as the visual representations of distributions? The answer for this could be quite straightforward (related to conventions, authors’ and target users’ expectations, prior work, pragmatic considerations, etc.), but in my opinion, such motivation should appear in the manuscript.

  2. The manuscript provides a glimpse into the initial motivation behind designing this approach and applying it within the context of software quality metrics research; however, since MAS is positioned as a more general tool, I would strongly suggest extending the “Applications” section within the manuscript with (at least) the list of the currently included/supported data sets (including Iris, etc.), ideally with a very brief discussion of why and how would these data sets be analyzed within MAS.

  3. With respect to related work, I would suggest extending the discussion with the following studies and tools:

  4. Several notes related not only to the manuscript, but also to the implementation and repository documentation (these could be addressed directly or considered as part of future work, for instance):
    4.1) While the current Bokeh distribution plot implementation is quite straightforward, I would strongly suggest extending the contents of the tooltip (even visible as part of Figure 1) with the group/domain value (currently this must be checked separately within the main plot legend, which would not work that well when a large number of groups, and thus colors/hues, is present); see the Bokeh sketch after this list.
    4.2) In the “Applications” section, the following is mentioned: “In addition, some of the software metrics in the corpus are never similar across application domains and must be applied with great care when used in quality models (Hönel et al., 2022).” — while this is mainly relevant to the particular software quality data set, I would suggest mentioning such undesirable choices briefly as part of the data set description within the tool UI (around the “Description” section under “Loaded Dataset”).

  5. Minor presentation issues and potential improvements within the manuscript:
    5.1) line 17: “allows to assess” — please see https://english.stackexchange.com/a/196130 , https://english.stackexchange.com/q/60271 , and https://ell.stackexchange.com/q/11193
    5.2) lines 39+: I would suggest a brief addition at the end of the first section (just one or two sentences) that would provide the readers with an overall idea for how to install and run the tool (i.e., installation with common Python command-line tools + an interactive terminal app + a web-based interactive exploration tool)
    5.3) line 75: “which provides access to the PDF/PMF, CDF/CCDF (for scores), and the PPF” — please expand the abbreviations/acronyms on first use
    5.4) lines 79–80: “Cramér–von Mises (Cramér, 1928) and Kolmogorov–Smirnov one-sample (Stephens, 1974) tests, Cramér–von Mises (Anderson, 1962), …” — please check if two different references should indeed be used for Cramér–von Mises tests
    5.5) Please check the bibliography carefully to ensure a consistent style, especially with respect to the capitalization within the publication titles (“Deriving metric thresholds from benchmark data” vs “On the Distribution of the Two-Sample Cramer-von Mises Criterion”; notice “Pymoo: Multi-objective optimization in python” with lowercased “python”, among others…) as well as consistent publication venue titles (“26th IEEE International Conference on Software Maintenance (ICSM 2010)” vs “Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering”, for instance).
    5.5.1) Please double-check the author names: “Carleton, W. A., Anita D.; Florac.”
    5.5.2) Please double-check the author names: “John M. Chambers, R. M. H., Anne E. Freeny” (surnames mentioned first across the currently used bibliographical style).
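
Regarding 4.1, here is a sketch of what I mean using Bokeh's hover tool (the column names are hypothetical, not MAS's actual data-source fields):

```python
# Sketch for suggestion 4.1: include the group/domain value in the Bokeh
# hover tooltip so it need not be looked up in the legend. The column names
# here are hypothetical, not MAS's actual data-source fields.
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

source = ColumnDataSource(dict(x=[0, 1, 2], y=[0.1, 0.4, 0.2],
                               group=["Domain A"] * 3))
p = figure()
p.line("x", "y", source=source)
p.add_tools(HoverTool(tooltips=[("group", "@group"),
                                ("x", "@x"),
                                ("density", "@y")]))
```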
