mrshoenel / metrics-as-scores

Contains the data and scripts needed for the application Metrics as Scores
Home Page: https://mrshoenel.github.io/metrics-as-scores/
License: Other
One of the items in the JOSS review criteria #1 is:
Functionality: Have the functional claims of the software been confirmed?
Many have, but I still don't understand some things.
Do I understand correctly that MAS does not report the results of these tests directly? Instead, after fitting the distributions to the data, several goodness-of-fit tests are run to determine which distribution is best? If so, how are the two-sample tests used (e.g. by generating random data from the fitted distribution and comparing that against the real data)? How are the results of multiple tests combined to choose the best distribution? I just need a high-level overview, since I maintain implementations of these functions and wrote scipy.stats.goodness_of_fit.
How are they used/accessed?
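To make my question concrete, here is how I imagine a two-sample test might be used; this is a sketch of my guess using SciPy, not necessarily what MAS actually does (the normal distribution and the sample sizes are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # stand-in for real observations

# Fit a candidate distribution to the data.
params = stats.norm.fit(data)
fitted = stats.norm(*params)

# One-sample test: compare the data against the fitted CDF directly.
ks_one = stats.ks_1samp(data, fitted.cdf)

# Two-sample test: draw a synthetic sample from the fitted distribution
# and compare it against the real data.
synthetic = fitted.rvs(size=200, random_state=rng)
ks_two = stats.ks_2samp(data, synthetic)

print(ks_one.pvalue, ks_two.pvalue)
```

Is something along these lines what happens, with the p-values (or statistics) of several such tests then combined to rank the candidate distributions?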
One of the items in the JOSS review criteria #8 is:
References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?
Most of this is satisfied, but there are a few places that are questionable - mostly the citations that are included within parentheses already.
should probably not be an "author-in-text citation".
I'm not sure, but it looks like you could do this by adding "ANOVA; " as a "prefix" to the citation, rather than explicitly adding parentheses.
Here are some other cases that are questionable.
Please do your best to satisfy the criterion, and if you're not sure what to do, we'll have to ask the editor.
One of the items in the JOSS review criteria #1 is:
Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
I see that there are pytest tests, but there are no instructions for executing them. For comparison: `import scipy; scipy.test()` executes SciPy's tests, and simply invoking `pytest` at the command line is not sufficient there, because there are developer tools `runtests.py` and now `dev.py` that should be used to run the test suite.

Please add documentation so that I can run the test suite. After #2 is complete, I will be able to assess whether the test suite is adequate. Thanks!
The adjacent sentences that describe what ANOVA does should probably be combined.
Capitalization after a colon is not appropriate unless the colon introduces a complete sentence. Here, the colon introduces a phrase / incomplete sentence.
Does "feature" refer to the quantity being measured or the numerical values of the measurement? (I would have thought the former, but then I would ask whether it is really the quantity being transformed; I would think it is the numerical values.) Also, although (unfortunately) "sample" is used to refer to both a single measurement and a collection of measurements, I've never seen "observation" to be used to refer to a collection of measurements. Consider whether "sample" is more appropriate than "observation" here.
From a certain perspective, a "unit change" of score would result in a value beyond the [0, 1] range. Consider "fixed increment", "given increment", or similar to avoid misunderstanding.
Is the genetic algorithm here not Pymoo? If so, we need to consider whether the software that implements the genetic algorithm (and all the tests, distributions, etc.) should be cited. We should check with the editor as to whether it is appropriate for me to suggest this since I am biased, but it would be surprising to me if it is appropriate to cite Pymoo and not the other software.
Please research the use of "suspension hyphenation" to determine whether it is appropriate in these situations.
This is a detailed list of notes corresponding with openjournals/joss-reviews#4913. The checklist below may be modified as the review progresses. I'll create a separate issue for any items that require substantial discussion.
The readme.md title should match the title of the JOSS submission. Update: I don't remember exactly where I was looking before, but the readme doesn't seem to refer to this paper anymore.

`pip install metrics-as-scores` seems to have completed successfully. I don't see any instructions for testing the installation or running the software locally, though.

One of the items in the JOSS review criteria #1 is:
Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
I don't see any of this information in the repository.
Please add these items to satisfy this criterion (or point me to the items I am missing). Thanks!
One of the items in the JOSS review criteria #1 is:
State of the field: Do the authors describe how this software compares to other commonly-used packages?
I didn't find this information on my first read of the paper, nor in a recent re-read. If this description is present, it may need to be emphasized; if it is not present, it needs to be added.
One of the items in the JOSS review criteria #1 is:
Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
I don't see any rendered API documentation or even docstrings in the code.
Please add detailed documentation to satisfy this criterion (or point me to the documentation I am missing). Thanks!
One of the items in the JOSS review criteria #1 is:
Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems)?
The software appears to have been developed primarily to analyze "Qualitas.class". I don't believe that satisfies this criterion because:
Can you develop a simpler example usage tutorial?
If I understand the key terms correctly, perhaps the following scenario is applicable. Note that words (e.g. "metric") are used here in the colloquial sense; it is not necessarily correct to interpret them according to the definitions used in the context of this software.
I teach a senior design class in which several teams of students designed and built small cranes for lifting and moving heavy loads (e.g. 2000 lbf to a height of 10 ft and within a ~10 ft radius). We wanted to compare the achievement of different cranes based on their performance in a series of standardized tests: maximum load lifted, maximum height, ease of use, etc. After the cranes completed the tests, we ranked the teams' performance in each test separately. For instance, suppose three teams competed in the "maximum load lifted" test, and the max loads they lifted were:
- Team 1: 1600 lb
- Team 2: 2100 lb
- Team 3: 1850 lb
We ranked the teams' performance in each test (e.g. in the maximum load test, Team 2 ranked 1st, Team 3 2nd, and Team 1 3rd). After ranking all tests in this way, I observed that one team (say Team 2) ranked 1st in four tests, 2nd in one test, and 3rd in one test. Subjectively, I determined that Team 2 was the overall winner based on these rankings.

Such a procedure is probably fine for such a small, informal competition (in which there were only 6 teams and 6 tests). However, I imagined that techniques from statistics might be used to improve the fairness of my competition. For instance, suppose that teams from every mechanical engineering department in the country were to design cranes and perform the same set of tests. In this case, I would have a fairly good estimate of the distribution of values that might be achieved in a maximum load test (comparing against a relevant population). The next year, if teams from my department were to design cranes and compete in the same test, I would be able to use the percentile of the maximum load lifted by each crane as a score. Effectively, I'd be ranking them against a large population of relevant teams rather than merely ranking them w.r.t. the five other teams in my department. The scores in all six tests would now be on a standardized scale (0 - 1), so perhaps then I could come up with an objective criterion to aggregate the scores from multiple tests in order to choose an overall winner of the competition.
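If my reading of the tool's idea is right, the percentile-as-score computation in this scenario might look like the following sketch (the population distribution and its parameters are entirely made up for illustration):

```python
from scipy import stats

# Hypothetical: max loads (lb) lifted by cranes across a large national
# population of teams, assumed roughly normal with made-up parameters.
population = stats.norm(loc=1800, scale=250)

# My three teams' max loads from the example above:
loads = {"Team 1": 1600, "Team 2": 2100, "Team 3": 1850}

# A team's score is the fraction of the population it outperforms,
# i.e. the CDF evaluated at its measurement -- a value in [0, 1].
scores = {team: population.cdf(load) for team, load in loads.items()}
for team, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{team}: score = {score:.2f}")
```

Because every test's score would land on the same [0, 1] scale, aggregating across tests (e.g. by averaging) becomes at least plausible, which is exactly what my subjective ranking procedure lacked.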
Thanks!
I tried out the TUI. When I tried to run the web app as an external process, the GUI did not automatically show up in my browser. I tried browsing to http://localhost:5678/webapp, but the site couldn't be reached. When I quit, I saw an error:
An error (FileNotFoundError) has occurred: [WinError 2] The system cannot find the file specified
Restarting main menu.
What other information would you need to fix this?
This is an issue created for my review of the JOSS manuscript, as discussed at openjournals/joss-reviews#4913
I am glad to state that I can cross off all the items from the review checklist (please take my previous comments regarding the potential conflicts of interest into account, though), but I have some additional comments as part of my review that I would like to mention next.
In my opinion, the proposed tool indeed supports the functionality claimed in the documentation and the JOSS manuscript, while the underlying methodology and its validity (especially in the context of software quality metrics) are discussed in detail in the prior publication by the authors (Hönel et al., 2022). The statement of need, the general description of the tool, and information regarding installation, license, and contribution procedures are provided within the manuscript and the repository, respectively. I also tried downloading, building, running the tests, and using the tool locally with the Iris data set, with my experience corresponding to the expectations set by the documentation (the web application currently deployed online at https://metrics-as-scores.ml/ is also functioning as expected with the Qualitas.class data set).
Still, I have several further comments and suggestions—mainly for the JOSS manuscript, but also for the implementation:
Given my background in information visualization and visual analytics, I am currently missing any sort of motivation for the design choices regarding the interactive interface (see the section “MAS – The Interactive Application”). The basic question here is: why are line plots used as the visual representations of distributions? The answer for this could be quite straightforward (related to conventions, authors’ and target users’ expectations, prior work, pragmatic considerations, etc.), but in my opinion, such motivation should appear in the manuscript.
The manuscript provides a glimpse into the initial motivation behind designing this approach and applying it within the context of software quality metrics research; however, since MAS is positioned as a more general tool, I would strongly suggest extending the “Applications” section within the manuscript with (at least) the list of the currently included/supported data sets (including Iris, etc.), ideally with a very brief discussion of why and how these data sets would be analyzed within MAS.
With respect to related work, I would suggest extending the discussion with the following studies and tools:
Several notes related not only to the manuscript, but also the implementation and repository documentation (could be addressed directly or considered as part of future work, for instance):
4.1) While the current Bokeh distribution plot implementation is quite straightforward, I would strongly suggest extending the contents of the tooltip (even visible as part of Figure 1) with the group/domain value (currently must be checked separately within the main plot legend, and would not work that well when a large number of groups and thus colors/hues is present).
4.2) In the “Applications” section, the following is mentioned: “In addition, some of the software metrics in the corpus are never similar across application domains and must be applied with great care when used in quality models (Hönel et al., 2022).” — while this is mainly relevant to the particular software quality data set, I would suggest mentioning such undesirable choices briefly as part of the data set description within the tool UI (around the “Description” section under “Loaded Dataset”).
Minor presentation issues and potential improvements within the manuscript:
5.1) line 17: “allows to assess” — please see https://english.stackexchange.com/a/196130, https://english.stackexchange.com/q/60271, and https://ell.stackexchange.com/q/11193
5.2) lines 39+: I would suggest a brief addition at the end of the first section (just one or two sentences) that would provide the readers with an overall idea for how to install and run the tool (i.e., installation with common Python command-line tools + an interactive terminal app + a web-based interactive exploration tool)
5.3) line 75: “which provides access to the PDF/PMF, CDF/CCDF (for scores), and the PPF” — please expand the abbreviations/acronyms on first use
5.4) lines 79–80: “Cramér–von Mises (Cramér, 1928) and Kolmogorov–Smirnov one-sample (Stephens, 1974) tests, Cramér–von Mises (Anderson, 1962), …” — please check if two different references should indeed be used for Cramér–von Mises tests
5.5) Please check the bibliography carefully for ensuring a consistent style, especially with respect to the capitalization within the publication titles (“Deriving metric thresholds from benchmark data” vs “On the Distribution of the Two-Sample Cramer-von Mises Criterion”; notice “Pymoo: Multi-objective optimization in python” with lowercased “python”, among others…) as well as consistent publication venue titles (“26th IEEE International Conference on Software Maintenance (ICSM 2010)” vs “Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering”, for instance).
5.5.1) Please double-check the author names: “Carleton, W. A., Anita D.; Florac.”
5.5.2) Please double-check the author names: “John M. Chambers, R. M. H., Anne E. Freeny” (surnames mentioned first across the currently used bibliographical style).
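As an aside on item 5.3, for readers unfamiliar with the abbreviations: they correspond to standard distribution accessors (illustrated here with scipy.stats, which I believe MAS builds on; the relationships shown are general, the standard normal is just an example):

```python
from scipy import stats

d = stats.norm()   # standard normal as an example distribution
x = 1.0

pdf = d.pdf(x)     # Probability Density Function (PMF for discrete dists)
cdf = d.cdf(x)     # Cumulative Distribution Function: P(X <= x)
ccdf = d.sf(x)     # Complementary CDF (survival function): P(X > x)
ppf = d.ppf(cdf)   # Percent-Point Function: the inverse of the CDF

assert abs(cdf + ccdf - 1.0) < 1e-12   # CDF and CCDF sum to one
assert abs(ppf - x) < 1e-9             # PPF inverts the CDF
```

Spelling these out once in the manuscript would make the “scores” discussion self-contained for readers outside statistics.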
I tried running the web app as an internal application server from the TUI. The web app popped up in my browser and seemed to work correctly. When I pressed `q` and `<enter>` in the TUI to close it, I got an infinite loop of errors.
Here's a short video.
What other information would you need to debug this?