Comments (22)
@MaJuRG Yes. Since the number of files that can be added is unbounded, percentages above 100% are likely to occur.
I would make a PR tonight. :)
from deltacode.
@arnav-mandal1234 yes, that is OK
from deltacode.
[@johnmhoran comment] @MaJuRG Couple of questions:
- Do we want these data points as additional individual columns in the CSV output?
- The current perspective for the comparison is: what needs to be done to the `old` scan to generate the `new` scan. However, some % stats may be more informative looking at the `old` scan and others at the `new` scan.
  - If I'm thinking about `removed`, I believe I want % of `old` scan files removed.
  - For `added` -- % of `new` scan files added.
  - `modified` -- % of `new` scan files.
  - `unchanged` -- % of `new` scan files.
Do I have the perspectives correct? Might we want to provide % for the 4 categories measured against both `old` and `new` in case a user might consider that data informative?
from deltacode.
[@MaJuRG comment] 1) No, we just want to add this info to our stats dictionary for now.
2) Think about it in terms of `len(matches)`, where `matches` is our list of Added, Match, Removed (aka, the total number of files).
Then % added would be:
```python
count_files = len(matches)
# count_files - count_removed is the number of files in the new scan
percent_added = count_added / (count_files - count_removed)
# count_files - count_added is the number of files in the old scan
percent_removed = count_removed / (count_files - count_added)
# ... and so on
```
My math may be wrong on these, but I think you are thinking of it correctly.
from deltacode.
[@johnmhoran comment] OK. Thanks.
from deltacode.
[@MaJuRG comment] Clearly, this will require some tests to make sure our math is correct 😀
from deltacode.
[@johnmhoran comment] 👍
from deltacode.
[@johnmhoran comment] @MaJuRG Do we want the percents to be rounded rather than the full floating-point length, e.g., `0.111` rather than `0.1111111111111111`?
from deltacode.
[@MaJuRG comment] yes
from deltacode.
[@johnmhoran comment] @MaJuRG Note that if we calculate, for example, like this:
```python
count_files = len(deltas)
percent_added = round(float(added) / (count_files - removed), 3)
percent_modified = round(float(modified) / (count_files - added), 3)
percent_removed = round(float(removed) / (count_files - added), 3)
percent_unchanged = round(float(unchanged) / (count_files - added), 3)
```
we get the percentages from the perspectives I think we seek, but the total may add up to more than 100%.
from deltacode.
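The effect described above can be checked with a small worked example. This is a minimal sketch, assuming Python 3 division and made-up counts chosen only for illustration; the names `old_count` and `new_count` are not from the thread.

```python
# Hypothetical counts, chosen only for illustration.
added, modified, removed, unchanged = 5, 2, 3, 10

count_files = added + modified + removed + unchanged  # 20 deltas in total
old_count = count_files - added    # 15 files in the old codebase
new_count = count_files - removed  # 17 files in the new codebase

# added is measured against the new codebase, the rest against the old one.
percent_added = round(added / new_count, 3)
percent_modified = round(modified / old_count, 3)
percent_removed = round(removed / old_count, 3)
percent_unchanged = round(unchanged / old_count, 3)

total = percent_added + percent_modified + percent_removed + percent_unchanged
print(total)  # greater than 1.0 because two different denominators are mixed
```

The three old-base ratios already sum to roughly 1.0 on their own, so adding the new-base ratio for `added` always pushes the total past 100% whenever any file was added.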
[@pombredanne comment] @johnmhoran
Float operations may not add up exactly, especially when rounded.
In general you should not round before computing, but only after and when you need to return values for output. And round to keep two digits, not three. You want to normalize these values to 100 too. And avoid repetition of computations with a function to make this readable.
Also use `from __future__ import division` at the top, so you will have true division and no need to wrap things in a float.
And IMHO you should compute the portions of added/removed/etc ... not against the delta but against the old or new files count. Proportions of deltas do not seem very useful as a stat? I would even argue that percentages are misleading and just returning totals may be enough and much simpler for now....
- An added or unchanged or changed ratio only makes sense against old or new.
- Added only makes sense against new.
- Removed only makes sense against old.
e.g. it is easy to create misleading ratios: what help can ratios bring? Total change counts are not misleading as their base is clearly always the old. And they are a clear indicator of the magnitude of the change (and the possible amount of work that will be needed to re-analyze)...
So if you really really want to compute these percentages, this may look like this...
```python
from __future__ import division

# [.....]

def percent(value, total):
    """
    Return the rounded value percentage of total.
    """
    ratio = (value / total) * 100
    return round(ratio, 2)

old_files_count = len(old_files)
new_files_count = len(new_files)

# I assume here that added, modified, etc are counts and not lists
added_from_old = percent(added, old_files_count)
added_from_new = percent(added, new_files_count)

modified_from_old = percent(modified, old_files_count)
modified_from_new = percent(modified, new_files_count)

removed_from_old = percent(removed, old_files_count)

unchanged_from_old = percent(unchanged, old_files_count)
# or to ensure that the total always adds up to 100%
# unchanged_from_old = 100 - added_from_old - modified_from_old - removed_from_old
unchanged_from_new = percent(unchanged, new_files_count)
```
from deltacode.
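The "normalize to 100" idea can be illustrated by deriving the last percentage as a remainder. This is only a sketch: it assumes Python 3, and it assumes that the old codebase consists of exactly the modified, removed and unchanged files (added files exist only in the new codebase, so they are not subtracted here). The function name `percents_of_old` is hypothetical.

```python
def percents_of_old(modified, removed, unchanged):
    """
    Return percentages of the old file count that sum to 100.
    The unchanged share is computed as the remainder, so rounding
    errors cannot make the total drift away from 100.
    """
    old_count = modified + removed + unchanged
    if old_count == 0:
        # Nothing to compare against; avoid dividing by zero.
        return {"modified": 0.0, "removed": 0.0, "unchanged": 0.0}

    modified_pct = round(modified / old_count * 100, 2)
    removed_pct = round(removed / old_count * 100, 2)
    unchanged_pct = round(100 - modified_pct - removed_pct, 2)
    return {
        "modified": modified_pct,
        "removed": removed_pct,
        "unchanged": unchanged_pct,
    }


print(percents_of_old(1, 1, 1))
```

With three equal counts, rounding each share independently would give 33.33 three times (99.99 in total); taking the remainder assigns the missing 0.01 to the unchanged share instead.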
[@mjherzog comment] It is essential human nature to want to convert the stats to percentages - percentages are no more/less misleading than the stats but the percentage definitions are actually helpful in pointing to which comparisons might be meaningful.
from deltacode.
[@johnmhoran comment] @MaJuRG @pombredanne @mjherzog We seem to have a disagreement about whether we want to collect/calculate percentages and, if so, how they should be calculated.
In the current commit, if my understanding is correct, `count_files` is the total number of `added`, `modified`, `removed` and `unchanged` files. `count_files - added` is the file count for the old codebase, and `count_files - removed` is the file count for the new codebase. Thus, we currently calculate % for `modified`, `removed` and `unchanged` against the old codebase, and for `added`, against the new codebase. That seems most useful to me, but I could be wrong.
Perhaps the simplest approach would be to calculate a set of percentages against the old codebase, and another against the new, label them clearly, and let the user decide what he or she finds most helpful.
How would you like me to proceed with this issue?
from deltacode.
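The "two clearly labeled sets" option could look roughly like the following. It is a sketch under assumptions: Python 3, plain integer counts as input, and hypothetical names (`percent_sets`, `percent_of_old`, `percent_of_new`) that do not come from the DeltaCode code base.

```python
def percent_sets(added, modified, removed, unchanged):
    """
    Return two labeled sets of percentages: one measured against the
    old codebase and one against the new codebase.
    """
    count_files = added + modified + removed + unchanged
    old_count = count_files - added    # removed + modified + unchanged
    new_count = count_files - removed  # added + modified + unchanged

    def pct(value, total):
        # None signals "no base to measure against" (empty codebase).
        return round(value / total * 100, 2) if total else None

    return {
        "percent_of_old": {
            "modified": pct(modified, old_count),
            "removed": pct(removed, old_count),
            "unchanged": pct(unchanged, old_count),
        },
        "percent_of_new": {
            "added": pct(added, new_count),
            "modified": pct(modified, new_count),
            "unchanged": pct(unchanged, new_count),
        },
    }


stats = percent_sets(added=5, modified=2, removed=3, unchanged=10)
print(stats["percent_of_old"])
print(stats["percent_of_new"])
```

Each set only contains categories that exist in its base (no `added` against old, no `removed` against new), so each set sums to about 100% up to rounding.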
[@mjherzog comment] I think that the use case definition comes back to us more than other users since even we struggle a bit with the definitions. I think that we can focus on one direction to start which is new -> old. And we want the percentages to always add up to 100%.
So we have two simple percentages - unchanged and modified with the new filecount as denominator.
Added/Removed is much less clear since it is probably some combination of Moved, Added and Removed. So I think that we can defer percentage calculations for now, since they are pretty easy to do from the stats in a spreadsheet, and focus on what we can start doing to identify Moved within the Added/Removed bucket.
from deltacode.
This may be a semantic detail, but I find the use of "unchanged" and "modified" confusing. I think that the terminology should be "changed"/"unchanged" or "modified"/"unmodified".
from deltacode.
@mjherzog agreed, I prefer the modified/unmodified approach personally.
from deltacode.
I suggest that we defer this idea and focus on the data outputs - this is not hard to calculate in a spreadsheet and there the user can choose whether to calculate New vs Old or vice versa.
from deltacode.
Can I work on this? @johnmhoran @MaJuRG @pombredanne
from deltacode.
@arnav-mandal1234 sure
from deltacode.
@MaJuRG @pombredanne @mjherzog @johnmhoran after going through the whole discussion above, I think we should compute all the percentages with count(old_files) as the base. The basic idea of the tool is to compare old to new, so users would generally like to see what has changed in new_directory relative to old_directory. What are your opinions?
Please guide me if I am wrong.
from deltacode.
@arnav-mandal1234 Yes, you are correct. We would want the base to be `count(old_files)` for these calculations. If the `new` directory has many more files added, percentages will be `> 100%`, but I think that is what we want to see anyway.
from deltacode.
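A quick illustration of why an old-file base can exceed 100%: the number of added files is not bounded by the old file count. The numbers and the helper name `percent_of_old` below are made up for the example, and Python 3 division is assumed.

```python
def percent_of_old(count, old_files_count):
    """Return count as a percentage of the old file count."""
    return round(count / old_files_count * 100, 2)


old_files_count = 10  # hypothetical: the old directory has 10 files
added = 25            # hypothetical: 25 files exist only in the new directory

print(percent_of_old(added, old_files_count))  # 250.0
```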
@MaJuRG This can be done in many ways.
I selected this approach:
- Make a class "stats" in `__init__.py`, just like the Delta class.
- Then initialize an object of the stats class in the DeltaCode class.
- Proceed to make the following changes.
Is that okay?
from deltacode.
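The proposed approach might be sketched as below. Everything here is an assumption made for illustration: the class name `Stats`, its attributes and the `to_dict` method are hypothetical and are not taken from the actual DeltaCode code, and the example uses Python 3 division. It follows the decision above to use the old file count as the base for every percentage.

```python
class Stats(object):
    """
    Hypothetical container for delta counts, with percentages
    computed against the number of files in the old codebase.
    """

    def __init__(self, added, modified, removed, unchanged):
        self.added = added
        self.modified = modified
        self.removed = removed
        self.unchanged = unchanged

    @property
    def old_files_count(self):
        # Assumes old files are exactly: modified + removed + unchanged.
        return self.modified + self.removed + self.unchanged

    def percent(self, count):
        """Return count as a rounded percentage of the old file count."""
        if self.old_files_count == 0:
            return 0.0
        return round(count / self.old_files_count * 100, 2)

    def to_dict(self):
        return {
            "old_files_count": self.old_files_count,
            "percent_added": self.percent(self.added),
            "percent_modified": self.percent(self.modified),
            "percent_removed": self.percent(self.removed),
            "percent_unchanged": self.percent(self.unchanged),
        }


# A DeltaCode-like object could build this once its deltas are counted.
stats = Stats(added=25, modified=2, removed=3, unchanged=5)
print(stats.to_dict())
```

Keeping the counts on the object and computing percentages on demand means nothing is rounded until output, which matches the earlier advice in the thread.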