
Collect/calculate additional statistics about deltacode HOT 22 CLOSED

johnmhoran commented on May 28, 2024

Collect/calculate additional statistics

from deltacode.

Comments (22)

arnav-mandal1234 commented on May 28, 2024

@MaJuRG Yes. The number of files that can be added is unbounded, so a percentage above 100% is likely to happen.
I will open a PR tonight. :)

from deltacode.

steven-esser commented on May 28, 2024

@arnav-mandal1234 yes, that is OK

from deltacode.

johnmhoran commented on May 28, 2024

[@johnmhoran comment] @MaJuRG A couple of questions:

1) Do we want these data points as additional individual columns in the CSV output?
2) The current perspective for the comparison is: what needs to be done to the old scan to generate the new scan. However, some % stats may be more informative when measured against the old scan and others against the new scan.

If I'm thinking about removed, I believe I want % of old scan files removed.
For added: % of new scan files added.
For modified: % of new scan files modified.
For unchanged: % of new scan files unchanged.

Do I have the perspectives correct? Might we want to provide % for the 4 categories measured against both old and new in case a user might consider that data informative?

from deltacode.

johnmhoran commented on May 28, 2024

[@MaJuRG comment] 1) No, we just want to add this info to our stats dictionary for now.

Think about it in terms of len(matches), where matches is our list of Added, Match, Removed (i.e., the total number of files).

Then % added would be:

count_files = len(matches)
percent_added = count_added / (count_files - count_removed)
percent_removed = count_removed / (count_files - count_added)
# ... and so on

My math may be wrong on these, but I think you are thinking of it correctly.

from deltacode.

johnmhoran commented on May 28, 2024

[@johnmhoran comment] OK. Thanks.

from deltacode.

johnmhoran commented on May 28, 2024

[@MaJuRG comment] Clearly, this will require some tests to make sure our math is correct 😀

from deltacode.

johnmhoran commented on May 28, 2024

[@johnmhoran comment] 👍

from deltacode.

johnmhoran commented on May 28, 2024

[@johnmhoran comment] @MaJuRG Do we want the percents to be rounded rather than the full floating-point length, e.g., 0.111 rather than 0.1111111111111111?

from deltacode.

johnmhoran commented on May 28, 2024

[@MaJuRG comment] yes

from deltacode.

johnmhoran commented on May 28, 2024

[@johnmhoran comment] @MaJuRG Note that if we calculate, for example, like this:

count_files = len(deltas)
percent_added = round(float(added) / (count_files - removed), 3)
percent_modified = round(float(modified) / (count_files - added), 3)
percent_removed = round(float(removed) / (count_files - added), 3)
percent_unchanged = round(float(unchanged) / (count_files - added), 3)

we get the percentages from the perspectives I think we seek, but the total may add up to more than 100%.
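
To make the over-100% effect concrete, here is a small worked example of the formulas above. The counts are made up purely for illustration, and the sketch assumes Python 3, where `/` is already true division, so the `float()` wrapper is not needed.

```python
# Hypothetical counts, chosen only to illustrate the arithmetic.
added, modified, removed, unchanged = 5, 2, 3, 10

count_files = added + modified + removed + unchanged  # 20, i.e. len(deltas)
old_count = count_files - added      # 15 files in the old scan
new_count = count_files - removed    # 17 files in the new scan

percent_added = round(added / new_count, 3)          # 5 / 17
percent_modified = round(modified / old_count, 3)    # 2 / 15
percent_removed = round(removed / old_count, 3)      # 3 / 15
percent_unchanged = round(unchanged / old_count, 3)  # 10 / 15

total = percent_added + percent_modified + percent_removed + percent_unchanged
print(percent_added, percent_modified, percent_removed, percent_unchanged)
print(total > 1.0)  # True
```

The total exceeds 1.0 because modified, removed and unchanged already account for the whole old scan, and the added share, measured against the new scan, is then stacked on top.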

from deltacode.

johnmhoran commented on May 28, 2024

[@pombredanne comment] @johnmhoran
Float operations may not add up exactly, especially when rounded.

In general you should not round before computing, but only afterwards, when you need to return values for output. And round to keep two digits, not three. You want to normalize these values to 100 too. And avoid repeating computations by using a function, to keep this readable.

Also use from __future__ import division at the top, so you will have true division and no need to wrap things in a float.

And IMHO you should compute the proportions of added/removed/etc. not against the delta but against the old or new file count. Proportions of deltas do not seem very useful as a stat. I would even argue that percentages are misleading and that just returning totals may be enough and much simpler for now.

An added or unchanged or changed ratio only makes sense against old or new.
Added only makes sense against new.
Removed only makes sense against old.

It is easy to create misleading ratios, so what help can ratios bring? Total change counts are not misleading, as their base is clearly always the old. And they are a clear indicator of the magnitude of the change (and of the possible amount of work that will be needed to re-analyze).

So if you really really want to compute these percentages, this may look like this...

from __future__ import division
[.....]
def percent(value, total):
    """
    Return value as a percentage of total, rounded to two digits.
    """
    ratio = (value / total) * 100
    return round(ratio, 2)

old_files_count = len(old_files)
new_files_count = len(new_files)

# I assume here that added, modified, etc are counts and not lists
added_from_old = percent(added, old_files_count)
added_from_new = percent(added, new_files_count)

modified_from_old = percent(modified, old_files_count)
modified_from_new = percent(modified, new_files_count)

removed_from_old = percent(removed, old_files_count)

unchanged_from_old = percent(unchanged, old_files_count)
# or, to ensure that the total always adds up to 100%
# (old files are exactly modified + removed + unchanged, so added is not subtracted):
# unchanged_from_old = 100 - modified_from_old - removed_from_old
unchanged_from_new = percent(unchanged, new_files_count)
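
One edge case the snippet above leaves open is an empty old or new codebase, where total is 0 and the division raises ZeroDivisionError. Below is a minimal Python 3 sketch of one way to guard against that. The name `safe_percent`, the `digits` parameter, and the choice to return None are assumptions made for illustration and are not part of DeltaCode.

```python
def safe_percent(value, total, digits=2):
    """
    Return value as a rounded percentage of total, or None when total is 0.

    Hypothetical helper: returning None is only one possible convention.
    """
    if total == 0:
        return None
    return round(value / total * 100, digits)


# Made-up counts for illustration only.
print(safe_percent(3, 15))  # 20.0
print(safe_percent(5, 0))   # None
```

Returning None rather than 0 keeps "no files to compare against" distinguishable from "nothing changed", at the cost of callers having to handle the None case.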

from deltacode.

johnmhoran commented on May 28, 2024

[@mjherzog comment] It is essential human nature to want to convert the stats to percentages - percentages are no more/less misleading than the stats but the percentage definitions are actually helpful in pointing to which comparisons might be meaningful.

from deltacode.

johnmhoran commented on May 28, 2024

[@johnmhoran comment] @MaJuRG @pombredanne @mjherzog We seem to have a disagreement about whether we want to collect/calculate percentages and, if so, how they should be calculated.

In the current commit, if my understanding is correct, count_files is the total number of added, modified, removed and unchanged files. count_files - added is the file count for the old codebase, and count_files - removed is the file count for the new codebase. Thus, we currently calculate % for modified, removed and unchanged against the old codebase, and for added, against the new codebase. That seems most useful to me, but I could be wrong.
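
The relationships described above can be checked with a short Python 3 sketch. The counts are hypothetical and the function name `codebase_counts` is an assumption. It also assumes that every file falls into exactly one of the four categories.

```python
def codebase_counts(added, modified, removed, unchanged):
    """Derive the old and new file counts from the four category counts."""
    count_files = added + modified + removed + unchanged
    old_count = count_files - added    # modified + removed + unchanged
    new_count = count_files - removed  # added + modified + unchanged
    return old_count, new_count


# Hypothetical counts for illustration.
added, modified, removed, unchanged = 5, 2, 3, 10
old_count, new_count = codebase_counts(added, modified, removed, unchanged)

# Each codebase is fully covered by three of the four categories.
assert modified + removed + unchanged == old_count
assert added + modified + unchanged == new_count
print(old_count, new_count)  # 15 17
```

Because the old codebase is exactly modified + removed + unchanged, percentages of those three against the old count sum to 100% before rounding. The same holds for added, modified and unchanged against the new count.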

Perhaps the simplest approach would be to calculate a set of percentages against the old codebase, and another against the new, label them clearly, and let the user decide what he or she finds most helpful.

How would you like me to proceed with this issue?

from deltacode.

johnmhoran commented on May 28, 2024

[@mjherzog comment] I think that the use case definition comes back to us more than other users since even we struggle a bit with the definitions. I think that we can focus on one direction to start which is new -> old. And we want the percentages to always add up to 100%.
So we have two simple percentages, unchanged and modified, with the new file count as the denominator.
Added/Removed is much less clear since it is probably some combination of Moved, Added and Removed. So I think that we can defer percentage calculations for now, since they are pretty easy to do from the stats in a spreadsheet, and focus on what we can start doing to identify Moved within the Added/Removed bucket.

from deltacode.

mjherzog commented on May 28, 2024

This may be a semantic detail, but I find the use of "unchanged" and "modified" confusing. I think that the terminology should be "changed"/"unchanged" or "modified"/"unmodified".

from deltacode.

steven-esser commented on May 28, 2024

@mjherzog agreed, I prefer the modified/unmodified approach personally.

from deltacode.

mjherzog commented on May 28, 2024

I suggest that we defer this idea and focus on the data outputs - this is not hard to calculate in a spreadsheet and there the user can choose whether to calculate New vs Old or vice versa.

from deltacode.

arnav-mandal1234 commented on May 28, 2024

Can I work on this? @johnmhoran @MaJuRG @pombredanne

from deltacode.

steven-esser commented on May 28, 2024

@arnav-mandal1234 sure

from deltacode.

arnav-mandal1234 commented on May 28, 2024

@MaJuRG @pombredanne @mjherzog @johnmhoran After going through the whole discussion above, I think all the percentage variables should use count(old_files) as the base. The basic idea of the tool is to compare old to new, so users will generally want to see what has changed in new_directory relative to old_directory. What do you think?

Please guide me if I am wrong.

from deltacode.

steven-esser commented on May 28, 2024

@arnav-mandal1234 Yes, you are correct. We would want the base to be count(old_files) for these calculations. If the new directory has many more files added, percentages will be > 100%, but I think that is what we want to see anyway.

from deltacode.

arnav-mandal1234 commented on May 28, 2024

@MaJuRG This can be done in many ways.
I selected this approach:

Create a "stats" class in __init__.py, similar to the Delta class.
Initialize an object of the stats class in the DeltaCode class.
Then proceed to make the remaining changes.

Is that okay?
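
For readers following the plan, below is a rough Python 3 sketch of what such a class could look like. Everything in it is hypothetical: the class name `Stats`, its attributes, and the `percent_of_old` and `to_dict` methods are assumptions for illustration, not the actual DeltaCode implementation. It uses count(old_files) as the base, as agreed earlier in the thread, so the added percentage can exceed 100.

```python
class Stats(object):
    """Hypothetical container for delta statistics; not the real DeltaCode API."""

    def __init__(self, added, modified, removed, unmodified):
        self.added = added
        self.modified = modified
        self.removed = removed
        self.unmodified = unmodified
        # The old codebase is everything except the files that were added.
        self.old_files_count = modified + removed + unmodified

    def percent_of_old(self, count):
        """Return count as a percentage of the old file count, rounded to 2 digits."""
        if self.old_files_count == 0:
            # Assumption: no meaningful percentage exists for an empty old codebase.
            return None
        return round(count / self.old_files_count * 100, 2)

    def to_dict(self):
        """Return the four percentages keyed by category."""
        return {
            'percent_added': self.percent_of_old(self.added),
            'percent_modified': self.percent_of_old(self.modified),
            'percent_removed': self.percent_of_old(self.removed),
            'percent_unmodified': self.percent_of_old(self.unmodified),
        }


# Made-up counts: more files added than existed in the old codebase.
stats = Stats(added=30, modified=2, removed=3, unmodified=15)
print(stats.to_dict())
```

With these counts the old codebase has 20 files, so percent_added comes out at 150.0 while the other three sum to 100.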

from deltacode.
