nexb / deltacode Goto Github PK


DeltaCode: compare two codebase scans (from ScanCode) to detect significant changes.

Home Page: http://www.aboutcode.org/

Batchfile 1.75% Python 94.74% Shell 2.64% Dockerfile 0.45% Makefile 0.40%

scancode oss-compliance software-licensing deltacode

deltacode's Issues

Add Notice to output json file

Fix to_dict() method

Something is slowing this down a lot, compared to previous commits/versions.

Cloning repo on Windows 10 creates 'tcl' directory

When the deltacode repo is cloned on Windows 10, followed by running ./configure.bat --clean and then ./configure.bat, a tcl directory appears after the last step has finished. My local version of the former spats-deltacode repo does not contain a tcl directory, and I didn't encounter that directory during my previous work in spats-deltacode. After checking out a new branch and launching Visual Studio Code, VSCode indicates that there are 976 "pending changes", all of them evidently in this tcl directory.

My current resolution: I've deleted the tcl directory from inside my local branch.

Add deltas_count field

We should have a deltas_count field in JSON output, similar to scancode-toolkit

May need some discussion as to how to handle the -a option

Add 'errors' attribute to 'DeltaCode' object

Collect exceptions in this errors field when they get raised.

Expands on issue #6 (collect errors).
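
A minimal sketch of what collecting errors could look like. `DeltaCodeSketch` and its `_load()` helper are hypothetical stand-ins, not the real DeltaCode class; the only point is that exceptions end up in an `errors` list instead of on the console.

```python
import json


class DeltaCodeSketch(object):
    """Hypothetical stand-in for the DeltaCode object (illustration only)."""

    def __init__(self, new_path, old_path):
        # Exceptions raised while loading are collected here, not printed.
        self.errors = []
        self.new = self._load(new_path)
        self.old = self._load(old_path)

    def _load(self, path):
        try:
            with open(path) as f:
                return json.load(f)
        except (IOError, ValueError) as e:
            # IOError covers unreadable files; ValueError covers invalid JSON.
            self.errors.append('{}: {}'.format(type(e).__name__, e))
            return None
```

The `errors` list can then be written to the JSON output alongside the deltas.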

Add end-to-end tests

We need end-to-end tests for both our outputs in various scenarios. This probably can be addressed along with #37, as the end-to-end could be incorporated via mock cli calls.

This ticket stems out of a few minor issues I've run into after we do refactoring. Having end-to-end tests in place makes sure that as we refactor and add features, cli workflows are not broken by our changes.

Check and determine ScanCode options present in a ScanCode data file

In order to compare codebases accurately, Deltacode needs scancode data files that have the full file information available, at the very least.

We also need to know what additional scancode options each Deltacode input file has in order to figure out what other scan data is present and therefore what type of license, copyright or other data we can compare.
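
A rough sketch of such a check, working on an already-loaded scan. It assumes the options are recorded under a top-level `scancode_options` mapping of option name to value; that key, and the helper names, are assumptions and may need adjusting for the ScanCode version that produced the file.

```python
def scan_options(scan):
    """
    Return the set of ScanCode option names that are enabled in a loaded scan.
    ASSUMPTION: options live under a top-level 'scancode_options' mapping.
    """
    options = scan.get('scancode_options') or {}
    return {name for name, value in options.items() if value}


def comparable_data(scan):
    """Return which kinds of comparison the scan data can support."""
    options = scan_options(scan)
    return {
        'info': '--info' in options,
        'license': '--license' in options,
        'copyright': '--copyright' in options,
    }
```

A comparison step could then refuse input where `info` is False, and skip license or copyright checks unless both scans carry that data.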

implement release process

We copy the release.sh from scancode-toolkit

remove references to scancode
replace with deltacode

DeltaCode release cleaning/prep

Update setup.py and other metadata in prep for DeltaCode's public release.

Handle data checks at object creation time (constructor)

In many class functions we have data checks like this:

if self.object is None and self.other_thing is None:
  return

We should really handle these in the constructor to reduce LOC.
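
One possible shape for this, using a hypothetical `Pair` class (the attribute names `new_file` and `old_file` are placeholders). Raising here is one choice; falling back to a default value would be another.

```python
class Pair(object):
    """Hypothetical example class; not part of DeltaCode."""

    def __init__(self, new_file=None, old_file=None):
        # Validate once here so that methods do not need to repeat the check.
        if new_file is None and old_file is None:
            raise ValueError('At least one of new_file or old_file is required.')
        self.new_file = new_file
        self.old_file = old_file

    def paths(self):
        # No "both are None" guard is needed: the constructor rules it out.
        return [f['path'] for f in (self.new_file, self.old_file) if f is not None]
```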

Limit output to changes only by default

Add in additional file-level info for json results

Update License object by removing unnecessary fields.
a) update License.to_dict() as well
Update License object and License.to_dict() tests
Update Delta object and Delta.to_dict() tests
Update remaining json-based tests (if needed)

Match scoring

When we have a matching set of files in deltacode, we need to have some way of scoring or weighing a match.

This will also come into play more when we start to incorporate license and copyright changes.

This score will ultimately take the place of modified string in our match object.

Add factors to Delta Object

Add a 'factors' field in lieu of 'category'.

The goal here is to append various 'factors' as we run our codebase thru the various DeltaCode steps (determine_delta moved etc), primarily so we do not have to adjust category often.

If there are relevant factors that go into a particular File pair in a Delta object (a license addition or change, for instance), then that information will simply be appended to this 'factors' list.

When we go to output, all the Deltas will be sorted in descending order by score and we can simply dump the contents of a Delta's 'factors' field into a cell (in the csv case).
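
A small sketch of the idea, assuming a simplified `Delta` with an `update()` helper. The helper name and the score values are made up for illustration.

```python
class Delta(object):
    """Simplified stand-in for the Delta object (illustration only)."""

    def __init__(self, score=0, new_file=None, old_file=None):
        self.score = score
        self.new_file = new_file
        self.old_file = old_file
        self.factors = []

    def update(self, score, factor):
        # Record why the score changed instead of overwriting a category.
        self.score += score
        self.factors.append(factor)


deltas = [Delta(), Delta(), Delta()]
deltas[0].update(100, 'added')
deltas[1].update(20, 'modified')
deltas[1].update(30, 'license change')
deltas[2].update(0, 'removed')

# Sort in descending order by score before output.
ordered = sorted(deltas, key=lambda d: d.score, reverse=True)
# One csv cell per Delta: join the factors into a single string.
cells = [' '.join(d.factors) for d in ordered]
```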

Removing unused test data/tests/other data

Also, make sure that each individual test file calls its own test data.

AssertionError when running DeltaCode on eCos scans

I ran ScanCode with the following options (-clipeu) on version 2.0 of eCos and the latest HEAD of the eCos CVS repo. Afterwards, I ran DeltaCode on the report files and got the following error:

$ deltacode -n ecos-head.json -o ~/Desktop/ecos-2.0-linux.json -c delta.csv
Traceback (most recent call last):
  File "/home/jono/nexb/tools/develop/deltacode/bin/deltacode", line 11, in <module>
    load_entry_point('deltacode', 'console_scripts', 'deltacode')()
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/cli.py", line 80, in cli
    delta = DeltaCode(new, old)
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/__init__.py", line 23, in __init__
    self.deltas = self.determine_delta()
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/__init__.py", line 103, in determine_delta
    assert len(deltas) == ((self.new.files_count - new_nonfiles) + (self.old.files_count - old_nonfiles) - modified - unchanged)
AssertionError

Attached are the input files I used to get this error:
ecos-scans.zip

Handle 'moved' files

This will need some thinking, but we will want some way to tell if a file has been 'moved' between the new and old scans of some codebase.

This means the sha1 should be matching, but the path would not be. There are also cases where the same file could be present in multiple locations.

We may need to index the files by sha1, similar to what we did in determine_delta
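
A rough sketch of the sha1 indexing idea, operating on plain dicts with `path` and `sha1` keys. The function names are hypothetical. It pairs only sha1 values that occur exactly once on each side, and leaves the multiple-location case alone.

```python
from collections import defaultdict


def index_by_sha1(files):
    """Return a mapping of sha1 -> list of file dicts."""
    index = defaultdict(list)
    for f in files:
        index[f['sha1']].append(f)
    return index


def find_moved(added, removed):
    """
    Return (new, old) pairs whose sha1 matches but whose path differs.
    Only sha1 values present exactly once on each side are paired.
    """
    added_index = index_by_sha1(added)
    removed_index = index_by_sha1(removed)
    moved = []
    for sha1, new_files in added_index.items():
        old_files = removed_index.get(sha1, [])
        if len(new_files) == 1 and len(old_files) == 1:
            new, old = new_files[0], old_files[0]
            if new['path'] != old['path']:
                moved.append((new, old))
    return moved
```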

Add 'score' attribute to 'Delta' object

Expands issue #3 (match scoring).

update docstrings to imperative style

For example, we should change docstrings like:

"""
Handles the basic operations...
"""

to:

"""
Handle the basic operations...
"""

Add filetype info (or similar) to the output.

Along with type and path, we should add filetype (or something similar) to ease filtering of results. Often we do not care about config files or Makefiles when analyzing a codebase.

Add deltacode and deltacode entrypoint script

Similar to scancode, we need to add a deltacode and ./deltacode.bat top level script.

This entrypoint script allows a user to simply run ./deltacode after cloning or downloading our deltacode repo. Its main responsibility is to handle the initial configuration automatically.

This makes it easier from an end-user point of view. We will also want to include this in our release script.

Identify version changes

[@mjherzog comment] A significant subset of the Added/Removed files in a DeltaCode comparison are likely due to a version change for the same component. This will be complex to solve because the version number may be embedded in the path for a source code directory and/or in the filename for a Development or Deployment component. But this will also be very valuable because upgrading component versions between product releases is extremely common.
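
As a starting point only, here is a crude heuristic sketch: strip version-like segments from paths so that an added and a removed path can be compared on what remains. The regular expression and the `strip_version()` name are assumptions, and real version schemes will need more than this.

```python
import re

# Matches a separator followed by a dotted version such as -1.1.1 or _v2.0.
VERSION = re.compile(r'[-_]v?\d+(?:\.\d+)+')


def strip_version(path):
    """Return path with version-like segments removed."""
    return VERSION.sub('', path)


def same_component(added_path, removed_path):
    """Return True if two different paths are equal once versions are stripped."""
    return (added_path != removed_path
            and strip_version(added_path) == strip_version(removed_path))
```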

Collect errors

[@MaJuRG comment] stop printing error messages to the console, and log the errors instead in the output.

Determine Score implementation

related to: #3

For now, we should simply add small values to our score at the different times. Later on we can think about subtracting for lack of info (license or otherwise), Permissive licenses, etc.

Add Travis CI integrations

We need to add the Travis configuration and setup for deltacode. This includes the Slack hook as well.

Add proper license and copyright headers to all source files

Look at https://github.com/nexb/scancode-toolkit for examples. We can reuse the copyright and license header from there.

Update to handle new(ish) scancode results

In particular, the way file counts are handled is slightly different.

Simplify deltacode output

For the csv output, it would be a better presentation if we simply included a single path value, instead of empty or repeated paths that are redundant.

subclass with object for all Classes

To use modern python2/3 style classes, we need to make sure all of our objects are subclassed with 'object': class SomeThing(object):

use license scan information to further determine modified.

@mjherzog brought this up in a recent call.

For the 'modified' set of files we should look at the license scan information to further distinguish the type of modification.

For example, two files can effectively be marked unmodified if the license scan information (i.e. license key and/or expression) is the same for both File objects. This type of check is related to #3 as well.

convert README to .rst

Currently our README is in markdown form.

In order to conform with nexB + pypi standards, our README file should be .rst (reStructuredText).

The doc is simple enough to edit manually, but there are also automated options: http://avilpage.com/2014/11/pandoc-best-way-to-convert-markdown-to.html

Consolidate scoring in 'Delta' method

This issue continues the work started with issue #52, in which a hard-coded score is assigned to the new Delta.score attribute when the various Delta categories are created, i.e., in:

DeltaCode.determine_delta(),
DeltaCode.update_deltas() and
Delta._license_diff().

empty path strings in outputs appear after some alignments

When running

deltacode -n tests/data/deltacode/ecos-failed-counts-assertion-new.json -o tests/data/deltacode/ecos-failed-counts-assertion-old.json -c ~/test.csv

Many modified or license change Deltas show an empty path value in our output. I believe this happens as a result of align_scan(), but I could be wrong. I will investigate further and post more details.

Add date and platform to JSON output

[@johnmhoran comment] In terms of record-keeping, a user might find it helpful if the JSON output includes the date/time at which the JSON was generated and the platform on which DeltaCode was run.
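
A sketch using only the standard library. The `header` layout and its key names are hypothetical, not an existing DeltaCode output format.

```python
import json
import platform
from datetime import datetime, timezone


def build_header():
    """Return record-keeping details for the JSON output (hypothetical keys)."""
    return {
        # Timezone-aware UTC timestamp in ISO 8601 form.
        'generated_at': datetime.now(timezone.utc).isoformat(),
        # Human-readable platform string for the machine running the tool.
        'platform': platform.platform(),
        'python_version': platform.python_version(),
    }


output = {'header': build_header(), 'deltas': []}
text = json.dumps(output, indent=2)
```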

Make `_license_diff` its own deltacode function.

Similar to determine moved, we move this to DeltaCode.

The primary purpose of this function is simply to modify the Delta score field, depending on the license information between two Files.

We can keep it simple for now, and use a similar algorithm to that currently in _license_diff()

Migrate to using json-to-csv script in lieu of separate formatting option

Instead of maintaining two different 'branches' of output, we should default to json only.

We will still want a way to view the results in csv form, but this can be moved to a separate script that only takes a deltacode json output and converts it to csv.

UX should not change: -c output should still output a csv file. It is in the internals that we would just make a call to our json2csv script instead of write_csv

Collect/calculate additional statistics

From @MaJuRG We can get basic counts currently in deltacode; we need to expand that to additional calculations like % added, removed, etc, % changed, perhaps some sort of codebase 'drift' calculation that incorporates a number of different stats.
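
One way the percentages could be computed from a list of delta categories. Using the old scan's file count as the denominator is an assumption (other baselines are possible), and `delta_stats()` is a hypothetical name.

```python
from collections import Counter


def delta_stats(categories, old_files_count):
    """Return counts and percentages relative to the old codebase size."""
    counts = Counter(categories)

    def percent(n):
        # Avoid dividing by zero when the old scan has no files.
        if not old_files_count:
            return None
        return round(100.0 * n / old_files_count, 2)

    return {
        'added': counts['added'],
        'removed': counts['removed'],
        'modified': counts['modified'],
        'percent_added': percent(counts['added']),
        'percent_removed': percent(counts['removed']),
        'percent_modified': percent(counts['modified']),
    }
```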

Add cli tests

Use the scancode cli.py tests as a model. We mainly want to verify correct cli output in certain scenarios.

Add windows builds

We currently only have linux builds running via Travis CI. We should have all platforms.

Refactor 'generate_csv()'

I think we might be able to reduce the 38 or so lines used in generate_csv() to construct the .csv tuple down to around 13 lines by using a ternary/conditional expression, e.g.,

new = '' if delta['category'] == 'removed' else delta['new']['path']

Initial testing suggests this works as expected -- all 45 tests pass. The refactoring would be applied here:

deltacode/src/deltacode/cli.py

Lines 26 to 66 in 551f231

    for delta in deltas:
        category = delta['category']
        if delta['category'] == 'added':
            new = delta['new']['path']
            new_filename = delta['new']['name']
            new_sha1 = delta['new']['sha1']
            new_size = delta['new']['size']
            new_type = delta['new']['type']
            new_orig = delta['new']['original_path']
            old = ''
            old_filename = ''
            old_sha1 = ''
            old_size = ''
            old_type = ''
            old_orig = ''
        elif delta['category'] == 'removed':
            new = ''
            new_filename = ''
            new_sha1 = ''
            new_size = ''
            new_type = ''
            new_orig = ''
            old = delta['old']['path']
            old_filename = delta['old']['name']
            old_sha1 = delta['old']['sha1']
            old_size = delta['old']['size']
            old_type = delta['old']['type']
            old_orig = delta['old']['original_path']
        else:
            new = delta['new']['path']
            new_filename = delta['new']['name']
            new_sha1 = delta['new']['sha1']
            new_size = delta['new']['size']
            new_type = delta['new']['type']
            new_orig = delta['new']['original_path']
            old = delta['old']['path']
            old_filename = delta['old']['name']
            old_sha1 = delta['old']['sha1']
            old_size = delta['old']['size']
            old_type = delta['old']['type']
            old_orig = delta['old']['original_path']
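
For comparison, a sketch of how the conditional-expression approach could collapse those branches. The `csv_row()` helper and the column order of the returned tuple are assumptions; the real tuple built in generate_csv() may be ordered differently.

```python
FIELDS = ('path', 'name', 'sha1', 'size', 'type', 'original_path')


def csv_row(delta):
    """Return one csv tuple for a delta dict (column order is an assumption)."""
    category = delta['category']
    # Empty strings stand in for the side that does not exist.
    new = [''] * len(FIELDS) if category == 'removed' else [delta['new'][f] for f in FIELDS]
    old = [''] * len(FIELDS) if category == 'added' else [delta['old'][f] for f in FIELDS]
    return tuple([category] + new + old)
```

Because each conditional expression is evaluated lazily, the missing side is never indexed, so `delta['old']` may be None for an 'added' delta.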

Add macos builds

Since we support it, we should have this as well.

Failing case if extra directory is added

I did some simple tests using "balloontip-1.1.1.jar" as a sample file. Here are my findings:

1. Created two directories, d1/ and d2/, put the test file in each, and compared the two directories. The output is unchanged, which is correct.
2. Same setup as (1), but created a new subdirectory named test/ under d1/ and put balloontip-1.1.1.jar in it. Both
d1/balloontip-1.1.1.jar
d1/test/balloontip-1.1.1.jar
are returned as added, and d2/balloontip-1.1.1.jar is returned as removed. This is not correct: d1/balloontip-1.1.1.jar and d2/balloontip-1.1.1.jar should be returned as unchanged, and only d1/test/balloontip-1.1.1.jar should be considered added.
3. Same setup as (1), but created a new root/ directory, put d1/ in it, and ran deltacode from root/ to d2/. The output is unchanged, which is correct.

Adjust License Diff to account for License Category

Auditors care more about Copyleft and Proprietary licenses showing up in a codebase. We need to adjust our scoring so that more emphasis is given to:

'no license' -> 'copyleft limited or higher'
'permissive' -> 'copyleft limited or higher'
'anything' -> 'proprietary/commercial'

There are probably other combinations here as well.
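
A sketch of how those three transitions might feed the score. The category names, the score values and the `category_score()` function are all placeholders for illustration, not existing DeltaCode behavior.

```python
# Placeholder category names and weights; real values need discussion.
COPYLEFT = {'copyleft limited', 'copyleft'}
PROPRIETARY = {'proprietary', 'commercial'}


def category_score(old_categories, new_categories):
    """Return extra score for license category transitions auditors care about."""
    old_categories = set(old_categories)
    gained = set(new_categories) - old_categories
    score = 0
    if gained & PROPRIETARY:
        score += 50  # anything -> proprietary/commercial
    if gained & COPYLEFT:
        if not old_categories:
            score += 40  # no license -> copyleft limited or higher
        elif old_categories <= {'permissive'}:
            score += 30  # permissive -> copyleft limited or higher
    return score
```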

update travis to use new `deltacode` script

increment counter in determine_delta()

Instead of decrementing a counter in determine_delta(), we should increment it and check it against files_count at the end, instead of against 0.

We should also add some error message when the assertion fails.
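
A simplified sketch of counting up and asserting with a message. `categorize()` is a hypothetical stand-in for the relevant part of determine_delta() and works on plain dicts with `path` and `sha1` keys.

```python
def categorize(new_files, old_files, files_count):
    """Categorize each new file, counting up and verifying the total at the end."""
    old_by_path = {f['path']: f for f in old_files}
    visited = 0
    categories = []
    for new in new_files:
        visited += 1  # count up as each file is handled
        old = old_by_path.get(new['path'])
        if old is None:
            categories.append('added')
        elif old['sha1'] == new['sha1']:
            categories.append('unmodified')
        else:
            categories.append('modified')
    # Compare against the expected total, with a message that aids debugging.
    assert visited == files_count, (
        'Visited {} files but the scan reports files_count={}'.format(
            visited, files_count))
    return categories
```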

Further refine license_diff() to determine license info added or license info removed

Instead of returning 'license change', we should have indications when the new side now has license info when the old side has none, and vice-versa.
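
Something along these lines, comparing sets of license keys. The function name and the exact factor strings are suggestions only.

```python
def license_factor(old_licenses, new_licenses):
    """Return a factor string describing the license difference, or None."""
    old_keys = set(old_licenses)
    new_keys = set(new_licenses)
    if old_keys == new_keys:
        return None
    if not old_keys:
        # The old side had no license info; the new side now has some.
        return 'license info added'
    if not new_keys:
        # The old side had license info; the new side has none.
        return 'license info removed'
    return 'license change'
```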

Flatten DeltaCode.deltas field

Once we have the basics of scoring (#52), we can move on to flattening the deltas field of DeltaCode. This means moving from a dictionary of lists to just a single list of Delta objects.

Prior to output or perhaps prior to assignment, we will want to sort this list by Delta.score to preserve order and/or cull entries we do not need (if a user specifies --all for instance)
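
A sketch of the flattening step on plain dicts. It assumes the current dictionary is keyed by category name, that each delta carries a `score`, and that the 'unmodified' entries are the ones culled unless --all is given. All three are assumptions.

```python
def flatten(deltas_by_category, include_all=False):
    """Return a single list of deltas sorted by descending score."""
    flat = []
    for category, deltas in deltas_by_category.items():
        # Cull entries the user did not ask for (assumed: unmodified files).
        if category == 'unmodified' and not include_all:
            continue
        flat.extend(deltas)
    return sorted(flat, key=lambda d: d['score'], reverse=True)
```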

Only pass score during Delta creation.

change to Delta(score, new_file, old_file) in all the places that it was created.

This involves removing 'category' from the Delta object

Our initial scores should be simple for now: an Added should add 100 to the score (which is 0 by default). A Removed will simply not add any value to the score. In both cases the factor, 'added' or 'removed', will also be appended to Delta.factors.

During license_diff and other DeltaCode steps, we will simply add values to the score as things are found out about the Delta.

Identify dupes

We have means of (crudely) identifying moved files. However, this only looks at the cases where there is a single file per sha1 value.

We need a way of handling files that appear multiple times on either or both sides of the scans.

This will need to be broken into smaller tickets

Add support for copyright holder diff

Similar to License Diff, we will need to add a simple (at first) copyright diff function for the delta object.

