nexb / deltacode
DeltaCode: compare two codebase scans (from ScanCode) to detect significant changes.
Home Page: http://www.aboutcode.org/
Something is slowing this down a lot, compared to previous commits/versions.
When the deltacode repo is cloned on Windows 10, followed by running ./configure.bat --clean and then ./configure.bat, a tcl directory appears after the last step has finished. My local version of the former spats-deltacode repo does not contain a tcl directory, and I didn't encounter that directory during my previous work in spats-deltacode. After checking out a new branch and launching Visual Studio Code, VSCode indicates that there are 976 "pending changes" -- all of them evidently in this tcl directory.
My current resolution: I've deleted the tcl directory from inside my local branch.
We should have a deltas_count field in JSON output, similar to scancode-toolkit
May need some discussion as to how to handle the -a option.
Collect exceptions in this errors field when they get raised.
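A minimal sketch of the idea, assuming a hypothetical loader class with an errors list; the names ScanLoader and load are illustrative and not taken from the DeltaCode source. It shows an exception being recorded for later output instead of being raised or printed.

```python
import json


class ScanLoader(object):
    """Hypothetical loader that records exceptions instead of raising them."""

    def __init__(self):
        # Collected error messages, intended to be written to the output.
        self.errors = []

    def load(self, text):
        """Return the parsed JSON scan, or None if parsing failed."""
        try:
            return json.loads(text)
        except ValueError as e:
            # json.JSONDecodeError is a subclass of ValueError on Python 3.
            self.errors.append('{}: {}'.format(type(e).__name__, e))
            return None


loader = ScanLoader()
good = loader.load('{"files_count": 1}')
bad = loader.load('not json')
```

After the two calls, loader.errors holds one message for the failed parse, and the caller can decide whether to continue.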
Expands on issue #6 (collect errors).
We need end-to-end tests for both our outputs in various scenarios. This probably can be addressed along with #37, as the end-to-end could be incorporated via mock cli calls.
This ticket stems out of a few minor issues I've run into after we do refactoring. Having end-to-end tests in place makes sure that as we refactor and add features, cli workflows are not broken by our changes.
In order to compare codebases accurately, Deltacode needs scancode data files that have the full file information available, at the very least.
We also need to know what additional scancode options each Deltacode input file has in order to figure out what other scan data is present and therefore what type of license, copyright or other data we can compare.
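One way to sketch this check, assuming each scan JSON carries a top-level 'scancode_options' mapping of option name to value (an assumption about the input layout; the function name shared_options is hypothetical):

```python
def enabled_options(scan):
    """Return the set of option names that are switched on in one scan."""
    options = scan.get('scancode_options') or {}
    return set(name for name, value in options.items() if value)


def shared_options(new_scan, old_scan):
    """Return the options enabled in both scans, i.e. the comparable data."""
    return sorted(enabled_options(new_scan) & enabled_options(old_scan))


new = {'scancode_options': {'--info': True, '--license': True, '--copyright': True}}
old = {'scancode_options': {'--info': True, '--license': True, '--copyright': False}}
common = shared_options(new, old)
```

Here only the file information and license data could be compared, because the old scan was run without copyright detection.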
We copy the release.sh from scancode-toolkit.
Update setup.py and other metadata in prep for DeltaCode's public release.
In many class functions we have data checks like this:
if self.object is None and self.other_thing is None:
    return
We should really handle these in the constructor to reduce LOC.
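A rough sketch of moving the check into the constructor. The class name Pair and the attributes new and old are hypothetical stand-ins for the checked attributes; raising ValueError is one possible choice and differs from the silent return shown above.

```python
class Pair(object):
    """Hypothetical class holding two optional sides of a comparison."""

    def __init__(self, new, old):
        # Validate once here so the methods below need no repeated guards.
        if new is None and old is None:
            raise ValueError('at least one of new or old is required')
        self.new = new
        self.old = old

    def paths(self):
        # The both-missing case was already rejected in __init__.
        return [f for f in (self.new, self.old) if f is not None]


pair = Pair('a.c', None)
```

With this shape, an invalid object can never exist, so each method can drop its own guard.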
When we have a matching set of files in deltacode, we need to have some way of scoring or weighing a match.
This will also come into play more when we start to incorporate license and copyright changes.
This score will ultimately take the place of the modified string in our match object.
Add a 'factors' field in lieu of 'category'.
The goal here is to append various 'factors' as we run our codebase through the various DeltaCode steps (determine_delta, moved, etc.), primarily so we do not have to adjust category often.
If there are relevant factors that go into a particular File pair in a Delta object (a license addition or change, for instance), then that information will simply be appended to this 'factors' list.
When we go to output, all the Deltas will be sorted in descending order by score and we can simply dump the contents of a Delta's 'factors' field into a cell (in the csv case).
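The score-plus-factors idea could look roughly like this. The update helper and the values 20 and 10 are assumptions made for illustration; only the attribute names score and factors come from the discussion.

```python
class Delta(object):
    """Hypothetical Delta with a numeric score and a list of factors."""

    def __init__(self, score=0, new_file=None, old_file=None):
        self.score = score
        self.new_file = new_file
        self.old_file = old_file
        self.factors = []

    def update(self, value, factor):
        """Add to the score and record the reason."""
        self.score += value
        self.factors.append(factor)


deltas = [Delta(new_file='a.c'), Delta(new_file='b.c', old_file='b.c')]
deltas[0].update(100, 'added')
deltas[1].update(20, 'modified')
deltas[1].update(10, 'license change')

# Sort in descending order by score before output.
ordered = sorted(deltas, key=lambda d: d.score, reverse=True)
# One csv cell per Delta: the factors joined into a single string.
cells = [' '.join(d.factors) for d in ordered]
```

Because each step only appends to factors, no step has to overwrite a category chosen by an earlier step.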
Also, make sure that each individual test file calls its own test data.
I ran ScanCode with the following options (-clipeu) on version 2.0 of eCos and the latest HEAD of the eCos CVS repo. Afterwards, I ran DeltaCode on the report files and I got the following issue:
$ deltacode -n ecos-head.json -o ~/Desktop/ecos-2.0-linux.json -c delta.csv
Traceback (most recent call last):
  File "/home/jono/nexb/tools/develop/deltacode/bin/deltacode", line 11, in <module>
    load_entry_point('deltacode', 'console_scripts', 'deltacode')()
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/cli.py", line 80, in cli
    delta = DeltaCode(new, old)
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/__init__.py", line 23, in __init__
    self.deltas = self.determine_delta()
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/__init__.py", line 103, in determine_delta
    assert len(deltas) == ((self.new.files_count - new_nonfiles) + (self.old.files_count - old_nonfiles) - modified - unchanged)
AssertionError
Attached are the input files I used to get this error:
ecos-scans.zip
This will need some thinking, but we will want some way to tell if a file has been 'moved' between the new and old scans of some codebase.
This means the sha1 should be matching, but the path would not be. There are also cases where the same file could be present in multiple locations.
We may need to index the files by sha1, similar to what we did in determine_delta
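A minimal sketch of the sha1 index, assuming each file entry is a dict with 'path' and 'sha1' keys; the function names are hypothetical. It pairs only the unambiguous case of one path per sha1 on each side and leaves files present in multiple locations untouched.

```python
from collections import defaultdict


def index_by_sha1(files):
    """Map each sha1 to the list of paths carrying it."""
    index = defaultdict(list)
    for f in files:
        index[f['sha1']].append(f['path'])
    return index


def find_moved(added, removed):
    """
    Return (old_path, new_path) pairs whose sha1 matches but whose path
    differs. Only cases with exactly one path per side are paired.
    """
    new_index = index_by_sha1(added)
    old_index = index_by_sha1(removed)
    moved = []
    for sha1, new_paths in new_index.items():
        old_paths = old_index.get(sha1, [])
        if len(new_paths) == 1 and len(old_paths) == 1:
            if new_paths[0] != old_paths[0]:
                moved.append((old_paths[0], new_paths[0]))
    return moved


added = [{'path': 'src/b/a.c', 'sha1': 'x'}, {'path': 'new.c', 'sha1': 'y'}]
removed = [{'path': 'src/a.c', 'sha1': 'x'}]
moved = find_moved(added, removed)
```

In this example src/a.c is reported as moved to src/b/a.c, while new.c stays a plain addition.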
Expands issue #3 (match scoring).
For example, we should change docstrings like:
"""
Handles the basic operations...
"""
to:
"""
Handle the basic operations...
"""
Along with type and path, we should add filetype (or something similar) to ease filtering of results. Often we do not care about config files or Makefiles when analyzing a codebase.
Similar to scancode, we need to add a deltacode and ./deltacode.bat top-level script.
This entrypoint script allows a user to simply run ./deltacode after cloning or downloading our deltacode repo. Its main responsibility is to handle the initial configuration automatically.
This makes it easier from an end-user point of view. We will also want to include this in our release script.
[@mjherzog comment] A significant subset of the Added/Removed files in a DeltaCode comparison are likely due to a version change for the same component. This will be complex to solve because the version number may be embedded in the path for a source code directory and/or in the filename for a Development or Deployment component. But this will also be very valuable because upgrading component versions between product releases is extremely common.
[@MaJuRG comment] stop printing error messages to the console, and log the errors instead in the output.
related to: #3
For now, we should simply add small values to our score at the different times. Later on we can think about subtracting for lack of info (license or otherwise), Permissive licenses, etc.
Need to add travis configuration and setup for deltacode. This includes the slack hook as well.
Look at https://github.com/nexb/scancode-toolkit for examples. We can reuse the copyright and license header from there.
In particular, the way file counts are handled is slightly different.
For the csv output, it would be a better presentation if we simply included a single path value, instead of empty or repeated paths that are redundant.
To use modern python2/3 style classes, we need to make sure all of our objects are subclassed with 'object': class SomeThing(object):
@mjherzog brought this up in a recent call.
For the 'modified' set of files we should look at the license scan information to further distinguish the type of modification.
For example, two files can effectively be marked unmodified if the license scan information (i.e. license key and/or expression) is the same for both File objects. This type of check is related to #3 as well.
Currently our README is in markdown form.
In order to conform with nexB + pypi standards, our README file should be .rst, or reStructuredText.
The doc is simple enough to edit manually, but there are also automated options: http://avilpage.com/2014/11/pandoc-best-way-to-convert-markdown-to.html
This issue continues the work started with issue #52, in which a hard-coded score is assigned to the new Delta.score attribute when the various Delta categories are created, i.e., in DeltaCode.determine_delta(), DeltaCode.update_deltas() and Delta._license_diff().
When running
deltacode -n tests/data/deltacode/ecos-failed-counts-assertion-new.json -o tests/data/deltacode/ecos-failed-counts-assertion-old.json -c ~/test.csv
many modified or license change Deltas show an empty path value in our output. I believe this happens as a result of align_scan(), but I could be wrong. I will investigate further and post more details.
[@johnmhoran comment] In terms of record-keeping, a user might find it helpful if the JSON output includes the date/time at which the JSON was generated and the platform on which DeltaCode was run.
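A small sketch of such record-keeping fields using only the standard library. The key names generated_at, platform and python_version are hypothetical and not part of any existing DeltaCode output.

```python
import json
import platform
import time


def output_header():
    """Build hypothetical metadata fields for the top of the JSON output."""
    return {
        # ISO 8601 style UTC timestamp of when the output was generated.
        'generated_at': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
        # Description of the operating system DeltaCode was run on.
        'platform': platform.platform(),
        'python_version': platform.python_version(),
    }


header = output_header()
text = json.dumps(header, indent=2)
```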
Similar to determine moved, we move this to DeltaCode.
The primary purpose of this function is simply to modify the Delta score field, depending on the license information between two Files.
We can keep it simple for now, and use a similar algorithm to that currently in _license_diff()
Instead of maintaining two different 'branches' of output, we should default to json only.
We will still want a way to view the results in csv form, but this can be moved to a separate script that only takes a deltacode json output and converts it to csv.
UX should not change: the -c option should still output a csv file. Internally, we would just make a call to our json2csv script instead of write_csv.
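A sketch of what the conversion step might look like. The JSON layout assumed here (a top-level 'deltas' list whose entries have 'category', 'new' and 'old') and the function name json_to_csv are assumptions for illustration, not the actual DeltaCode format.

```python
import csv
import io
import json


def json_to_csv(json_text):
    """Convert a hypothetical DeltaCode JSON document to csv text."""
    data = json.loads(json_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['category', 'path'])
    for delta in data['deltas']:
        # Prefer the new path; fall back to the old path for removed files.
        side = delta.get('new') or delta.get('old') or {}
        writer.writerow([delta['category'], side.get('path', '')])
    return out.getvalue()


sample = json.dumps({'deltas': [
    {'category': 'added', 'new': {'path': 'a.c'}, 'old': None},
    {'category': 'removed', 'new': None, 'old': {'path': 'b.c'}},
]})
csv_text = json_to_csv(sample)
```

Keeping the converter a pure function of the JSON text means the csv output can never drift from the JSON output.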
From @MaJuRG We can get basic counts currently in deltacode; we need to expand that to additional calculations like % added, removed, etc, % changed, perhaps some sort of codebase 'drift' calculation that incorporates a number of different stats.
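As a starting point, the percentage figures could be derived from the existing counts along these lines. The function name and the category names in the sample are assumptions; a 'drift' metric would need its own definition.

```python
def percentages(counts):
    """Return each category count as a percentage of all deltas."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for an empty comparison.
        return {category: 0.0 for category in counts}
    return {
        category: round(100.0 * count / total, 2)
        for category, count in counts.items()
    }


stats = percentages({'added': 5, 'removed': 5, 'modified': 10, 'unmodified': 80})
```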
Use the scancode cli.py tests as a model. We mainly want to verify correct cli output in certain scenarios.
We currently only have linux builds running via Travis CI. We should have all platforms.
I think we might be able to reduce the 38 or so lines used in generate_csv() to construct the .csv tuple down to around 13 lines by using a ternary/conditional expression, e.g.,
new = '' if delta['category'] == 'removed' else delta['new']['path']
Initial testing suggests this works as expected -- all 45 tests pass. The refactoring would be applied here:
deltacode/src/deltacode/cli.py
Lines 26 to 66 in 551f231
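Extending the one-line example above to both sides of a delta, a hedged sketch might read as follows. The three-field tuple and the function name csv_row are simplifications for illustration; the real generate_csv() tuple may carry more fields.

```python
def csv_row(delta):
    """Build one (category, new_path, old_path) tuple for the csv output."""
    category = delta['category']
    # An added file has no old side; a removed file has no new side.
    new = '' if category == 'removed' else delta['new']['path']
    old = '' if category == 'added' else delta['old']['path']
    return (category, new, old)


row = csv_row({'category': 'removed', 'new': None, 'old': {'path': 'b.c'}})
```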
Since we support it, we should have this as well.
I did some simple tests and here are my findings:
I used "balloontip-1.1.1.jar" as a sample file.
(1) Created 2 directories, d1/ and d2/, put the test file in each, and then compared the 2 directories. The output is unchanged, which is correct.
(2) Same setup as (1), but created a new subdirectory named test/ under d1/ and put balloontip-1.1.1.jar in it. Both d1/balloontip-1.1.1.jar and d1/test/balloontip-1.1.1.jar are returned as added, and d2/balloontip-1.1.1.jar is returned as removed. This is not correct: d1/balloontip-1.1.1.jar and d2/balloontip-1.1.1.jar should be returned as unchanged, while only d1/test/balloontip-1.1.1.jar should be considered added.
(3) Created a root/ directory, put d1/ in it, and ran deltacode from root/ to d2/. The output is unchanged, which is correct.
Auditors care more about Copyleft and Proprietary licenses showing up in a codebase. We need to adjust our scoring so that more emphasis is given to:
There are probably other combinations here as well.
Instead of decrementing a counter in determine_delta(), we should increment it and check against files_count at the end, instead of 0.
We should also add some error message when the assertion fails.
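A sketch of the incrementing check with an assertion message, assuming each delta is reduced to a (new_path, old_path) tuple where None marks a missing side. The function name check_counts and this tuple shape are hypothetical.

```python
def check_counts(deltas, new_files_count, old_files_count):
    """
    Count the files covered by the deltas and compare against the totals
    reported by each scan, failing with a descriptive message.
    """
    new_seen = old_seen = 0
    for new_path, old_path in deltas:
        if new_path is not None:
            new_seen += 1
        if old_path is not None:
            old_seen += 1
    assert new_seen == new_files_count, (
        'new scan: counted {} files, expected {}'.format(new_seen, new_files_count))
    assert old_seen == old_files_count, (
        'old scan: counted {} files, expected {}'.format(old_seen, old_files_count))


# One unchanged pair, one added file, one removed file.
check_counts([('a', 'a'), ('b', None), (None, 'c')], 2, 2)
```

Counting up and comparing per side makes the failure message say which scan is off and by how much.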
Instead of returning 'license change', we should have indications when the new side now has license info when the old side has none, and vice-versa.
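One possible shape for these finer-grained indications, assuming each file entry is a dict with a 'licenses' list whose items carry a 'key' (an assumption about the scan layout). The function names and the returned strings are hypothetical.

```python
def license_keys(scanned_file):
    """Return the set of license keys detected in one scanned file entry."""
    return set(l['key'] for l in scanned_file.get('licenses') or [])


def license_change(new_file, old_file):
    """Classify the license difference between two versions of a file."""
    new_keys, old_keys = license_keys(new_file), license_keys(old_file)
    if new_keys == old_keys:
        return None
    if new_keys and not old_keys:
        return 'license info added'
    if old_keys and not new_keys:
        return 'license info removed'
    return 'license change'


mit = {'licenses': [{'key': 'mit'}]}
gpl = {'licenses': [{'key': 'gpl-2.0'}]}
none = {'licenses': []}
```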
Once we have the basics of scoring (#52), we can move on to flattening the deltas field of DeltaCode. This means moving from a dictionary of lists to just a single list of Delta objects.
Prior to output or perhaps prior to assignment, we will want to sort this list by Delta.score to preserve order and/or cull entries we do not need (if a user specifies --all, for instance).
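The flattening step might be sketched as below. Here each entry is a bare (score, path) tuple rather than a real Delta object, and the rule that zero-score entries are dropped unless --all is given is an assumption about the intended culling.

```python
from itertools import chain

# Hypothetical current shape: category -> list of (score, path) entries.
deltas_by_category = {
    'added': [(100, 'a.c')],
    'modified': [(20, 'b.c'), (30, 'c.c')],
    'unmodified': [(0, 'd.c')],
}


def flatten(deltas, include_all=False):
    """Return one list sorted by descending score, optionally culled."""
    flat = chain.from_iterable(deltas.values())
    if not include_all:
        # Assumed rule: without --all, drop entries that carry no score.
        flat = (d for d in flat if d[0] > 0)
    return sorted(flat, key=lambda d: d[0], reverse=True)


culled = flatten(deltas_by_category)
everything = flatten(deltas_by_category, include_all=True)
```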
Change to Delta(score, new_file, old_file) in all the places that it was created.
This involves removing 'category' from the Delta object.
Our initial scores should be simple for now: an Added should add 100 to a score (which is 0 by default). A Removed will simply not add any value to a score. Both cases will also have the factors 'added' or 'removed' appended to Delta.factors.
During license_diff and other DeltaCode steps, we will simply add values to the score as things are found out about the Delta.
We have a means of (crudely) identifying moved files. However, this only looks at the cases where there is a single file per sha1 value.
We need a way of handling files that appear multiple times on either/both sides of the scans.
This will need to be broken into smaller tickets.
Similar to License Diff, we will need to add a simple (at first) copyright diff function for the delta object.