
nexb / deltacode


DeltaCode: compare two codebase scans (from ScanCode) to detect significant changes.

Home Page: http://www.aboutcode.org/

Batchfile 1.75% Python 94.74% Shell 2.64% Dockerfile 0.45% Makefile 0.40%
scancode oss-compliance software-licensing deltacode

deltacode's People

Contributors

agustinhenze, arijitde92, arnav-mandal1234, ayansinhamahapatra, chinyeungli, hritik14, johnmhoran, jonoyang, keshav-space, mjherzog, pombredanne, pratikrocks, purna135, steven-esser, swastkk, tg1999

deltacode's Issues

Cloning repo on Windows 10 creates 'tcl' directory

When the deltacode repo is cloned on Windows 10, followed by running ./configure.bat --clean and then ./configure.bat, a tcl directory appears after the last step has finished. My local version of the former spats-deltacode repo does not contain a tcl directory, and I didn't encounter that directory during my previous work in spats-deltacode. After checking out a new branch and launching Visual Studio Code, VSCode indicates that there are 976 "pending changes", all of them evidently in this tcl directory.

My current resolution: I've deleted the tcl directory from inside my local branch.

Add deltas_count field

We should have a deltas_count field in JSON output, similar to scancode-toolkit

May need some discussion as to how to handle the -a option

Add end-to-end tests

We need end-to-end tests for both our outputs in various scenarios. This probably can be addressed along with #37, as the end-to-end could be incorporated via mock cli calls.

This ticket stems out of a few minor issues I've run into after we do refactoring. Having end-to-end tests in place makes sure that as we refactor and add features, cli workflows are not broken by our changes.

Check and determine ScanCode options present in a ScanCode data file

In order to compare codebases accurately, Deltacode needs scancode data files that have the full file information available, at the very least.

We also need to know what additional scancode options each Deltacode input file has in order to figure out what other scan data is present and therefore what type of license, copyright or other data we can compare.
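A minimal sketch of such a check, in Python. The key names are assumptions about the scan file layout (a top-level 'scancode_options' mapping, or an 'options' mapping inside the first 'headers' entry) and should be verified against the ScanCode version that produced the file. The helper names are hypothetical.

```python
def scan_options(scan):
    """Return the set of option names that are enabled in a loaded scan dict.

    Assumption: options are stored either under a top-level
    'scancode_options' key or under headers[0]['options'].
    """
    options = scan.get('scancode_options')
    if options is None:
        headers = scan.get('headers') or [{}]
        options = headers[0].get('options', {})
    # Keep only options whose recorded value is truthy.
    return {name for name, value in options.items() if value}


def has_full_file_info(scan):
    """True when the scan was (apparently) run with the --info option."""
    return '--info' in scan_options(scan)
```

With the set of options in hand, DeltaCode could decide which comparisons (license, copyright and so on) are possible for a given pair of input files.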

implement release process

We copy the release.sh from scancode-toolkit

  • remove references to scancode
  • replace with deltacode

Add in additional file-level info for json results

  1. Update License object by removing unnecessary fields.
    a) update License.to_dict() as well
  2. Update License object and License.to_dict() tests
  3. Update Delta object and Delta.to_dict() tests
  4. Update remaining json-based tests (if needed)

Match scoring

When we have a matching set of files in deltacode, we need to have some way of scoring or weighing a match.

This will also come into play more when we start to incorporate license and copyright changes.

This score will ultimately take the place of modified string in our match object.

Add factors to Delta Object

Add a 'factors' field in lieu of 'category'.

The goal here is to append various 'factors' as we run our codebase through the various DeltaCode steps (determine_delta, moved, etc.), primarily so we do not have to adjust category often.

If there are relevant factors that go into a particular File pair in a Delta object (a license addition or change, for instance), then that information will simply be appended to this 'factors' list.

When we go to output, all the Deltas will be sorted in descending order by score and we can simply dump the contents of a Delta's 'factors' field into a cell (in the csv case).
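The idea could be sketched roughly as follows. The Delta class here is a simplified, hypothetical stand-in (the update() method and the path attributes are not taken from the DeltaCode source), and the score values are illustrative only.

```python
class Delta:
    """Hypothetical, simplified stand-in for DeltaCode's Delta object."""

    def __init__(self, new_path=None, old_path=None):
        self.new_path = new_path
        self.old_path = old_path
        self.score = 0
        self.factors = []

    def update(self, score, factor):
        """Record one more finding about this file pair."""
        self.score += score
        self.factors.append(factor)


def csv_rows(deltas):
    """Sort by descending score and dump each factors list into one cell."""
    ordered = sorted(deltas, key=lambda d: d.score, reverse=True)
    return [(d.score, ' '.join(d.factors), d.new_path or '', d.old_path or '')
            for d in ordered]
```

Each DeltaCode step would call update() as it learns something, so no step needs to rewrite a single 'category' value.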

AssertionError when running DeltaCode on eCos scans

I ran ScanCode with the following options (-clipeu) on version 2.0 of eCos and the latest HEAD of the eCos CVS repo. Afterwards, I ran DeltaCode on the report files and I got the following issue:

$ deltacode -n ecos-head.json -o ~/Desktop/ecos-2.0-linux.json -c delta.csv
Traceback (most recent call last):
  File "/home/jono/nexb/tools/develop/deltacode/bin/deltacode", line 11, in <module>
    load_entry_point('deltacode', 'console_scripts', 'deltacode')()
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/cli.py", line 80, in cli
    delta = DeltaCode(new, old)
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/__init__.py", line 23, in __init__
    self.deltas = self.determine_delta()
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/__init__.py", line 103, in determine_delta
    assert len(deltas) == ((self.new.files_count - new_nonfiles) + (self.old.files_count - old_nonfiles) - modified - unchanged)
AssertionError

Attached are the input files I used to get this error:
ecos-scans.zip

Handle 'moved' files

This will need some thinking, but we will want some way to tell if a file has been 'moved' between the new and old scans of some codebase.

This means the sha1 should be matching, but the path would not be. There are also cases where the same file could be present in multiple locations.

We may need to index the files by sha1, similar to what we did in determine_delta
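A rough sketch of the sha1 index approach, assuming each file is a dict with 'sha1' and 'path' keys and that the inputs are the files already classified as added and removed. The function names are hypothetical. Files whose sha1 appears more than once on either side are deliberately left alone, since those cases are ambiguous.

```python
from collections import defaultdict


def index_by_sha1(files):
    """Map each sha1 to the list of file dicts that carry it."""
    index = defaultdict(list)
    for f in files:
        if f.get('sha1'):  # directories and empty entries have no sha1
            index[f['sha1']].append(f)
    return index


def find_moved(added, removed):
    """Return (old_path, new_path) pairs for unambiguous moves."""
    added_index = index_by_sha1(added)
    removed_index = index_by_sha1(removed)
    moved = []
    for sha1, new_files in added_index.items():
        old_files = removed_index.get(sha1, [])
        # Only a one-to-one sha1 match is treated as a move.
        if len(new_files) == 1 and len(old_files) == 1:
            moved.append((old_files[0]['path'], new_files[0]['path']))
    return moved
```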

Add filetype info (or similar) to the output.

Along with type and path, we should add filetype (or something similar) to ease filtering of results. Often we do not care about config files or Makefiles when analyzing a codebase.

Add deltacode and deltacode entrypoint script

Similar to scancode, we need to add a deltacode and ./deltacode.bat top level script.

This entrypoint script allows a user to simply run ./deltacode after cloning or downloading our deltacode repo. Its main responsibility is to handle the initial configuration automatically.

This makes it easier from an end-user point of view. We will also want to include this in our release script.

Identify version changes

[@mjherzog comment] A significant subset of the Added/Removed files in a DeltaCode comparison are likely due to a version change for the same component. This will be complex to solve because the version number may be embedded in the path for a source code directory and/or in the filename for a Development or Deployment component. But this will also be very valuable because upgrading component versions between product releases is extremely common.

Collect errors

[@MaJuRG comment] stop printing error messages to the console, and log the errors instead in the output.

Determine Score implementation

related to: #3

For now, we should simply add small values to our score at the different steps. Later on we can think about subtracting for lack of info (license or otherwise), for Permissive licenses, etc.

Simplify deltacode output

For the csv output, it would be a better presentation if we simply included a single path value, instead of empty or repeated paths that are redundant.

use license scan information to further determine modified.

@mjherzog brought this up in a recent call.

For the 'modified' set of files we should look at the license scan information to further distinguish the type of modification.

For example, two files can effectively be marked unmodified if the license scan information (i.e. license key and/or expression) is the same for both File objects. This type of check is related to #3 as well.
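One possible shape for that check, as a sketch only. It assumes each file dict carries a 'licenses' list whose entries have a 'key' field. Both function names are hypothetical.

```python
def license_keys(file_dict):
    """Return the set of license keys detected in one scanned file."""
    return {lic['key'] for lic in file_dict.get('licenses') or []}


def license_changed(new, old):
    """True when the two files do not carry the same set of license keys."""
    return license_keys(new) != license_keys(old)
```

Comparing sets ignores ordering and repeated detections, so a file whose content changed but whose detected licenses did not would be reported as having no license change.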

Consolidate scoring in 'Delta' method

This issue continues the work started with issue #52, in which a hard-coded score is assigned to the new Delta.score attribute when the various Delta categories are created, i.e., in:

  • DeltaCode.determine_delta(),
  • DeltaCode.update_deltas() and
  • Delta._license_diff().

empty path strings in outputs appear after some alignments

When running

deltacode -n tests/data/deltacode/ecos-failed-counts-assertion-new.json -o tests/data/deltacode/ecos-failed-counts-assertion-old.json -c ~/test.csv

Many modified or license change Deltas show an empty path value in our output. I believe this happens as a result of align_scan(), but I could be wrong. I will investigate further and post more details.

Add date and platform to JSON output

[@johnmhoran comment] In terms of record-keeping, a user might find it helpful if the JSON output includes the date/time at which the JSON was generated and the platform on which DeltaCode was run.
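A small sketch of how such metadata could be gathered with the Python standard library. The field names ('generated_at', 'platform', 'python_version') are placeholders, not an agreed output format.

```python
import platform
from datetime import datetime, timezone


def output_header():
    """Collect run metadata to embed at the top of the JSON output."""
    return {
        # ISO 8601 timestamp in UTC, so results are comparable across machines.
        'generated_at': datetime.now(timezone.utc).isoformat(),
        'platform': platform.platform(),
        'python_version': platform.python_version(),
    }
```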

Make `_license_diff` its own deltacode function.

Similar to determine moved, we move this to DeltaCode.

The primary purpose of this function is simply to modify the Delta score field, depending on the license information between two Files.

We can keep it simple for now, and use a similar algorithm to that currently in _license_diff()

Migrate to using json-to-csv script in lieu of separate formatting option

Instead of maintaining two different 'branches' of output, we should default to json only.

We will still want a way to view the results in csv form, but this can be moved to a separate script that only takes a deltacode json output and converts it to csv.

UX should not change: -c output should still output a csv file. It is in the internals that we would just make a call to our json2csv script instead of write_csv
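A converter along these lines might look like the sketch below. The JSON layout (a top-level 'deltas' list whose entries have 'score', 'factors', 'new' and 'old') and the column names are assumptions for illustration. The real script would follow whatever the JSON output actually contains.

```python
import csv

FIELDS = ['score', 'factors', 'new_path', 'old_path']


def json_to_csv(results, out):
    """Write one CSV row per delta found in a loaded DeltaCode JSON dict."""
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    for delta in results.get('deltas', []):
        new = delta.get('new') or {}  # missing for removed files
        old = delta.get('old') or {}  # missing for added files
        writer.writerow([
            delta.get('score', ''),
            ' '.join(delta.get('factors', [])),
            new.get('path', ''),
            old.get('path', ''),
        ])
```

The -c option could then load the JSON it has just produced and pass it to this function, so only one output path needs to be maintained.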

Collect/calculate additional statistics

From @MaJuRG We can get basic counts currently in deltacode; we need to expand that to additional calculations like % added, removed, etc, % changed, perhaps some sort of codebase 'drift' calculation that incorporates a number of different stats.
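As a starting point, the percentage calculations could be sketched like this. The function takes a plain list of category names. How those are obtained from the Delta objects is left open.

```python
from collections import Counter


def percentages(categories):
    """Return the share of each category, as a percentage of all deltas."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}
```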

Add cli tests

Use the scancode cli.py tests as a model. We mainly want to verify correct cli output in certain scenarios.

Add windows builds

We currently only have linux builds running via Travis CI. We should have all platforms.

Refactor 'generate_csv()'

I think we might be able to reduce the 38 or so lines used in generate_csv() to construct the .csv tuple down to around 13 lines by using a ternary/conditional expression, e.g.,

new = '' if delta['category'] == 'removed' else delta['new']['path']

Initial testing suggests this works as expected -- all 45 tests pass. The refactoring would be applied here:

for delta in deltas:
    category = delta['category']
    if delta['category'] == 'added':
        new = delta['new']['path']
        new_filename = delta['new']['name']
        new_sha1 = delta['new']['sha1']
        new_size = delta['new']['size']
        new_type = delta['new']['type']
        new_orig = delta['new']['original_path']
        old = ''
        old_filename = ''
        old_sha1 = ''
        old_size = ''
        old_type = ''
        old_orig = ''
    elif delta['category'] == 'removed':
        new = ''
        new_filename = ''
        new_sha1 = ''
        new_size = ''
        new_type = ''
        new_orig = ''
        old = delta['old']['path']
        old_filename = delta['old']['name']
        old_sha1 = delta['old']['sha1']
        old_size = delta['old']['size']
        old_type = delta['old']['type']
        old_orig = delta['old']['original_path']
    else:
        new = delta['new']['path']
        new_filename = delta['new']['name']
        new_sha1 = delta['new']['sha1']
        new_size = delta['new']['size']
        new_type = delta['new']['type']
        new_orig = delta['new']['original_path']
        old = delta['old']['path']
        old_filename = delta['old']['name']
        old_sha1 = delta['old']['sha1']
        old_size = delta['old']['size']
        old_type = delta['old']['type']
        old_orig = delta['old']['original_path']
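Going one step beyond the per-variable conditional expression, the same logic could be sketched with a small helper that applies the condition once per side. The helper names and the column order of the resulting tuple are assumptions. The dictionary keys are the ones used in the loop above.

```python
KEYS = ('path', 'name', 'sha1', 'size', 'type', 'original_path')


def side_values(delta, side, empty_when):
    """Return the six values for one side, or blanks for the missing side."""
    if delta['category'] == empty_when:
        return ('',) * len(KEYS)
    return tuple(delta[side][key] for key in KEYS)


def csv_row(delta):
    """Build one row: category, then the 'new' values, then the 'old' values."""
    new = side_values(delta, 'new', empty_when='removed')
    old = side_values(delta, 'old', empty_when='added')
    return (delta['category'],) + new + old
```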

Failing case if extra directory is added

I did some simple tests and here is my finding:

I use "balloontip-1.1.1.jar" as a sample file.

  1. Created two directories, d1/ and d2/, put the test file in each, and compared the two directories. The output is unchanged, which is correct.

  2. Same setup as (1), but with a new subdirectory named test/ created under d1/ and balloontip-1.1.1.jar placed in it. Both
    d1/balloontip-1.1.1.jar
    d1/test/balloontip-1.1.1.jar
    are returned as added, and d2/balloontip-1.1.1.jar is returned as removed. This is not correct: d1/balloontip-1.1.1.jar and d2/balloontip-1.1.1.jar should be returned as unchanged, while only d1/test/balloontip-1.1.1.jar should be considered added.

  3. Same setup as (1), but with a new root/ directory created, d1/ placed in it, and deltacode run from root/ to d2/. The output is unchanged, which is correct.

Adjust License Diff to account for License Category

Auditors care more about Copyleft and Proprietary licenses showing up in a codebase. We need to adjust our scoring so that more emphasis is given to:

  • 'no license' -> 'copyleft limited or higher'
  • 'permissive' -> 'copyleft limited or higher'
  • 'anything' -> 'proprietary/commercial'

There are probably other combinations here as well.
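The three transitions listed above could be expressed roughly as follows. The category names, the function name and all of the weights are illustrative assumptions. None of the numbers come from DeltaCode.

```python
# Illustrative groupings only; adjust to the category names in the scan data.
LOW = {'none', 'permissive'}
HIGH = {'copyleft limited', 'copyleft'}
RESTRICTED = {'proprietary', 'commercial'}


def category_bonus(old_category, new_category):
    """Return extra score for a license category transition of interest."""
    old = (old_category or 'none').lower()
    new = (new_category or 'none').lower()
    if new in RESTRICTED and old not in RESTRICTED:
        return 40  # anything -> proprietary/commercial
    if old in LOW and new in HIGH:
        # no license -> copyleft is weighted above permissive -> copyleft
        return 30 if old == 'none' else 20
    return 0
```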

increment counter in determine_delta()

Instead of decrementing a counter in determine_delta(), we should increment it and check it against files_count at the end, instead of against 0.

We should also add some error message when the assertion fails.
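A sketch of the incrementing form, with a message attached to the assertion. The function is a hypothetical stand-in, not the body of determine_delta(). Only the files_count name comes from the issue.

```python
def count_processed(files, files_count):
    """Count files upward and verify the total against files_count."""
    processed = 0
    for _ in files:
        processed += 1
    assert processed == files_count, (
        'processed {} files but the scan reports files_count={}'.format(
            processed, files_count))
    return processed
```

When the counts disagree, the AssertionError then states both numbers instead of arriving with an empty message.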

Flatten DeltaCode.deltas field

Once we have the basics of scoring (#52), we can move on to flattening the deltas field of DeltaCode. This means moving from a dictionary of lists to just a single list of Delta objects.

Prior to output or perhaps prior to assignment, we will want to sort this list by Delta.score to preserve order and/or cull entries we do not need (if a user specifies --all for instance)
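The flattening step might be sketched as below, using plain dicts with a 'score' key in place of Delta objects. Treating zero-score entries as the ones to cull when --all is not given is an assumption about the intended behaviour.

```python
from itertools import chain


def flatten_deltas(deltas_by_category, include_all=False):
    """Turn a dict of lists into one list sorted by descending score."""
    flat = list(chain.from_iterable(deltas_by_category.values()))
    if not include_all:
        # Assumed culling rule: drop entries that scored nothing.
        flat = [d for d in flat if d['score'] > 0]
    return sorted(flat, key=lambda d: d['score'], reverse=True)
```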

Only pass score during Delta creation.

change to Delta(score, new_file, old_file) in all the places that it was created.

This involves removing 'category' from the Delta object

Our initial scores should be simple for now: an Added should add 100 to a score (which is 0 by default). A Removed will simply not add any value to a score. Both cases will also have the corresponding factor, 'added' or 'removed', appended to Delta.factors.

During license_diff and other DeltaCode steps, we will simply add values to the score as things are found out about the Delta.

Identify dupes

We have means of (crudely) identifying moved files. However, this only looks at the cases where there is a single file per sha1 value.

We need a way of handling files that appear multiple times on either or both sides of the scans.
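A first step could be simply reporting which sha1 values occur more than once on either side, as in this sketch. It assumes file dicts with 'sha1' and 'path' keys, and the function names are hypothetical. Deciding how to pair the duplicates is left to the follow-up tickets.

```python
from collections import defaultdict


def paths_by_sha1(files):
    """Map each sha1 to every path that carries it."""
    groups = defaultdict(list)
    for f in files:
        if f.get('sha1'):
            groups[f['sha1']].append(f['path'])
    return groups


def find_dupes(new_files, old_files):
    """Return sha1 -> paths for content duplicated in the new or old scan."""
    new_groups = paths_by_sha1(new_files)
    old_groups = paths_by_sha1(old_files)
    dupes = {}
    for sha1 in set(new_groups) | set(old_groups):
        new_paths = new_groups.get(sha1, [])
        old_paths = old_groups.get(sha1, [])
        if len(new_paths) > 1 or len(old_paths) > 1:
            dupes[sha1] = {'new': sorted(new_paths), 'old': sorted(old_paths)}
    return dupes
```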

This will need to be broken into smaller tickets
