
shadedetector's Introduction

ShadeDetector -- A Tool to Detect Vulnerabilities in Cloned or Shaded Components

Overview

The tool takes the coordinates of a Maven artifact (GAV = GroupId + ArtifactId + Version) and a testable proof-of-vulnerability (TPOV) project as input, and infers and reports a list of artifacts that clone or shade the input artifact and are exposed to the same vulnerability. For each such artifact, a TPOV is constructed from the original TPOV, proving the presence of the vulnerability.

Testable Proof-of-vulnerability Projects (TPOV)

The Structure of a TPOV

TPOVs make a vulnerability testable. Each TPOV has the following structure (a minimal sketch follows the list):

  1. a TPOV is a simple (i.e. non-modular) Maven project.
  2. a TPOV has a dependency on the vulnerable artifact.
  3. a TPOV has a test-scope dependency on JUnit 5; other dependencies should be avoided or minimised.
  4. a TPOV has one or more tests that either all succeed or all fail if and only if the vulnerability can be exploited -- i.e. the vulnerability becomes the test oracle. Those tests may be the only classes defined in a TPOV. The test outcome (success or failure) that indicates vulnerability is specified by the testSignalWhenVulnerable element of its pov-project.json metadata file.
  5. a TPOV test may declare dependencies on certain OS or JRE versions using standard JUnit annotations such as @EnabledOnOs or @EnabledOnJre.
  6. sources in a TPOV should not directly use fully qualified class names; imports should be used instead (this aids the tool in automatically refactoring dependencies).
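
For illustration, here is a minimal sketch of what a TPOV test class might look like (the class, method and helper names are hypothetical; the actual test logic depends on the CVE being demonstrated):

    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.assertTrue;

    // Hypothetical TPOV test: per point 4 above, the test outcome that indicates
    // the vulnerability is declared in pov-project.json (testSignalWhenVulnerable).
    public class VulnerabilityTest {

        @Test
        public void vulnerabilityIsExploitable() {
            // Exercise the vulnerable API here, using imports rather than
            // fully qualified class names (point 6 above).
            assertTrue(runExploit());
        }

        // Hypothetical helper containing the CVE-specific exploit logic.
        private boolean runExploit() {
            return true; // placeholder
        }
    }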

Sourcing TPOVs

  1. some TPOVs can be found here: https://github.com/jensdietrich/xshady/
  2. there are numerous proof-of-vulnerability (POV) projects on GitHub, such as https://github.com/frohoff/ysoserial; usually those projects need to be modified to make them TPOVs as described above
  3. this is a collection of POVs: https://github.com/tuhh-softsec/vul4j; see also Bui QC, Scandariato R, Ferreyra NE. Vul4J: a dataset of reproducible Java vulnerabilities geared towards the study of program repair techniques. MSR'22.

Building

The project must be built with Java 11 or later. To build, run mvn package. This will create the executable shadedetector.jar in target/.

Running

usage: java -cp <classpath> Main [-a <arg>] [-bc <arg>] [-bs <arg>] [-c <arg>] [-cache <arg>] [-env <arg>] [-fa <arg>] [-fc <arg>] [-fdm <arg>] [-g
       <arg>] [-l <arg>] [-mcc <arg>] [-msc <arg>] [-o <arg>] [-o1 <arg>] [-o2 <arg>] [-o3 <arg>] [-pl <arg>] [-ps <arg>] [-r <arg>] [-s <arg>] [-sig
       <arg>] [-v <arg>] [-vg <arg>] -vov <arg> -vul <arg> [-vv <arg>]
Arguments:

 -a,--artifact <arg>                      the Maven artifact id of the artifact queried for clones (default read from PoV's pov-project.json)
 -bc,--batchcount <arg>                   the number of by-class REST API search query batches per candidate (optional, default is 5)
 -bs,--batchsize <arg>                    the maximum number of rows requested in each by-class REST API search query batch (optional, default is 200)
 -c,--clonedetector <arg>                 the clone detector to be used (optional, default is "ast")
 -cache,--cachedir <arg>                  path to root of cache folder hierarchy (default is ".cache")
 -env,--testenvironment <arg>             a property file defining environment variables used when running tests on generated projects used to verify
                                          vulnerabilities, for instance, this can be used to set the Java version
 -fa,--filterartifacts <arg>              a regex restricting the artifact GAVs to be considered (non-matching GAVs will be discarded). For debugging.
 -fc,--filterclassnames <arg>             a regex restricting the class names to be considered (non-matching class names will be discarded). For
                                          debugging.
 -fdm,--finaldirmode <arg>                how to construct the contents of the final directory specified with -vov (optional, one of COPY, SYMLINK,
                                          OLD_UNSAFE_MOVE_AND_RETEST; default is COPY)
 -g,--group <arg>                         the Maven group id of the artifact queried for clones (default read from PoV's pov-project.json)
 -l,--log <arg>                           a log file name (optional, if missing logs will only be written to console)
 -mcc,--minclonedclasses <arg>            the minimum number of classes detected as clones needed to trigger compilation and testing (optional,
                                          default is 11)
 -msc,--maxsearchclasses <arg>            the maximum number of class names to search via the REST API per candidate (optional, default is 10)
 -o,--output <arg>                        the component used to process and report results (optional, default is "log")
 -o1,--output1 <arg>                      an additional component used to process and report results
 -o2,--output2 <arg>                      an additional component used to process and report results
 -o3,--output3 <arg>                      an additional component used to process and report results
 -pl,--povlabel <arg>                     the label for this PoV (output will go under a subdir having this name; default is the basename of the path
                                          specified with -vul)
 -ps,--stats <arg>                        the file to which progress stats will be written (default is "stats.log")
 -r,--resultconsolidation <arg>           the query result consolidation strategy to be used (optional, default is "moreThanOne")
 -s,--classselector <arg>                 the strategy used to select classes (optional, default is "complexnames")
 -sig,--vulnerabilitysignal <arg>         the test signal indicating that the vulnerability is present, must be one of: SUCCESS, FAILURE, ERROR
                                          (default read from testSignalWhenVulnerable in PoV's pov-project.json)
 -v,--version <arg>                       the Maven version of the artifact queried for clones (default read from PoV's pom.xml)
 -vg,--vulnerabilitygroup <arg>           the group name used in the projects generated to verify the presence of a vulnerability (default is "foo")
 -vov,--vulnerabilityoutput_final <arg>   the root folder where, for each clone, the project created in the build cache folder will be
                                          copied/symlinked/moved if verification succeeds (i.e. if the vulnerability is shown to be present)
 -vul,--vulnerabilitydemo <arg>           a folder containing a Maven project that verifies a vulnerability in the original library with test(s), and
                                          can be used as a template to verify the presence of the vulnerability in a clone; values for -g, -a, -v and
                                          -sig are read from any contained pov-project.json
 -vv,--vulnerabilityversion <arg>         the version used in the projects generated to verify the presence of a vulnerability (default is "0.0.1")
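
For example, a minimal invocation needs only the two required arguments, -vul and -vov (the paths below are placeholders); values for -g, -a, -v and -sig are then read from the TPOV's pov-project.json if present:

    java -jar target/shadedetector.jar -vul path/to/CVE-XXXX-YYYY -vov vuln_final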

Setting the Environment

With -env, an environment can be set to be used to build / test the TPOVs. If TPOV tests require a Java version different from the one used to run the tool, this can be used to set JAVA_HOME to point to a particular version of the Java Development Kit (a full JDK, not just a JRE, since TPOVs are compiled).
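
For example, a property file along these lines (the file name and JDK path are placeholders) could be passed with -env jdk17.properties:

    # jdk17.properties: environment variables for building/testing the generated projects
    JAVA_HOME=/usr/lib/jvm/java-17-openjdk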

Known Issues

In principle the tool can be run with Java 11. However, we did encounter rare cases where the analysis gets stuck and eventually fails with an OutOfMemoryError. This seems to be caused by a bug in the zip file system in Java 11. We recommend using Java 17 if this is a problem.

It is also possible to add artifacts to nz.ac.wgtn.shadedetector.Blacklist to exclude them from the analysis.

Caching

To (dramatically) speed up subsequent runs, shadedetector caches:

  1. Maven Central Repo REST API queries, by default in 5 batches of up to 200 results each
  2. Candidate clone artifact pom.xmls and sources
  3. Maven builds and test results of clone TPOVs

By default, all caching is done under the directory .cache in the current directory, but this can be changed with -cache.

NOTE: Multiple concurrent invocations of shadedetector will cooperate in using the cache safely -- but only if the cache root directory is on a local filesystem. NFS and possibly other network-based filesystems lack the guarantees of atomicity needed. A local filesystem will be much faster in any case.

Final output directory

By default, for each vulnerable cloned artifact, shadedetector copies its cached build directory into <finalDir>/<povLabel>/<safeName>, where <finalDir> is the directory specified with -vov, <povLabel> is the label for the vulnerability (which defaults to the basename of the path given to -vul but can be changed with -povlabel), and <safeName> is a name constructed from the vulnerable GAV by replacing each colon with __. For large batches of runs, specifying --finaldirmode SYMLINK will instead symlink to the cached build dirs, saving disk space.
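
For example, using the PoV and clone names that appear in the issues below: the vulnerable clone com.github.lafa.tikaNoExternal:tika-parsers:1.0.1, found for the PoV ../xshady/CVE-2018-8017, would end up in:

    <finalDir>/CVE-2018-8017/com.github.lafa.tikaNoExternal__tika-parsers__1.0.1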

Customising / Extending

Several strategies are implemented as pluggable services. That is, each strategy is described by an interface, with service providers declared in library manifests; see for instance src/main/resources/META-INF/services for the onboard default providers. Each provider has a unique name that can be used as an argument value in the CLI. All interfaces are defined in nz.ac.wgtn.shadedetector. The service is selected by a factory nz.ac.wgtn.shadedetector.<Service>Factory that also defines the default service provider.

| Service | Interface | CLI Argument(s) | Description | Default |
|---|---|---|---|---|
| result reporter | ResultReporter | -o, -o1, -o2, -o3 | consumes analysis results, e.g. to generate reports | report results using standard log4j logging |
| class selector | ClassSelector | -s | selects the classes from the input artifact to be used to query Maven for potential clones | pick the 10 classes with the highest number of camel-case tokens (i.e. complex class names) |
| clone detector | CloneDetector | -c | the clone detector used to compare two source code files (one from the input artifact, one from a potential clone) | custom AST-based clone detection that ignores comments and package names in type references |
| consolidation strategy | ArtifactSearchResultConsolidationStrategy | -r | the strategy used to consolidate the artifact sets obtained by REST queries for a single class into a single set | an artifact must appear in at least two sets |
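
For example, a custom clone detector could be registered by listing its fully qualified class name (com.example.MyCloneDetector is hypothetical) in the corresponding provider file; its unique name then becomes available as a value for -c:

    # src/main/resources/META-INF/services/nz.ac.wgtn.shadedetector.CloneDetector
    com.example.MyCloneDetector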

Some services can be customised further by setting properties (corresponding to bean properties in the respective service provider classes). For instance, consider the following arguments setting up output reporting:

  -o csv.details?dir=results/details/CVE-2022-45688-commonstext -o1 csv.summary?file=results/summary-CVE-2022-45688-commonstext.csv

This sets up two reporters named csv.details (corresponding to nz.ac.wgtn.shadedetector.resultreporting.CSVDetailedResultReporter) and csv.summary (corresponding to nz.ac.wgtn.shadedetector.resultreporting.CSVSummaryResultReporter), respectively. This is followed by a configuration consisting of &-separated key-value pairs, setting properties of the respective instance. In this case, the files / folders where reports are to be generated are set.

shadedetector's People

Contributors

wtwhite, jensdietrich, alexjordan, unshorn


shadedetector's Issues

check CVE-2016-6802 results

At the moment the initial query returns zero results. Assuming that the original vulnerable component is in the Maven repo, this would indicate that the class index is incomplete. Please double check.

Extend `-sig auto` to read GAV from `pov-project.json` too

-sig auto already allows the user to read the value of testSignalWhenVulnerable from the original PoV's pov-project.json, instead of having to specify it on the command line, which is tedious and error-prone. This mechanism could be extended to read more metadata from pov-project.json, like the GAV coords of the vulnerable artifact, to save having to specify them on the command line with -g, -a, -v.

Proposal: Get rid of -sig auto and automatically read G, A, V and testSignalWhenVulnerable values from the PoV specified with -vul whenever it has a pov-project.json. Command-line params like -a, etc., can still be supplied to override those values (and will remain necessary for PoVs lacking a pov-project.json).
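
Under this proposal, a pov-project.json might look like the following sketch (testSignalWhenVulnerable is the documented element; the exact key names for the GAV coordinates are an assumption, with values taken from the CVE-2018-8017 example further below):

    {
      "groupId": "org.apache.tika",
      "artifactId": "tika-parsers",
      "version": "1.18",
      "testSignalWhenVulnerable": "success"
    }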

765 of 938 class names searched appear in > 1000 artifacts

With our current limit of 1000 artifacts, this means (a) we miss a large fraction of containing artifacts for the majority of class names, and (b) there is a good chance that we might see a different subset of 1000 artifacts each time the query is run.

In fact, 355 class names appear in over 10000 artifacts, and 10 appear in over 100000:

wtwhite@wtwhite-vuw-vm:~/code/shadedetector/.cache/artifacts-using-classes$ ls *.json|wc -l
938
wtwhite@wtwhite-vuw-vm:~/code/shadedetector/.cache/artifacts-using-classes$ for f in *.json; do jq .response.numFound $f; done | perl -lne 'print if $_ > 1000' | wc -l
765
wtwhite@wtwhite-vuw-vm:~/code/shadedetector/.cache/artifacts-using-classes$ for f in *.json; do jq .response.numFound $f; done | perl -lne 'print if $_ > 10000' | wc -l
355
wtwhite@wtwhite-vuw-vm:~/code/shadedetector/.cache/artifacts-using-classes$ for f in *.json; do jq .response.numFound $f; done | perl -lne 'print if $_ > 100000' | wc -l
10

The most frequent of these classes appear in over 200000 artifacts:

wtwhite@wtwhite-vuw-vm:~/code/shadedetector/.cache/artifacts-using-classes$ for f in *.json; do jq .response.numFound $f; done | sort -nr | head
214868
214868
214868
214868
214868
211143
211143
211143
211143
211143
wtwhite@wtwhite-vuw-vm:~/code/shadedetector/.cache/artifacts-using-classes$ for f in *.json; do jq .response.numFound $f; done | sort -nr | tail
38
38
38
38
38
13
13
13
13
13

Proposal: Class names that appear extremely frequently are inherently low-value. We could completely discard from consideration all class names that appear in more than, say, 1000 artifacts, by detecting this from numFound in the first REST API batch response and discarding the class name at that point. This approach is basically "TF-IDF lite". It could be a win-win-win (a sketch of the check follows this list):

  1. We avoid most of the slowest REST API queries. Currently we run exactly 5 queries per class name, but this change will turn many of those into single queries -- and the remaining batches of 5 queries, which by definition will find <= 1000 results, could be sped up to issue only the number of queries actually required, as per #23. These time savings could be used to increase the threshold, e.g., to 10000.
  2. For class names appearing in over 1000 artifacts, we avoid any run-to-run random variation in which subset of 1000 artifacts is retrieved.
  3. These extremely common class names are likely to be very weak, or even misleading, signals of cloning anyway.
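
A minimal sketch of the proposed filter (the method and constant names are hypothetical):

    // Hypothetical sketch: after the first by-class REST API batch response,
    // discard the class name entirely if numFound exceeds the threshold.
    static final long MAX_ARTIFACTS = 1000;

    static boolean keepClassName(long numFound) {
        return numFound <= MAX_ARTIFACTS;
    }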

Try heuristically guessed possible locations for artifact source code

The 429 java.lang.IllegalStateException: no source code found for artifact errors from #32 show that we frequently miss downloading the sources for artifacts. If the source location suffix specified in the REST API responses doesn't work, try alternative locations (e.g., appending -src.jar instead of -sources.jar) -- the Maven Central Repo's record could be wrong.

All PoVs with `testSignalWhenVulnerable == 'failure'` are being ignored

main() calls MVNProjectCloner.cloneMvnProject(), which winds up running mvn test. If the cloned artifact is vulnerable (i.e., the interesting case) and the PoV has testSignalWhenVulnerable == 'failure', the test will fail, causing mvn test to fail with error code 1 -- which currently leads to the artifact candidate not being marked as TESTED and thus being skipped from further processing in main().

As discussed with @jensdietrich, in the age of testSignalWhenVulnerable, we need to look at its value, as well as the actual Surefire test results, instead of just the exit code.

Currently we don't know any definite false negatives to test with. But a true positive useful for regression testing is given by:

time java -jar target/shadedetector.jar -g org.apache.tika -a tika-parsers -v 1.18 -vul ../xshady/CVE-2018-8017 -sig success -l log6-CVE-2018-8017.log -vos /home/whitewa/code/shadedetector/vuln_staging -vov /home/whitewa/code/shadedetector/vuln_final

which (after 2.5 hours) produced the following vulnerable clones:

[whitewa@piccolo ~/code/shadedetector]$ ls vuln_final/
com.github.lafa.tikaNoExternal__tika-parsers__1.0.0   com.github.lafa.tikaNoExternal__tika-parsers__1.0.15  com.github.lafa.tikaNoExternal__tika-parsers__1.0.5
com.github.lafa.tikaNoExternal__tika-parsers__1.0.1   com.github.lafa.tikaNoExternal__tika-parsers__1.0.16  com.github.lafa.tikaNoExternal__tika-parsers__1.0.6
com.github.lafa.tikaNoExternal__tika-parsers__1.0.10  com.github.lafa.tikaNoExternal__tika-parsers__1.0.17  com.github.lafa.tikaNoExternal__tika-parsers__1.0.7
com.github.lafa.tikaNoExternal__tika-parsers__1.0.11  com.github.lafa.tikaNoExternal__tika-parsers__1.0.18  com.github.lafa.tikaNoExternal__tika-parsers__1.0.8
com.github.lafa.tikaNoExternal__tika-parsers__1.0.12  com.github.lafa.tikaNoExternal__tika-parsers__1.0.2   com.github.lafa.tikaNoExternal__tika-parsers__1.0.9
com.github.lafa.tikaNoExternal__tika-parsers__1.0.14  com.github.lafa.tikaNoExternal__tika-parsers__1.0.4

I'd like to add a test that fails before and passes after this change, but there's some time pressure so that may have to wait.

Too many scripts for running SCA tools, and producing summary tables

3 different repos have some type of script for running the 4 existing SCA tools, which is confusing.

For example, the history across all 3 repos of changes to scripts that run the OWASP Dependency-Check tool (which originally used their binary, but switched to using their Maven plugin):

wtwhite@wtwhite-vuw-vm:~/code/xshady-prerelease$ alias log='git log --follow --pretty="%ad $(basename $(git rev-parse --show-toplevel)) %h %s" --date=iso8601 --'
wtwhite@wtwhite-vuw-vm:~/code/xshady-prerelease$ ( ( cd ~/code/xshady-prerelease; log run-owasp.sh ); ( cd ~/code/xshady; log run-owasp.sh ); ( cd ~/code/shadedetector; log sca/run-owasp-dependencycheck.sh ) ) | sort
2023-04-14 10:41:42 +1200 shadedetector ea69f51 added scripts to run snyk and owasp dependency plugin on all generated projects
2023-04-25 14:06:09 +0200 shadedetector a947a7c Move SCA scripts outside resources
2023-05-03 06:53:06 +0200 xshady-prerelease d916228a added owasp reports and script
2023-05-11 17:32:33 +1200 xshady-prerelease 10e72ff4 updates sca analysis results
2023-05-13 10:43:40 +1200 xshady 04cdede updated owasp dependency-check reports (switched to maven owasp dependency check plugin instead of dependency-check cli
2023-05-17 10:34:21 +1200 xshady-prerelease 78bba197 adds new owasp reports, new reports for CVE-2015-6420, summary scripts
2023-05-25 09:49:00 +1200 shadedetector 72f4366 moved code one level up
2023-10-02 20:29:59 +1300 shadedetector a4e952b Handle CVE-* subdirs for OWASP Dependency Check

I propose making sure that sca/Makefile in this repo (shadedetector) does everything correctly (including being able to work on both the original TPoVs in the xshady repo, and newly discovered PoVs), and then deleting all these scripts to reduce confusion.

Relatedly, there are 2 different scripts to generate summary tables of results. They look to be doing very similar things, but there could be subtle differences (one is Java, the other is shell that calls jq).

Increase timeout on slow classname search REST API

The https://search.maven.org/solrsearch/select endpoint that shadedetector uses to find definitions of a given classname in the Maven central repo can be very slow, even on repeated queries, and this causes timeout errors. I can repro with curl -- I see successful responses that take up to 19s:

C:\Users\walto\Documents\code\shadedetector>curl "https://search.maven.org/solrsearch/select?q=c%3AXmlTypeResolverBuilder&wt=json&rows=200&start=201" > retry_shadedetector_with_curl.out
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 90112    0 90112    0     0   4522      0 --:--:--  0:00:19 --:--:-- 19770

C:\Users\walto\Documents\code\shadedetector>curl "https://search.maven.org/solrsearch/select?q=c%3AXmlTypeResolverBuilder&wt=json&rows=200&start=201" > retry_shadedetector_with_curl2.out
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 90112    0 90112    0     0   7505      0 --:--:--  0:00:12 --:--:-- 21153

C:\Users\walto\Documents\code\shadedetector>

(I originally thought the issue was the networking setup of the Linux VM I'm using, but I could repro it in the Windows host.)

Make `findVersions()` caching atomic

Although by-class search caching is now cross-process atomic, by-group-and-artifact search caching isn't yet. Fixing this is low-priority however, since this code path is run only once, at the start of the program, on the particular PoV supplied on the command line -- and under normal usage that will be unique per shadedetector invocation.

Some temp dirs are not being deleted

Exactly 284 temp dirs were left in each of 2 early PoVs (CVE-2013-2186 and CVE-2014-0050) in a run -- so 568 dirs in total. These 2 PoVs created 1228 temp dirs in total, the rest of which were deleted. All temp dirs created while processing the remaining 27 PoVs were also deleted.

[whitewa@piccolo /local/scratch/whitewa/shadedetector/.cache]$ find . -maxdepth 1 -name 'ziptmp*' -mmin +60|perl -lpe 's/^..//' > dirs_that_should_have_been_auto_deleted.txt
[whitewa@piccolo /local/scratch/whitewa/shadedetector/.cache]$ cd ~/code/shadedetector/runs
[whitewa@piccolo ~/code/shadedetector/runs]$ grep -Ff /local/scratch/whitewa/shadedetector/.cache/dirs_that_should_have_been_auto_deleted.txt */log*|grep Unzipping|cut -d: -f1|uniq -c
    284 22_all_except_CVE-2017-18349_and_CVE-2018-8017/log100-CVE-2013-2186.log
    284 22_all_except_CVE-2017-18349_and_CVE-2018-8017/log102-CVE-2014-0050.log
[whitewa@piccolo ~/code/shadedetector/runs]$ grep Unzipping 22_all_except_CVE-2017-18349_and_CVE-2018-8017/log10[02]*.log|wc -l
1228

I suspect the cause is either that there is some strange filename common to these temp dirs, or an open filehandle to them. Fixing this is not a high priority as it's easy to delete them by hand later.

Feature idea: Compare candidate source files against all vulnerable versions of template

Often there are multiple vulnerable versions of an artifact. These versions will likely have differences in some source files -- and a clone could have been made from any one of them.

At the moment, we choose one vulnerable version (generally the latest) of an artifact as the template to put in the xshady repo, and hope that clones were either cloned from that version, or that enough files were not materially (w.r.t. ASTBasedCloneDetector) changed between the chosen version and the version that was cloned -- but this could miss clones of other vulnerable versions.

Example of a hit that was nearly a miss: https://github.com/jensdietrich/xshady-prerelease/issues/29 caught io.takari:commons-compress:1.12 as a clone of org.apache.commons:commons-compress:1.15 despite only 23 of 162 same-named source file pairs being AST-equivalent. Based on the version numbers, it's likely that the former is a clone of version 1.12 of the latter; had a few more files been changed between versions 1.12 and 1.15 of org.apache.commons:commons-compress, the count would have dropped below 11 and this clone would have been missed. OTOH, it's likely that, had we used version 1.12 of org.apache.commons:commons-compress as the template, a larger number of source files would have been found AST-equivalent.

Implementation note: We only need to compare a candidate source file with each distinct vulnerable version of the target file. This could save a lot of work since many source files will not change across versions.

Experiment with internal Maven Central Repo REST APIs

@jensdietrich's student Mohammad found some internal Sonatype REST APIs that might be useful/more reliable than the external ones we're currently using -- see email from 26-09-2023.

Currently there are internal APIs for retrieving package versions and dependencies. The most useful would be an API for finding all artifacts using a class with a given name, as this is the flakiest API currently, and when it fails it has the most impact on our results -- but we may have to experiment to see if we can discover it ourselves.

Sources for CVE-2016-6802 PoV contain no Java code

Noticing that log110-CVE-2016-6802.log was very short and contains "0 potential matches found", I looked at the downloaded source for its PoV, org.apache.shiro:shiro-all:1.3.1, and found that it contains no Java code at all -- just some META-INF files:

wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/2_first_big_run$ unzip -l /home/wtwhite/code/shadedetector/.cache/src/org.apache.shiro/shiro-all/1.3.1/shiro-all-1.3.1-sources.jar
Archive:  /home/wtwhite/code/shadedetector/.cache/src/org.apache.shiro/shiro-all/1.3.1/shiro-all-1.3.1-sources.jar
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2016-08-19 13:42   META-INF/
      136  2016-08-19 13:42   META-INF/MANIFEST.MF
     3278  2016-08-19 13:42   META-INF/DEPENDENCIES
    11358  2016-08-19 13:42   META-INF/LICENSE
      183  2016-08-19 13:42   META-INF/NOTICE
---------                     -------
    14955                     5 files

The issue is specifically with the sources -- running mvn test in ~/code/xshady/CVE-2016-6802 produces test failure as expected, and rerunning after changing the version to 1.3.2 produces test success as expected.

Something to look into further if we are looking to boost recall.

Improved error handling and process stage logging

Log specific ways in which the pipeline can fail by adding new categories to Main.ProcessingStage:

SOURCES_FETCHED
SOURCES_EXPANDED

Refactor: *TESTED > VULNERABILITY_CONFIRMED

Refactor: POC. > POV

Refactor: make ProcessingStage top-level

Changes to vulnerable versions in TPoVs may impact results

While investigating #57 I noticed that the set of classes queried changes after changing the vulnerable version in a TPoV's pom.xml.

Based on the following recent such changes, this could affect results (up or down) for the CVEs touched by the commits below. Check the numbers for each.

wtwhite@wtwhite-vuw-vm:~/code/xshady$ git log -5 --stat -- '**/pom.xml'
commit a53535085c7854dfdd7a251b05ca908d6a60d9ea
Author: Tim White <[email protected]>
Date:   Thu Sep 28 22:31:38 2023 +1300

    Fix last vulnerable version

 CVE-2019-0225/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 7f5aaac2dfaa03222fdfff950b5fb8432d362b56
Author: Tim White <[email protected]>
Date:   Thu Sep 28 14:31:30 2023 +1300

    Increase vulnerable version in pom.xml to 1.30, and add fixVersion, based on `mvn clean test`
    
    fixVersion based on comment at https://bitbucket.org/snakeyaml/snakeyaml/issues/525/got-stackoverflowerror-for-many-open.

 CVE-2022-38749/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 850d9b1e186dacdee44327b727763bba861811b9
Author: Tim White <[email protected]>
Date:   Thu Sep 28 14:22:30 2023 +1300

    Increase vulnerable version in pom.xml to 24.1-jre, and add fixVersion, based on `mvn clean test`
    
    fixVersion based on https://nvd.nist.gov/vuln/detail/cve-2018-10237. "-jre" appended based on Guava release naming scheme described at https://github.com/google/guava#adding-guava-to-your-build.

 CVE-2018-10237/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 52d19fe6b008509ce903d9972e0061e6ff08e92c
Author: Tim White <[email protected]>
Date:   Thu Sep 28 09:32:07 2023 +1300

    Update pom.xml to later version (1.30) that is still vulnerable

 CVE-2022-38751/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 013ffab66459e87bac86ab424a3a30b3cb5acaf9
Author: Tim White <[email protected]>
Date:   Mon Sep 4 12:15:31 2023 +1200

    Add CVE-2016-0779 PoV. Fail->Succeed on 1.7.3->1.7.4 and 7.0.0-M2->7.0.0-M3.

 CVE-2016-0779/pom.xml | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

Emptying blacklist results in exceptions

Emptying the blacklist now results in exceptions like the following, which nevertheless do not prevent the rest of the analysis from running normally:

00:00:10.569 [main] DEBUG nz.ac.wgtn.shadedetector.ArtifactSearch - Unzipping /local/scratch/whitewa/shadedetector/.cache/src/dev.dejvokep/boosted-yaml/1.1/boosted-yaml-1.1-sources.jar to /local/scratch/whitewa/shadedetector/.cache/ziptmp3006452971802906212, will delete on exit
00:00:10.574 [main] ERROR nz.ac.wgtn.shadedetector.Main - cannot fetch sources for artifact dev.dejvokep:boosted-yaml:1.1
net.lingala.zip4j.exception.ZipException: illegal file name that breaks out of the target directory: ./
        at net.lingala.zip4j.tasks.AbstractExtractFileTask.extractFile(AbstractExtractFileTask.java:56)
        at net.lingala.zip4j.tasks.ExtractAllFilesTask.executeTask(ExtractAllFilesTask.java:41)
        at net.lingala.zip4j.tasks.ExtractAllFilesTask.executeTask(ExtractAllFilesTask.java:17)
        at net.lingala.zip4j.tasks.AsyncZipTask.performTaskWithErrorHandling(AsyncZipTask.java:51)
        at net.lingala.zip4j.tasks.AsyncZipTask.execute(AsyncZipTask.java:45)
        at net.lingala.zip4j.ZipFile.extractAll(ZipFile.java:469)
        at net.lingala.zip4j.ZipFile.extractAll(ZipFile.java:440)
        at nz.ac.wgtn.shadedetector.Utils.extractFromZipToTempDir(Utils.java:126)
        at nz.ac.wgtn.shadedetector.Main.main(Main.java:408)

@jensdietrich: Should we empty the blacklist (and leave these exceptions), try to catch and log exceptions, or leave the blacklist as-is?

shadedetector no longer finds any clones for CVE-2021-44228 and others

https://github.com/jensdietrich/xshady-release/tree/main/CVE-2021-44228 gives some results @jensdietrich already included in the originally submitted paper, showing multiple vulnerable cloned projects, e.g., com.guicedee.services__log4j-core__1.0.20.0-jre8. But my latest runs, on piccolo and my own Linux VM, find no results for this CVE. Where did these results go?

🚨 High priority 🚨

piccolo:

[whitewa@piccolo ~/code/shadedetector]$ zgrep POC_INSTANCE_TESTED_SHADED_unversioned runs/*/stats*.log|grep 44228
runs/22_all_except_CVE-2017-18349_and_CVE-2018-8017/stats123-CVE-2021-44228.log:POC_INSTANCE_TESTED_SHADED_unversioned=0
runs/fourth_run_mostly_successful/stats123-CVE-2021-44228.log:POC_INSTANCE_TESTED_SHADED_unversioned=0

Linux VM:

wtwhite@wtwhite-vuw-vm:~/code/shadedetector$ grep POC_INSTANCE_TESTED_SHADED_unversioned runs/*/stats*.log|grep 44228
runs/3_second_big_run_cache_should_be_warm/stats123-CVE-2021-44228.log:POC_INSTANCE_TESTED_SHADED_unversioned=0

Improve automatic updating of GHSA JSON files

Goal: Update GHSA JSON files automatically

As far as possible, we want to automate the work of creating PRs to modify existing GHSAs, since messing around with them manually is time-consuming. tools/update_ghsa.sh goes some way towards this already, but it doesn't know anything about which versions were tested and found not to be vulnerable, which is important as they become fixed entries in the GHSA ranges array. Determining this information will require grepping through shadedetector log files.

(Ideally we could automatically generate actual complete PRs, including standardised descriptions, via the gh pr create command -- but this is out of scope for now. We would need to automatically count dependent projects, among other things.)

Working around OSV Schema and GitHub limitations

GHSAs use the OSV Schema for representing what software is affected by a vulnerability. OSV Schema specifies 2 different ways to give information about which versions of a package are or aren't affected by the vuln:

  1. versions, a plain array of affected versions given as opaque strings.
  2. ranges, a list of version ranges that can also include information about which versions are known to be fixed.

There are 2 relevant limitations:

  1. OSV Schema has no way to indicate only that we have tested up to a certain version, without also specifying (or implying) that later versions are fixed:
    • fixed obviously indicates that the given version, and later versions, do not contain the vuln.
    • last_affected implies that the next version will be fixed. They acknowledge that this can lead to false negatives and thus recommend using fixed instead wherever possible.
    • limit sounds like it might be what we want, but apparently it isn't -- it's something for nonlinear git versions. Indeed, their pseudocode confirms that specifying a limit will cause that and later versions to be considered unaffected by the vuln, leading to false negatives just as for last_affected.
    • Simply leaving the range open-ended (that is, with an introduced event only) is probably the safest, since then the worst that can happen is a false positive (a version released later fixes the vuln, but it's still reported as vulnerable). But this records no information about which versions we actually tested.
  2. GitHub doesn't permit multiple versions in the versions array. The suggested workaround from a GitHub employee is to create a separate affected package entry for each version.

The upshot is:

  1. If a package has not been fixed, we have no unambiguous way to record that fact using ranges entries, so it's better to use a plain versions list -- but that isn't supported by GitHub, so we should instead make separate per-version affected package entries, each with a single-item versions array.
  2. If a package has been fixed, we should communicate this fact using a fixed entry in a ranges list (a sketch covering both cases follows this list). But this then leads to a surprisingly tricky question: How to specify a Maven version range?
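
A sketch of the resulting affected entries under this scheme (package names and version numbers are placeholders):

    "affected": [
      { "package": { "ecosystem": "Maven", "name": "com.example:unfixed-clone" },
        "versions": [ "1.0.1" ] },
      { "package": { "ecosystem": "Maven", "name": "com.example:unfixed-clone" },
        "versions": [ "1.0.2" ] },
      { "package": { "ecosystem": "Maven", "name": "com.example:fixed-clone" },
        "ranges": [ { "type": "ECOSYSTEM",
                      "events": [ { "introduced": "1.0" }, { "fixed": "2.0" } ] } ] }
    ]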

Computing "probably correct" Maven version ranges

We want to be able to automatically compute version ranges for GHSAs from the shadedetector results (that is, a partial map from version strings to { Present, Absent }). This is specified by OSV Schema for SEMVER versions, but Maven doesn't require semver, and in practice many packages don't use it (example), so we need to use the ECOSYSTEM range type instead, which leaves version comparisons up to the package manager in question (here, Maven). However it's not clear how Maven version ranges should be specified, since that depends on how Maven compares versions, and:

  • I can't find anywhere that Maven explicitly specifies how it compares versions (in fact it may not ever need to do so -- most of the time, it can treat a version as just an opaque string, and if it needs the latest or latest stable version because a versionId was not specified, it could just use the LATEST or RELEASE version, which are tracked in per-package maven-metadata.xml files)
  • Some pages indicate that Maven uses (where?) a convoluted scheme in which certain identifiers like alpha, beta, and milestone are treated specially (it seems there's truth to this).

Nor do GitHub's GHSA contribution guidelines specify how Maven version ranges should be given. (I also checked for open or closed issues that mention Maven -- but there are only 8, and none are relevant.)

On the bright side, from browsing a few maven-metadata.xml files (1, 2), they so far appear to contain all versions in the right order. This doesn't seem to be explicitly guaranteed anywhere (e.g., nothing about order is mentioned here, here or here); still, if this always holds, then there is a simple approach that probably does the right thing: just map each version string to its position in that file, and use that position for version comparisons. This is safe since we'll only ever need to deal with versions that are in this list.

Strategy:

  1. Download each package's maven-metadata.xml (or use a cached version) and check whether its <versions> "looks right":
    • Keep only the prefix of each version that consists of decimal numbers separated by periods, stripping off the rest (e.g., 1.2.3blah -> 1.2.3, 42-something -> 42)
    • Remove adjacent dupes (this turned out not to be necessary)
    • If the result doesn't change after sorting it according to "natural" sort order (like for semver, but generalised to any number of subfields), conclude that the version order in maven-metadata.xml "looks right", otherwise it "looks wrong".
  2. If it does "look right", then use it as the linear version ordering, otherwise complain loudly.

The "looks right" check is not perfect but will catch obvious cases like 2.10 being listed before 2.5.

Dealing with non-contiguous sets of tested versions

With the above linear version ordering, it's straightforward and correct to convert k alternating runs of vuln-present and vuln-absent versions into k fixed-terminated ranges, provided that all tested versions are contiguous. In case they are not contiguous, the safest approach is to flag this as an error, so that I can check the untested versions manually.

Summary of 3 runs: Lots of exceptions occur, especially when trying to download sources

Summary of looking for exceptions in logs with a broad filter, from the first 2 full shadedetector runs on my Linux VM and the last full run on piccolo (using slightly older code). Over half the exceptions result from not finding source code for an artifact.

All runs attempted to process all 29 CVEs currently in xshady, in sequence. Below I'll count an exception as "fatal" if it resulted in no stats.log being output for that CVE.

Linux VM run 1 (finished 05:22 19/9/2023)

  • 647 exceptions in total, 429 of them from not finding/extracting source code
  • All exceptions in stderr appear in regular logs
  • 12 fatal exceptions (see #30)
  • /home/wtwhite/code/shadedetector/runs/2_first_big_run on wtwhite-vuw-vm
wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/2_first_big_run$ grep -B 1 $'^\tat ' log*|grep -vE $'(^--|\tat )'|wc -l
647
wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/2_first_big_run$ grep -B 1 $'^\tat ' nohup.out|grep -vE $'(^--|\tat )'|wc -l
647
wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/2_first_big_run$ grep -h -B 1 $'^\tat ' log*|grep -vE $'(^--|\tat )'|perl -lpe 's/(IllegalStateException: no source code found for artifact) \S+/$1/; s/(IOException: failed to download resource from) \S+(.*)/$1 BLAH $2/;'|sort|uniq -c|sort -nr
    429 java.lang.IllegalStateException: no source code found for artifact
     47 java.io.IOException: failed to download resource from BLAH  , response code: 404 - 
     43 net.lingala.zip4j.exception.ZipException: illegal file name that breaks out of the target directory: /
     39 net.lingala.zip4j.exception.ZipException: java.lang.NullPointerException
     28 Caused by: java.net.SocketTimeoutException: timeout
     16 nz.ac.wgtn.shadedetector.ArtifactSearchException: java.io.IOException: failed to download resource from BLAH 
     16 Caused by: java.io.IOException: failed to download resource from BLAH 
     12 nz.ac.wgtn.shadedetector.ArtifactSearchException: java.net.SocketTimeoutException: timeout
      7 net.lingala.zip4j.exception.ZipException: illegal file name that breaks out of the target directory: ./
      4 Caused by: java.lang.NullPointerException: null
      4 
      1 net.lingala.zip4j.exception.ZipException: illegal file name that breaks out of the target directory: ../
      1 java.lang.IllegalStateException: null

Linux VM run 2 (finished 01:00 20/9/2023)

  • 864 exceptions in total, 654 of them from not finding/extracting source code
  • All exceptions in stderr appear in regular logs
  • 0 fatal exceptions 😅
  • /home/wtwhite/code/shadedetector/runs/3_second_big_run_cache_should_be_warm on wtwhite-vuw-vm
wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/3_second_big_run_cache_should_be_warm$ grep -B 1 $'^\tat ' log*|grep -vE $'(^--|\tat )'|wc -l
864
wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/3_second_big_run_cache_should_be_warm$ grep -B 1 $'^\tat ' nohup.out|grep -vE $'(^--|\tat )'|wc -l
864
wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/3_second_big_run_cache_should_be_warm$ grep -h -B 1 $'^\tat ' log*|grep -vE $'(^--|\tat )'|perl -lpe 's/(IllegalStateException: no source code found for artifact) \S+/$1/; s/(IOException: failed to download resource from) \S+(.*)/$1 BLAH $2/;'|sort|uniq -c|sort -nr
    654 java.lang.IllegalStateException: no source code found for artifact
     64 java.io.IOException: failed to download resource from BLAH  , response code: 404 - 
     43 net.lingala.zip4j.exception.ZipException: illegal file name that breaks out of the target directory: /
     39 net.lingala.zip4j.exception.ZipException: java.lang.NullPointerException
     26 
     10 net.lingala.zip4j.exception.ZipException: File header and local file header mismatch
      7 net.lingala.zip4j.exception.ZipException: illegal file name that breaks out of the target directory: ./
      5 nz.ac.wgtn.shadedetector.ArtifactSearchException: java.io.IOException: failed to download resource from BLAH 
      5 Caused by: java.net.SocketTimeoutException: timeout
      5 Caused by: java.io.IOException: failed to download resource from BLAH 
      4 Caused by: java.lang.NullPointerException: null
      1 net.lingala.zip4j.exception.ZipException: illegal file name that breaks out of the target directory: ../
      1 java.lang.IllegalStateException: null

Latest piccolo run (finished 13:18 17/9/2023)

  • 646 exceptions in total, 436 of them from not finding/extracting source code.
  • Logging misses 2 NPEs that occur during zip folder cleanup for CVE-2013-2186 and CVE-2014-0050 (fixed in #25)
  • 0 fatal exceptions, but note that CVE-2017-18349 and CVE-2018-8017 produced no stats.log due to #21
  • /home/whitewa/code/shadedetector/runs/22_all_except_CVE-2017-18349_and_CVE-2018-8017 on piccolo
[whitewa@piccolo ~/code/shadedetector/runs/22_all_except_CVE-2017-18349_and_CVE-2018-8017]$ zcat log*.log.gz|grep -B 1 $'^\tat '|grep -vE $'(^--|\tat )'|wc -l
644
[whitewa@piccolo ~/code/shadedetector/runs/22_all_except_CVE-2017-18349_and_CVE-2018-8017]$ cat nohup.out|grep -B 1 $'^\tat '|grep -vE $'(^--|\tat )'|wc -l
646
[whitewa@piccolo ~/code/shadedetector/runs/22_all_except_CVE-2017-18349_and_CVE-2018-8017]$ cat nohup.out|grep -h -B 1 $'^\tat '|grep -vE $'(^--|\tat )'|perl -lpe 's/(IllegalStateException: no source code found for artifact) \S+/$1/; s/(IOException: failed to download resource from) \S+(.*)/$1 BLAH $2/;'|sort|uniq -c|sort -nr
    436 java.lang.IllegalStateException: no source code found for artifact
     66 java.io.IOException: failed to download resource from BLAH  , response code: 404 - 
     43 net.lingala.zip4j.exception.ZipException: illegal file name that breaks out of the target directory: /
     38 net.lingala.zip4j.exception.ZipException: java.lang.NullPointerException
     26 
     11 net.lingala.zip4j.exception.ZipException: File header and local file header mismatch
      7 net.lingala.zip4j.exception.ZipException: illegal file name that breaks out of the target directory: ./
      2 nz.ac.wgtn.shadedetector.ArtifactSearchException: java.io.IOException: failed to download resource from BLAH 
      2 Exception in thread "Thread-1" java.lang.NullPointerException
      2 Caused by: java.net.SocketTimeoutException: timeout
      2 Caused by: java.io.IOException: failed to download resource from BLAH 
      1 net.lingala.zip4j.exception.ZipException: illegal file name that breaks out of the target directory: ../
      1 net.lingala.zip4j.exception.ZipException: Could not create directory: /local/scratch/whitewa/shadedetector/.cache/ziptmp5840283426658941218/android/app
      1 net.lingala.zip4j.exception.ZipException: Could not create directory: /local/scratch/whitewa/shadedetector/.cache/ziptmp5222881746826232742/android/app
      1 net.lingala.zip4j.exception.ZipException: Could not create directory: /local/scratch/whitewa/shadedetector/.cache/ziptmp4958033747685302589/android/app
      1 net.lingala.zip4j.exception.ZipException: Could not create directory: /local/scratch/whitewa/shadedetector/.cache/ziptmp4934135409634896396/android/app
      1 net.lingala.zip4j.exception.ZipException: Could not create directory: /local/scratch/whitewa/shadedetector/.cache/ziptmp245326471249188622/android/app
      1 net.lingala.zip4j.exception.ZipException: Could not create directory: /local/scratch/whitewa/shadedetector/.cache/ziptmp17050054986166925094/com/hazelcast
      1 net.lingala.zip4j.exception.ZipException: Could not create directory: /local/scratch/whitewa/shadedetector/.cache/ziptmp14112192028543659265/android/app
      1 net.lingala.zip4j.exception.ZipException: Could not create directory: /local/scratch/whitewa/shadedetector/.cache/ziptmp12499471128105862574/com/hazelcast
      1 java.lang.IllegalStateException: null
      1 Caused by: java.lang.NullPointerException: null

Notes:

  • The first grep looks for lines starting with <tab>at to try to find actual exceptions occurring, instead of classes with "exception" in their names.
  • nohup.out contains intermingled stdout+stderr of the complete runs

Next steps: TBD

composite reporters

We might want to use multiple reporters, e.g. an "aggregating reporter". The main issue is to support this in the CLI (Main). Idea: hardcode a few flags for different reporters, each with its own parameters. I.e. instead of just using something like -r csv?dest=output, use -r1 csv?dest=output -r2 summary?dest=output/summary

Colon-related errors when running `mvn package` on Windows

Running mvn package on Windows 11 with Oracle's JDK 17 and Maven 3.9.4 produces 30 identical-looking errors resulting from colons in filenames (which Windows forbids). The first occurs in the nz.ac.wgtn.shadedetector.clonedetection.ast.ASTCloneDetectionTests test suite:

[INFO] Running nz.ac.wgtn.shadedetector.clonedetection.ast.ASTCloneDetectionTests
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.004 s <<< FAILURE! - in nz.ac.wgtn.shadedetector.clonedetection.ast.ASTCloneDetectionTests
[ERROR] nz.ac.wgtn.shadedetector.clonedetection.ast.ASTCloneDetectionTests  Time elapsed: 0.004 s  <<< ERROR!
java.nio.file.InvalidPathException: Illegal char <:> at index 2: /C:/Users/walto/Documents/code/shadedetector/target/test-classes/commons-collections4-4.0
        at nz.ac.wgtn.shadedetector.clonedetection.ast.ASTCloneDetectionTests.setupPath(ASTCloneDetectionTests.java:26)

https://stackoverflow.com/questions/9834776/java-nio-file-path-issue seems to be a very similar issue -- it seems that something in Java is prepending a / (or sometimes \) to an absolute Windows pathname. That page suggests a variety of hacks for getting around it.

Discussed with @jensdietrich. He suggested creating a test utility (something like Util.getResourceAsPath()) and redirecting all those calls to it. Initially, this code would just do what is currently done; the actual changes would then be made in a second commit -- i.e., a "prefactoring". There may be such a utility already in CVE-2018-11771.

Low priority since my gut feeling is that there may be many further issues getting this to run on Windows (there was already #7, and now this).

12 of 29 CVEs failed due to REST API timeouts

"Failure" here means no stats.log file was generated:

wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/2_first_big_run$ for f in log*.log; do if [ \! -e stats${f##log} ]; then echo $f; fi; done
log116-CVE-2018-11771.log
log117-CVE-2018-1324.log
log118-CVE-2018-8017.log
log119-CVE-2019-0225.log
log120-CVE-2019-12402.log
log121-CVE-2020-1953.log
log122-CVE-2021-29425.log
log123-CVE-2021-44228.log
log125-CVE-2022-38749.log
log126-CVE-2022-38751.log
log127-CVE-2022-42889.log
log128-CVE-2022-45688.log

All 12 of them were caused by failures trying to fetch artifacts via the REST API:

wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/2_first_big_run$ grep -C 3 SocketTimeoutException `cat CVEs_missing_stats_logs.txt`
log116-CVE-2018-11771.log-2023-09-19 04:59:28,827 INFO [main] n.a.w.s.ArtifactSearch [ArtifactSearch.java:138] 	fetching batch 1/-1
log116-CVE-2018-11771.log-2023-09-19 04:59:28,846 INFO [main] n.a.w.s.ArtifactSearch [ArtifactSearch.java:150] 	search url: https://search.maven.org/solrsearch/select?q=g%3Aorg.apache.commons%20AND%20a%3Acommons-compress&core=gav&wt=json&rows=200&start=1
log116-CVE-2018-11771.log-2023-09-19 04:59:39,651 ERROR [main] n.a.w.s.Main [Main.java:254] cannot fetch artifacts for org.apache.commons:commons-compress
log116-CVE-2018-11771.log:nz.ac.wgtn.shadedetector.ArtifactSearchException: java.net.SocketTimeoutException: timeout
log116-CVE-2018-11771.log-	at nz.ac.wgtn.shadedetector.ArtifactSearch.getCachedOrFetchByGroupAndArtifactId(ArtifactSearch.java:158)
log116-CVE-2018-11771.log-	at nz.ac.wgtn.shadedetector.ArtifactSearch.findVersions(ArtifactSearch.java:91)
log116-CVE-2018-11771.log-	at nz.ac.wgtn.shadedetector.Main.main(Main.java:246)
log116-CVE-2018-11771.log:Caused by: java.net.SocketTimeoutException: timeout
log116-CVE-2018-11771.log-	at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.kt:675)
log116-CVE-2018-11771.log-	at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.kt:684)
log116-CVE-2018-11771.log-	at okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.kt:143)
--
--snip--

Proposal: Add 3 retries to these REST API requests, as is already done for class name searches.
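
A minimal sketch of such a retry wrapper around an OkHttp call (the retry count, class and method names are placeholders; the real change would go in ArtifactSearch):

    import java.io.IOException;
    import java.net.SocketTimeoutException;
    import okhttp3.OkHttpClient;
    import okhttp3.Request;
    import okhttp3.Response;

    class RetryingFetcher {
        // Retry the request up to 3 times on timeout, then rethrow the last failure.
        static Response fetchWithRetries(OkHttpClient client, String url) throws IOException {
            SocketTimeoutException lastFailure = null;
            for (int attempt = 1; attempt <= 3; attempt++) {
                try {
                    return client.newCall(new Request.Builder().url(url).build()).execute();
                } catch (SocketTimeoutException e) {
                    lastFailure = e; // could log and back off here
                }
            }
            throw lastFailure;
        }
    }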

caching pages

At the moment the cache for search-by-class results will be used if any results are present, even if the batch size or count is less than what is specified (e.g. other parameters were used when the cached query was made, or some requests for certain pages timed out while at least some succeeded). When checking whether to use the cache, we should check the overall number of cached results, and if it is insufficient, invalidate the cache. Just "topping up" might not work, as we might get the right numbers but with duplicates. So re-caching seems to be the easiest option.

Simultaneous runs on my Linux VM and `piccolo` find different vulnerable clone counts

  • Run 42 on my Linux VM, begun at Sep 29 11:54, finds 81 unversioned vulnerable clones
  • Run 68 on piccolo, begun at Sep 29 11:58, finds 86

Details of the CVEs with differences:

wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/42_rerun_all_with_better_Makefile$ for f in stats*.log; do ../../tools/compare_stats.pl $f ../../piccolo/$f; done 2>/dev/null |grep CONFIRMED_unversioned=
POV_INSTANCE_VULNERABILITY_CONFIRMED_unversioned=4	is best from ../../piccolo/stats116-CVE-2018-11771.log, second is 2 from stats116-CVE-2018-11771.log
POV_INSTANCE_VULNERABILITY_CONFIRMED_unversioned=9	is best from ../../piccolo/stats122-CVE-2021-29425.log, second is 7 from stats122-CVE-2021-29425.log
POV_INSTANCE_VULNERABILITY_CONFIRMED_unversioned=26	is best from ../../piccolo/stats128-CVE-2022-45688.log, second is 25 from stats128-CVE-2022-45688.log

I suspect the difference stems from nondeterminism in the ordering of classes done for the by-class search.

Large run-to-run variation on CVE-2018-10237

We get 22 vulnerable clones with the first run, 31 with the second! There was also a (normal, insignificant) difference on CVE-2018-8017:

There were no significant code differences between these 2 runs on my Linux VM -- just 210e8a3, which changes the JAVA_HOME but only for piccolo.

wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/42_rerun_all_with_better_Makefile$ for f in stats*; do cmp $f ../41_rerun_all_with_fixed_test_for_CVE-2019-0225/$f; done
stats115-CVE-2018-10237.log ../41_rerun_all_with_fixed_test_for_CVE-2019-0225/stats115-CVE-2018-10237.log differ: byte 15, line 1
stats118-CVE-2018-8017.log ../41_rerun_all_with_fixed_test_for_CVE-2019-0225/stats118-CVE-2018-8017.log differ: byte 17, line 1
wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/42_rerun_all_with_better_Makefile$ sdiff stats115-CVE-2018-10237.log ../41_rerun_all_with_fixed_test_for_CVE-2019-0225/stats115-CVE-2018-10237.log 
QUERY_RESULTS=3101					      |	QUERY_RESULTS=2852
QUERY_RESULTS_unversioned=333				      |	QUERY_RESULTS_unversioned=323
CONSOLIDATED_QUERY_RESULTS=2208				      |	CONSOLIDATED_QUERY_RESULTS=1633
CONSOLIDATED_QUERY_RESULTS_unversioned=256		      |	CONSOLIDATED_QUERY_RESULTS_unversioned=227
VALID_POM=2206						      |	VALID_POM=1221
VALID_POM_unversioned=255				      |	VALID_POM_unversioned=190
NO_DEPENDENCY_TO_VULNERABLE=1026			      |	NO_DEPENDENCY_TO_VULNERABLE=551
NO_DEPENDENCY_TO_VULNERABLE_unversioned=148		      |	NO_DEPENDENCY_TO_VULNERABLE_unversioned=114
SOURCES_FETCHED=919					      |	SOURCES_FETCHED=474
SOURCES_FETCHED_unversioned=142				      |	SOURCES_FETCHED_unversioned=111
SOURCES_EXPANDED=919					      |	SOURCES_EXPANDED=474
SOURCES_EXPANDED_unversioned=142			      |	SOURCES_EXPANDED_unversioned=111
SOURCES_HAVE_JAVA_FILES=874				      |	SOURCES_HAVE_JAVA_FILES=441
SOURCES_HAVE_JAVA_FILES_unversioned=129			      |	SOURCES_HAVE_JAVA_FILES_unversioned=101
CLONE_DETECTED=130					      |	CLONE_DETECTED=112
CLONE_DETECTED_unversioned=32				      |	CLONE_DETECTED_unversioned=30
POV_INSTANCE_COMPILED=130				      |	POV_INSTANCE_COMPILED=112
POV_INSTANCE_COMPILED_unversioned=32			      |	POV_INSTANCE_COMPILED_unversioned=30
POV_INSTANCE_SUREFIRE_REPORTS_GENERATED=56		      |	POV_INSTANCE_SUREFIRE_REPORTS_GENERATED=41
POV_INSTANCE_SUREFIRE_REPORTS_GENERATED_unversioned=18	      |	POV_INSTANCE_SUREFIRE_REPORTS_GENERATED_unversioned=16
POV_INSTANCE_VULNERABILITY_CONFIRMED=31			      |	POV_INSTANCE_VULNERABILITY_CONFIRMED=22
POV_INSTANCE_VULNERABILITY_CONFIRMED_unversioned=9	      |	POV_INSTANCE_VULNERABILITY_CONFIRMED_unversioned=6
POV_INSTANCE_VULNERABILITY_CONFIRMED_SHADED=12		      |	POV_INSTANCE_VULNERABILITY_CONFIRMED_SHADED=5
POV_INSTANCE_VULNERABILITY_CONFIRMED_SHADED_unversioned=3     |	POV_INSTANCE_VULNERABILITY_CONFIRMED_SHADED_unversioned=1
wtwhite@wtwhite-vuw-vm:~/code/shadedetector/runs/42_rerun_all_with_better_Makefile$ sdiff stats118-CVE-2018-8017.log ../41_rerun_all_with_fixed_test_for_CVE-2019-0225/stats118-CVE-2018-8017.log 
QUERY_RESULTS=946					      |	QUERY_RESULTS=943
QUERY_RESULTS_unversioned=56					QUERY_RESULTS_unversioned=56
CONSOLIDATED_QUERY_RESULTS=687					CONSOLIDATED_QUERY_RESULTS=687
CONSOLIDATED_QUERY_RESULTS_unversioned=38			CONSOLIDATED_QUERY_RESULTS_unversioned=38
VALID_POM=685							VALID_POM=685
VALID_POM_unversioned=38					VALID_POM_unversioned=38
NO_DEPENDENCY_TO_VULNERABLE=532					NO_DEPENDENCY_TO_VULNERABLE=532
NO_DEPENDENCY_TO_VULNERABLE_unversioned=30			NO_DEPENDENCY_TO_VULNERABLE_unversioned=30
SOURCES_FETCHED=523						SOURCES_FETCHED=523
SOURCES_FETCHED_unversioned=29					SOURCES_FETCHED_unversioned=29
SOURCES_EXPANDED=523						SOURCES_EXPANDED=523
SOURCES_EXPANDED_unversioned=29					SOURCES_EXPANDED_unversioned=29
SOURCES_HAVE_JAVA_FILES=393					SOURCES_HAVE_JAVA_FILES=393
SOURCES_HAVE_JAVA_FILES_unversioned=27				SOURCES_HAVE_JAVA_FILES_unversioned=27
CLONE_DETECTED=117						CLONE_DETECTED=117
CLONE_DETECTED_unversioned=4					CLONE_DETECTED_unversioned=4
POV_INSTANCE_COMPILED=116					POV_INSTANCE_COMPILED=116
POV_INSTANCE_COMPILED_unversioned=4				POV_INSTANCE_COMPILED_unversioned=4
POV_INSTANCE_SUREFIRE_REPORTS_GENERATED=96			POV_INSTANCE_SUREFIRE_REPORTS_GENERATED=96
POV_INSTANCE_SUREFIRE_REPORTS_GENERATED_unversioned=3		POV_INSTANCE_SUREFIRE_REPORTS_GENERATED_unversioned=3
POV_INSTANCE_VULNERABILITY_CONFIRMED=17				POV_INSTANCE_VULNERABILITY_CONFIRMED=17
POV_INSTANCE_VULNERABILITY_CONFIRMED_unversioned=1		POV_INSTANCE_VULNERABILITY_CONFIRMED_unversioned=1
POV_INSTANCE_VULNERABILITY_CONFIRMED_SHADED=0			POV_INSTANCE_VULNERABILITY_CONFIRMED_SHADED=0
POV_INSTANCE_VULNERABILITY_CONFIRMED_SHADED_unversioned=0	POV_INSTANCE_VULNERABILITY_CONFIRMED_SHADED_unversioned=0

Check the logs to see what happened. Hypothesis: Some by-class query was failing before, and then stopped failing.

OutOfMemoryError on 4 xshady PoVs

4 CVEs hit OutOfMemoryError:

CVE-2015-6420
CVE-2015-6748
CVE-2015-7501
CVE-2016-2510

The stack trace in each case looks like:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at jdk.zipfs/jdk.nio.zipfs.ZipPath.resolve(ZipPath.java:310)
--snip--
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
at nz.ac.wgtn.shadedetector.Utils.listContent(Utils.java:132)
at nz.ac.wgtn.shadedetector.Utils.listJavaSources(Utils.java:108)
at nz.ac.wgtn.shadedetector.Main.main(Main.java:348)

(Note: These stack traces appear on stderr or possibly stdout, but not in any log specified with -log.)

@jensdietrich mentions hitting an issue like this earlier in Blacklist, and resolving it with JDK 11 or 17. From a comment there:

        // outofmemory when extracting from jar -- check for zip bomb or similar
        // might be related to https://bugs.openjdk.org/browse/JDK-7143743
        // TODO investigate whether this only happends for certain JDK versions (11, 17 seems to be fine)
        return artifact.getGroupId().equals("dev.dejvokep") ;
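
A mitigation that does not depend on the JDK version would be to sanity-check each archive before listing its contents and skip anything whose declared uncompressed size is implausibly large. A minimal sketch, not the actual Utils.listContent implementation (the 512 MB cap is an arbitrary assumption):

    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    public class ZipBombGuard {

        // Arbitrary cap; tune to the largest legitimate source jar expected.
        private static final long MAX_UNCOMPRESSED_BYTES = 512L * 1024 * 1024;

        // Throws if the total declared uncompressed size of all entries exceeds the
        // cap. Declared sizes come from the zip's central directory and can be
        // forged, so a fully robust guard would also count bytes while reading.
        public static void checkArchive(Path zip) throws IOException {
            long total = 0;
            try (ZipFile zf = new ZipFile(zip.toFile())) {
                Enumeration<? extends ZipEntry> entries = zf.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = entries.nextElement();
                    long declared = entry.getSize(); // -1 if unknown
                    if (declared < 0 || (total += declared) > MAX_UNCOMPRESSED_BYTES) {
                        throw new IOException("possible zip bomb, skipping: " + zip);
                    }
                }
            }
        }
    }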

CVE-2017-18349 fails analysis due to too few versions of the original artifact being retrieved

In a recent run of 29 CVEs, CVE-2017-18349 failed to produce any stats113-CVE-2017-18349.log file. This was also the case for all previous runs, including the first, whose complete log output was:

2023-09-09 11:44:37,065 INFO [main] n.a.w.s.Main [Main.java:115] file log appender set up, log file is: /home/whitewa/code/shadedetector/log113-CVE-2017-18349.log
2023-09-09 11:44:37,068 INFO [main] n.a.w.s.Main [Main.java:514] using clone detector: ast
2023-09-09 11:44:37,069 INFO [main] n.a.w.s.Main [Main.java:514] using class selector: complexnames
2023-09-09 11:44:37,069 INFO [main] n.a.w.s.Main [Main.java:514] using result consolidation strategy: moreThanOne
2023-09-09 11:44:37,069 INFO [main] n.a.w.s.Main [Main.java:514] using result reporter: log
2023-09-09 11:44:37,069 ERROR [main] n.a.w.s.Main [Main.java:163] progress stats will be written to /home/whitewa/code/shadedetector/stats.log
2023-09-09 11:44:37,337 INFO [main] n.a.w.s.ArtifactSearch [ArtifactSearch.java:145]    fetching batch 1/1
2023-09-09 11:44:37,363 INFO [main] n.a.w.s.ArtifactSearch [ArtifactSearch.java:157]    search url: https://search.maven.org/solrsearch/select?q=g%3Acom.alibaba%20AND%20a%3Afastjson&core=gav&wt=json&rows=200&start=1
2023-09-09 11:44:45,486 INFO [main] n.a.w.s.ArtifactSearch [ArtifactSearch.java:169]    response code is: 200
2023-09-09 11:44:45,504 INFO [main] n.a.w.s.ArtifactSearch [ArtifactSearch.java:174]    caching data in /home/whitewa/code/shadedetector/.cache/artifacts-versions/com.alibaba:fastjson-1.json
2023-09-09 11:44:45,612 INFO [main] n.a.w.s.ArtifactSearch [ArtifactSearch.java:94]     200 versions found of "com.alibaba:fastjson
2023-09-09 11:44:45,613 ERROR [main] n.a.w.s.Main [Main.java:182] cannot locate artifacts for com.alibaba:fastjson

The PoV is com.alibaba:fastjson:1.2.24, and it turns out that there are 331 versions of this in the Maven Central Repo, but 1.2.24 is not among the 200 versions retrieved by ArtifactSearch.findVersions():

[whitewa@piccolo ~/code/xshady/CVE-2017-18349]$ jq .response.numFound /local/scratch/whitewa/shadedetector/.cache/artifacts-versions/com.alibaba:fastjson-1.json
331
[whitewa@piccolo ~/code/xshady/CVE-2017-18349]$ jq .response.docs[].v < /local/scratch/whitewa/shadedetector/.cache/artifacts-versions/com.alibaba:fastjson-1.json|grep -F 1.2.2
"1.2.2.sec10"
"1.2.29.sec10"
"1.2.25.sec10"
"1.2.27.sec10"
"1.2.27.sec09"
"1.2.29.sec09"
"1.2.27.sec06"
"1.2.29.sec06"
"1.2.29.sec04"

Root cause: ArtifactSearch.findVersions() is only ever called with a batch count of 1 when looking up versions of the original PoV artifact, so it retrieves at most a single batch of 200 rows. An artifact with more than 200 versions, like fastjson, can therefore have its PoV version fall outside the batch that is fetched.
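
The fix would be to page through every result batch instead of stopping after the first. A minimal sketch of the pagination loop against the same Solr endpoint seen in the log above (SearchPage and fetchPage are hypothetical placeholders, not the real ArtifactSearch API, and real code would URL-encode the query):

    // Keep requesting batches, advancing `start` by the page size,
    // until all `numFound` documents have been seen.
    static List<String> findAllVersions(String groupId, String artifactId) throws IOException {
        final int rows = 200;
        List<String> versions = new ArrayList<>();
        long numFound;
        int start = 0;
        do {
            String url = String.format(
                "https://search.maven.org/solrsearch/select?q=g:%s AND a:%s&core=gav&wt=json&rows=%d&start=%d",
                groupId, artifactId, rows, start);
            SearchPage page = fetchPage(url); // hypothetical: HTTP GET + JSON parse
            numFound = page.numFound();       // response.numFound in the JSON
            versions.addAll(page.versions()); // one entry per response.docs[].v
            start += rows;
        } while (start < numFound);
        return versions;
    }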

Possible clone missed for CVE-2019-0225

Based on the following log output, I would not be surprised if com.liferay:com.liferay.wiki.engine.jspwiki:2.0.3 is indeed a clone of some part of org.apache.jspwiki:jspwiki-main, and vulnerable:

wtwhite@wtwhite-vuw-vm:~/code/shadedetector/results$ grep 'analysing whether artifact' log119-CVE-2019-0225.log|grep -v apache
2023-09-28 22:49:04,701 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.engine.jspwiki:2.0.0 matches
2023-09-28 22:49:04,724 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.1.1 matches
2023-09-28 22:49:04,773 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.8 matches
2023-09-28 22:49:04,821 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.0.4 matches
2023-09-28 22:49:10,391 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.4 matches
2023-09-28 22:49:10,463 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.9 matches
2023-09-28 22:49:10,484 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.2 matches
2023-09-28 22:49:10,503 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.1.2 matches
2023-09-28 22:49:10,522 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.engine.jspwiki:2.0.3 matches
2023-09-28 22:49:10,529 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.10 matches
2023-09-28 22:49:10,548 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.1.0 matches
2023-09-28 22:49:10,567 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.engine.jspwiki:1.0.0 matches
2023-09-28 22:49:10,574 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.1 matches
2023-09-28 22:49:10,593 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.3 matches
2023-09-28 22:49:10,612 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.12 matches
2023-09-28 22:49:10,632 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.1.3 matches
2023-09-28 22:49:10,682 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.engine.jspwiki:2.0.4 matches
2023-09-28 22:49:10,690 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.6 matches
2023-09-28 22:49:12,394 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.engine.jspwiki:1.0.1 matches
2023-09-28 22:49:12,402 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.0 matches
2023-09-28 22:49:12,452 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.engine.jspwiki:2.0.2 matches
2023-09-28 22:49:14,228 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.7 matches
2023-09-28 22:49:14,259 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.11 matches
2023-09-28 22:49:14,278 INFO [main] n.a.w.s.Main [Main.java:438] analysing whether artifact com.liferay:com.liferay.wiki.service:1.2.5 matches

The strongest evidence I have found is that Sonatype's OSSIndex already lists this artifact as being vulnerable to CVE-2019-0225 (this requires setting up an account).
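
For reference, OSS Index can also be queried programmatically. Assuming its public v3 REST API, which takes Package URL coordinates, a query like the following should return the known vulnerabilities for the artifact:

    curl -s -X POST https://ossindex.sonatype.org/api/v3/component-report \
         -H 'Content-Type: application/json' \
         -d '{"coordinates": ["pkg:maven/com.liferay/com.liferay.wiki.engine.jspwiki@2.0.3"]}'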

The original fix, apache/jspwiki@88d89d6, modified a class called DefaultURLConstructor; the binary jar for com.liferay:com.liferay.wiki.engine.jspwiki:2.0.3 contains a class by this name:

wtwhite@wtwhite-vuw-vm:~/code/shadedetector/TEST_LIFERAY_CLASSES$ unzip -l com.liferay.wiki.engine.jspwiki-2.0.3.jar |grep DefaultURLConstructor
     4334  2010-05-09 15:09   com/ecyrd/jspwiki/url/DefaultURLConstructor.class

but the source jar does not.

This suggested that the DefaultURLConstructor class must have been included in the binary jar via a dependency, so I also checked the dependencies in the POM: for each of the following "likely" dependencies, I made a dummy Maven project that built an assembly jar using the maven-assembly-plugin (a minimal sketch of such a dummy POM follows the list):

  • "Likely" dependencies tried:
    • com.liferay:com.liferay.wiki.api:2.0.0
    • com.liferay:com.liferay.wiki.engine.input.editor.common:2.0.0
    • com.liferay.portal:com.liferay.portal.impl:2.0.0
    • com.liferay.portal:com.liferay.portal.kernel:2.0.0
    • com.liferay.portal:com.liferay.util.taglib:2.0.0
    • opensymphony:oscache:2.4.1
  • "Unlikely" dependencies ignore:
    • javax.portlet:portlet-api:2.0
    • javax.servlet:javax.servlet-api:3.0.1
    • org.osgi:org.osgi.service.component.annotations:1.3.0
    • oro:oro:2.0.8
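
Such a dummy project presumably amounts to little more than a POM with the single "likely" dependency plus the assembly plugin's jar-with-dependencies descriptor. A minimal sketch, shown for the first dependency (the dummy project's own coordinates are arbitrary):

    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <modelVersion>4.0.0</modelVersion>
      <groupId>dummy</groupId>
      <artifactId>dep-check</artifactId>
      <version>1.0</version>
      <dependencies>
        <dependency>
          <groupId>com.liferay</groupId>
          <artifactId>com.liferay.wiki.api</artifactId>
          <version>2.0.0</version>
        </dependency>
      </dependencies>
      <build>
        <plugins>
          <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
              <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
              </descriptorRefs>
            </configuration>
            <executions>
              <execution>
                <phase>package</phase>
                <goals><goal>single</goal></goals>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </project>

After mvn package, unzip -l target/dep-check-1.0-jar-with-dependencies.jar | grep DefaultURLConstructor shows whether the class is pulled in transitively.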

However, none of the resulting assembly jars contained a DefaultURLConstructor class, suggesting that some mechanism external to Maven was used to pull in either the source or the already-built class. It may be possible to get further by looking through the GitHub URL mentioned in the POM, but for now I've spent enough time on this.

Low priority.
