simding / jplag

This project forked from jplag/jplag

Detecting Software Plagiarism and Collusion since 1996.

License: GNU General Public License v3.0

Java 81.62% HTML 3.86% GAP 8.01% ANTLR 4.98% Scheme 1.49% JavaScript 0.04%

Readme

This CLI has new options:

Clustering:
  --cluster-skip         Skips the clustering (Standard: false)
  --cluster-alg {AGGLOMERATIVE,SPECTRAL}
                         Which clustering algorithm to use. Agglomerative merges similar submissions bottom up. Spectral clustering is combined with Bayesian Optimization to execute the k-Means clustering
                         algorithm multiple times, hopefully finding a "good" clustering automatically. (Standard: SPECTRAL)
  --cluster-metric {AVG,MIN,MAX,INTERSECTION}
                         The metric used for clustering. AVG is Dice's coefficient, MAX is the overlap coefficient and can prevent some methods of obfuscation. (Standard: MAX)
  --cluster-spectral-bandwidth bandwidth
                         Bandwidth of the Matérn kernel in the Gaussian Process used during the search for a good number of clusters for spectral clustering. If a good clustering result is found during the
                         search, numbers of clusters that differ by something in range of the bandwidth are also expected to be good. (Standard: 20.0)
  --cluster-spectral-noise noise
                         The result of each run in the search for good clusterings is random. The noise level models the variance in the "worth" of these results. It also acts as a regularization constant.
                         (Standard: 0.0025000002)
  --cluster-spectral-min-runs min
                         Minimum number of k-Means executions during spectral clustering. These initial runs are used to explore clustering sizes. (Standard: 5)
  --cluster-spectral-max-runs max
                         Maximum number of k-Means executions during spectral clustering. Any execution after the initial runs tries to balance between exploration of unknown clustering sizes and exploitation
                         of clustering sizes known as good. (Standard: 50)
  --cluster-spectral-kmeans-interations iterations
                         Maximum number of iterations during each execution of the k-Means algorithm. (Standard: 200)
  --cluster-agglomerative-threshold threshold
                         Only clusters with an inter-cluster-similarity greater than this threshold are merged during agglomerative clustering. (Standard: 0.2)
  --cluster-agglomerative-inter-cluster-similarity {MIN,MAX,AVERAGE}
                         How to measure the similarity of two clusters during agglomerative clustering. Minimum, maximum or average similarity between the submissions in each cluster. (Standard: AVERAGE)

Clustering - Preprocessing:
  --cluster-pp-none      Do not use any preprocessing before clustering. Not recommended for spectral clustering. (Standard: false)
  --cluster-pp-cdf       Before clustering, the value of the cumulative distribution function of all similarities is estimated. The similarities are multiplied with these estimates. This has the effect of
                         suppressing similarities that are low compared to other similarities. (Standard: false)
  --cluster-pp-percentile percentile
                         Any similarity smaller than the given percentile will be suppressed during clustering.
  --cluster-pp-threshold threshold
                         Any similarity smaller than the given threshold value will be suppressed during clustering.

Clustering

By default, JPlag is configured to perform a clustering of the submissions.
The clustering partitions the set of submissions into groups of similar submissions.
The clusters found can be used as candidates for potentially colluding groups. Each cluster has a strength score that measures how suspicious the cluster is compared to other clusters.

Disabling Clustering

Clustering can take a long time when there are many submissions.
Users who are not interested in the clustering can safely disable it:

Using the CLI: pass the --cluster-skip option.

Programmatically:

JPlagOptions options = new JPlagOptions("/path/to/rootDir", LanguageOption.JAVA);
options.setClusteringOptions(new ClusteringOptions.Builder().enabled(false).build());

JPlag jplag = new JPlag(options);

Clustering Configuration

Clustering can be configured either with the CLI options or programmatically with the ClusteringOptions class. Both work analogously and share the same default values.

The clustering is designed to work out of the box for roughly 50-500 submissions, but it can be tuned when problems occur. For more submissions it might be necessary to increase Max-Runs or Bandwidth so that an appropriate number of clusters can be determined.

Group	Option	Description	Default
General	Enable	Controls whether the clustering is run at all.	`true`
General	Algorithm	Which clustering algorithm to use. Agglomerative Clustering: iteratively merges similar submissions bottom up; it usually requires manual tuning of its parameters to yield helpful clusters. Spectral Clustering: combined with Bayesian Optimization to execute the k-Means clustering algorithm multiple times, hopefully finding a "good" clustering automatically; its default parameters should work acceptably in most cases.	Spectral Clustering
General	Metric	The similarity score between submissions to use during clustering. Each score is expressed in terms of the sizes of the submissions `A` and `B` and the size of their matched intersection `A ∩ B`. AVG (a.k.a. Dice's coefficient): `AVG = 2 * (A ∩ B) / (A + B)`. MAX (a.k.a. overlap coefficient): `MAX = (A ∩ B) / min(A, B)`; compared to AVG, this prevents obfuscation when a collaborator bloats their submission with unrelated code. MIN (deprecated): `MIN = (A ∩ B) / max(A, B)`. INTERSECTION (experimental): `INTERSECTION = A ∩ B`.	MAX
Spectral	Bandwidth	For Spectral Clustering, Bayesian Optimization is used to determine a fitting number of clusters. If a good clustering result is found during the search, numbers of clusters that differ by something in range of the bandwidth are also expected to be good. Low values result in more exploration of the search space, high values in more exploitation of known results.	20.0
Spectral	Noise	The result of each k-Means run in the search for good clusterings is random. The noise level models the variance in the "worth" of these results. It also acts as a regularization constant.	0.0025
Spectral	Min-Runs	Minimum number of k-Means executions for spectral clustering. With these initial runs clustering sizes are explored.	5
Spectral	Max-Runs	Maximum number of k-Means executions during spectral clustering. Any execution after the initial (min-) runs tries to balance between exploration of unknown clustering sizes and exploitation of clustering sizes known as good.	50
Spectral	K-Means Iterations	Maximum number of iterations during each execution of the k-Means algorithm.	200
Agglomerative	Threshold	Only clusters with an inter-cluster-similarity greater than this threshold are merged during agglomerative clustering.	0.2
Agglomerative	Inter-Cluster-Similarity	How to measure the similarity of two clusters during agglomerative clustering. MIN (a.k.a. complete-linkage): clusters are merged if all their submissions are similar. MAX (a.k.a. single-linkage): clusters are merged if there is a similar submission in both. AVERAGE (a.k.a. average-linkage): clusters are merged if their submissions are similar on average.	AVERAGE
Preprocessing	Pre-Processor	How the similarities are preprocessed prior to clustering. Spectral Clustering will probably not have good results without it. None: no preprocessing. Cumulative Distribution Function (CDF): before clustering, the value of the cumulative distribution function of all similarities is estimated, and the similarities are multiplied with these estimates; this has the effect of suppressing similarities that are low compared to other similarities. Percentile: any similarity smaller than the given percentile will be suppressed during clustering. Threshold: any similarity smaller than the given threshold will be suppressed during clustering.	CDF
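
The formulas in the Metric row can be tried out with a small standalone sketch. It is not JPlag code: the class SimilarityMetrics and its method names are made up for illustration, and the sizes are assumed to be plain token counts. The example shows why the overlap coefficient (MAX) is harder to fool by padding a copied submission with unrelated code than Dice's coefficient (AVG).

```java
/** Illustrative only; these names are hypothetical and not part of JPlag. */
final class SimilarityMetrics {

    /** AVG (Dice's coefficient): 2 * |A ∩ B| / (|A| + |B|). */
    static double avg(int sizeA, int sizeB, int matched) {
        return 2.0 * matched / (sizeA + sizeB);
    }

    /** MAX (overlap coefficient): |A ∩ B| / min(|A|, |B|). */
    static double max(int sizeA, int sizeB, int matched) {
        return (double) matched / Math.min(sizeA, sizeB);
    }

    /** MIN: |A ∩ B| / max(|A|, |B|). */
    static double min(int sizeA, int sizeB, int matched) {
        return (double) matched / Math.max(sizeA, sizeB);
    }

    /** INTERSECTION: the raw size of the matched part. */
    static double intersection(int matched) {
        return matched;
    }

    public static void main(String[] args) {
        // Assumed scenario: submission A has 100 tokens; B is a full copy of A
        // (100 matched tokens) padded with 300 tokens of unrelated code.
        System.out.println(avg(100, 400, 100)); // 0.4
        System.out.println(max(100, 400, 100)); // 1.0
        System.out.println(min(100, 400, 100)); // 0.25
    }
}
```

In this scenario the padding pulls AVG down to 0.4, while MAX still reports a complete overlap.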

Contributing to JPlag

Clustering

All clustering-related classes are contained within the de.jplag.clustering(.*) packages.

The structure of the clustering code is built around ease of use: calling code should only ever interact with the ClusteringOptions, ClusteringFactory, and ClusteringResult classes.

New clustering algorithms and preprocessors can be implemented using the GenericClusteringAlgorithm and ClusteringPreprocessor interfaces which operate on similarity matrices only. ClusteringAdapter handles the conversion between de.jplag classes and matrices. PreprocessedClusteringAlgorithm adds a preprocessor onto another ClusteringAlgorithm.
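
To give a feel for what "operating on similarity matrices only" can look like, here is a minimal sketch of two preprocessors. The MatrixPreprocessor interface below is a hypothetical stand-in and not the actual ClusteringPreprocessor signature, which may differ. Two further assumptions: "suppressing" a similarity means setting it to zero, and the CDF is estimated empirically from the off-diagonal entries of a symmetric matrix.

```java
/** Hypothetical stand-in for illustration; JPlag's real ClusteringPreprocessor may look different. */
interface MatrixPreprocessor {
    double[][] preprocess(double[][] similarity);
}

/** Assumption: a similarity below the threshold is suppressed by setting it to zero. */
final class ThresholdPreprocessor implements MatrixPreprocessor {
    private final double threshold;

    ThresholdPreprocessor(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public double[][] preprocess(double[][] similarity) {
        int n = similarity.length;
        double[][] result = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double s = similarity[i][j];
                result[i][j] = s < threshold ? 0.0 : s;
            }
        }
        return result;
    }
}

/** Multiplies each off-diagonal similarity by an empirical CDF estimate; the diagonal is copied unchanged. */
final class CdfPreprocessor implements MatrixPreprocessor {
    @Override
    public double[][] preprocess(double[][] similarity) {
        int n = similarity.length;
        double[][] result = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double s = similarity[i][j];
                result[i][j] = (i == j) ? s : s * empiricalCdf(similarity, s);
            }
        }
        return result;
    }

    /** Fraction of off-diagonal similarities that are less than or equal to s (not optimized). */
    private static double empiricalCdf(double[][] similarity, double s) {
        int n = similarity.length;
        int total = 0;
        int atMost = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) {
                    continue;
                }
                total++;
                if (similarity[i][j] <= s) {
                    atMost++;
                }
            }
        }
        return total == 0 ? 0.0 : (double) atMost / total;
    }
}
```

Because both classes only see a double[][], they can be tested without any submissions or comparisons, which is presumably the benefit of keeping the conversion to and from de.jplag classes inside an adapter.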

Remarks on Spectral Clustering

Based on "On Spectral Clustering: Analysis and an Algorithm" (Ng, Jordan & Weiss, 2001).
Automatic hyper-parameter search uses Bayesian Optimization with a Gaussian Process as the surrogate model and L-BFGS for optimization on the surrogate.
The L-BFGS implementation is a pit of technical debt, see here.

Integration Tests

There are integration tests for the Spectral Clustering that verify that, at least for two known sets of similarities, the groups known to be colluders are found. However, these datasets are considered sensitive data. They are not available to the public, and the tests can only be run by maintainers with access.

To run these tests, the contents of the PseudonymizedReports repository must be added to the folder jplag/src/test/resources/de/jplag/PseudonymizedReports.
