simding / jplag
This project forked from jplag/jplag
Detecting Software Plagiarism and Collusion since 1996.
License: GNU General Public License v3.0
This CLI has new options:
Clustering:
--cluster-skip Skips the clustering (Standard: false)
--cluster-alg {AGGLOMERATIVE,SPECTRAL}
Which clustering algorithm to use. Agglomerative merges similar submissions bottom up. Spectral clustering is combined with Bayesian Optimization to execute the k-Means clustering
algorithm multiple times, hopefully finding a "good" clustering automatically. (Standard: SPECTRAL)
--cluster-metric {AVG,MIN,MAX,INTERSECTION}
The metric used for clustering. AVG is Dice's coefficient, MAX is the overlap coefficient and can prevent some methods of obfuscation. (Standard: MAX)
--cluster-spectral-bandwidth bandwidth
Bandwidth of the Matérn kernel in the Gaussian Process used during the search for a good number of clusters for spectral clustering. If a good clustering result is found during the
search, numbers of clusters that differ by something in range of the bandwidth are also expected to be good. (Standard: 20.0)
--cluster-spectral-noise noise
The result of each run in the search for good clusterings is random. The noise level models the variance in the "worth" of these results. It also acts as a regularization constant.
(Standard: 0.0025000002)
--cluster-spectral-min-runs min
Minimum number of k-Means executions during spectral clustering. With these initial runs, clustering sizes are explored. (Standard: 5)
--cluster-spectral-max-runs max
Maximum number of k-Means executions during spectral clustering. Any execution after the initial runs tries to balance between exploration of unknown clustering sizes and exploitation
of clustering sizes known as good. (Standard: 50)
--cluster-spectral-kmeans-interations iterations
Maximum number of iterations during each execution of the k-Means algorithm. (Standard: 200)
--cluster-agglomerative-threshold threshold
Only clusters with an inter-cluster-similarity greater than this threshold are merged during agglomerative clustering. (Standard: 0.2)
--cluster-agglomerative-inter-cluster-similarity {MIN,MAX,AVERAGE}
How to measure the similarity of two clusters during agglomerative clustering. Minimum, maximum or average similarity between the submissions in each cluster. (Standard: AVERAGE)
Clustering - Preprocessing:
--cluster-pp-none Do not use any preprocessing before clustering. Not recommended for spectral clustering. (Standard: false)
--cluster-pp-cdf Before clustering, the value of the cumulative distribution function of all similarities is estimated. The similarities are multiplied with these estimates. This has the effect of
suppressing similarities that are low compared to other similarities. (Standard: false)
--cluster-pp-percentile percentile
Any similarity smaller than the given percentile will be suppressed during clustering.
--cluster-pp-threshold threshold
Any similarity smaller than the given threshold value will be suppressed during clustering.
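The preprocessing options above can be illustrated with a minimal sketch. It is not JPlag's implementation: the class name is hypothetical, "suppressed" is assumed to mean "set to zero", and the cumulative distribution function is assumed to be estimated empirically as the fraction of pairwise similarities that are less than or equal to a given value. The input is assumed to be a symmetric similarity matrix.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of JPlag: sketches of similarity preprocessing. */
final class PreprocessingSketch {

    /** Threshold preprocessing: similarities below the threshold are set to zero (assumption). */
    static double[][] threshold(double[][] sim, double threshold) {
        int n = sim.length;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                out[i][j] = sim[i][j] >= threshold ? sim[i][j] : 0.0;
            }
        }
        return out;
    }

    /** CDF preprocessing: each off-diagonal similarity is multiplied by its empirical CDF value. */
    static double[][] cdf(double[][] sim) {
        int n = sim.length;
        // Collect every unordered pair (i < j) exactly once.
        double[] values = new double[n * (n - 1) / 2];
        int k = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                values[k++] = sim[i][j];
            }
        }
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) {
                    continue; // the diagonal is left at zero in this sketch
                }
                out[i][j] = sim[i][j] * empiricalCdf(values, sim[i][j]);
            }
        }
        return out;
    }

    /** Fraction of the given values that are less than or equal to x. */
    static double empiricalCdf(double[] values, double x) {
        int count = 0;
        for (double v : values) {
            if (v <= x) {
                count++;
            }
        }
        return (double) count / values.length;
    }

    public static void main(String[] args) {
        double[][] sim = {
            {0.0, 0.9, 0.1},
            {0.9, 0.0, 0.3},
            {0.1, 0.3, 0.0}
        };
        System.out.println(Arrays.deepToString(threshold(sim, 0.2)));
        System.out.println(Arrays.deepToString(cdf(sim)));
    }
}
```

In this example the strongest similarity (0.9) keeps its value because its CDF estimate is 1, while the weakest (0.1) is scaled by 1/3, which illustrates how low similarities are damped relative to high ones.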
By default, JPlag is configured to perform a clustering of the submissions.
The clustering partitions the set of submissions into groups of similar submissions.
The clusters found can be used as candidates for potentially colluding groups. Each cluster has a strength score that measures how suspicious the cluster is compared to other clusters.
Clustering can take a long time when there are many submissions.
Users who are not interested in the clustering can safely disable it with the --cluster-skip option, or programmatically:
JPlagOptions options = new JPlagOptions("/path/to/rootDir", LanguageOption.JAVA);
options.setClusteringOptions(new ClusteringOptions.Builder().enabled(false).build());
JPlag jplag = new JPlag(options);
Clustering can be configured either using the CLI options or programmatically using the ClusteringOptions class. Both work analogously and share the same default values.
The clustering is designed to work out of the box for about 50 to 500 submissions, but it can be tweaked when problems occur. For more submissions it might be necessary to increase Max-Runs or Bandwidth, so that an appropriate number of clusters can be determined.
Group | Option | Description | Default |
---|---|---|---|
General | Enable | Controls whether the clustering is run at all. | true |
General | Algorithm | Which clustering algorithm to use: Agglomerative Clustering or Spectral Clustering. | Spectral Clustering |
General | Metric | The similarity score between submissions to use during clustering: AVG (Dice's coefficient), MIN, MAX (overlap coefficient), or INTERSECTION. Each score is expressed in terms of the size of the submissions A and B and the size of their matched intersection A ∩ B. | MAX |
Spectral | Bandwidth | For Spectral Clustering, Bayesian Optimization is used to determine a fitting number of clusters. If a good clustering result is found during the search, numbers of clusters that differ by something in range of the bandwidth are also expected to be good. Low values result in more exploration of the search space, high values in more exploitation of known results. | 20.0 |
Spectral | Noise | The result of each k-Means run in the search for good clusterings is random. The noise level models the variance in the "worth" of these results. It also acts as a regularization constant. | 0.0025 |
Spectral | Min-Runs | Minimum number of k-Means executions for spectral clustering. With these initial runs, clustering sizes are explored. | 5 |
Spectral | Max-Runs | Maximum number of k-Means executions during spectral clustering. Any execution after the initial (min-) runs tries to balance between exploration of unknown clustering sizes and exploitation of clustering sizes known as good. | 50 |
Spectral | K-Means Iterations | Maximum number of iterations during each execution of the k-Means algorithm. | 200 |
Agglomerative | Threshold | Only clusters with an inter-cluster-similarity greater than this threshold are merged during agglomerative clustering. | 0.2 |
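Agglomerative | inter-cluster-similarity | How to measure the similarity of two clusters during agglomerative clustering: the minimum (MIN), maximum (MAX), or average (AVERAGE) similarity between the submissions in each cluster. | AVERAGE |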
Preprocessing | Pre-Processor | How the similarities are preprocessed prior to clustering: none, CDF, percentile, or threshold. Spectral Clustering will probably not have good results without it. | CDF |
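As a worked illustration of the Metric option, the two scores the CLI help names explicitly can be written in terms of |A|, |B|, and |A ∩ B|. This is a sketch with a hypothetical class name, not JPlag's code; MIN and INTERSECTION are omitted because their exact definitions are not stated here.

```java
/** Hypothetical helper, not part of JPlag: two of the similarity metrics. */
final class SimilarityMetricSketch {

    /** AVG: Dice's coefficient, 2 * |A ∩ B| / (|A| + |B|). */
    static double dice(int sizeA, int sizeB, int matched) {
        int total = sizeA + sizeB;
        return total == 0 ? 0.0 : 2.0 * matched / total;
    }

    /** MAX: overlap coefficient, |A ∩ B| / min(|A|, |B|). */
    static double overlap(int sizeA, int sizeB, int matched) {
        int smaller = Math.min(sizeA, sizeB);
        return smaller == 0 ? 0.0 : (double) matched / smaller;
    }

    public static void main(String[] args) {
        // Submission A has 100 tokens; B has 400 tokens; 90 tokens are matched.
        System.out.println(dice(100, 400, 90));    // prints 0.36
        System.out.println(overlap(100, 400, 90)); // prints 0.9
    }
}
```

The example hints at why the overlap coefficient can resist some obfuscation: if B is A padded with a large amount of unrelated code, Dice's coefficient drops, while the overlap coefficient stays high because it only normalizes by the smaller submission.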
All clustering-related classes are contained within the de.jplag.clustering(.*) packages.
The central idea behind the structure of the clustering is ease of use: to use the clustering, calling code should only ever interact with the ClusteringOptions, ClusteringFactory, and ClusteringResult classes.
New clustering algorithms and preprocessors can be implemented using the GenericClusteringAlgorithm and ClusteringPreprocessor interfaces, which operate on similarity matrices only. ClusteringAdapter handles the conversion between de.jplag classes and matrices. PreprocessedClusteringAlgorithm adds a preprocessor onto another ClusteringAlgorithm.
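The composition described above, an algorithm that works on a similarity matrix with a preprocessor wrapped around it, can be sketched as follows. The interface and method names here are hypothetical stand-ins and do not reproduce the signatures of the JPlag interfaces. The agglomerative algorithm is a simplified average-linkage version that always merges the most similar pair first, which is an assumption about the merge order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch, not JPlag's API: matrix-based clustering with a preprocessor. */
final class ClusteringSketch {

    /** Stand-in for an algorithm that clusters the indices of a similarity matrix. */
    interface MatrixClustering {
        List<List<Integer>> cluster(double[][] similarity);
    }

    /** Decorator: applies the preprocessor, then delegates to the wrapped algorithm. */
    static MatrixClustering preprocessed(UnaryOperator<double[][]> preprocessor, MatrixClustering inner) {
        return sim -> inner.cluster(preprocessor.apply(sim));
    }

    /** Bottom-up clustering that merges clusters whose average similarity exceeds the threshold. */
    static MatrixClustering agglomerative(double threshold) {
        return sim -> {
            List<List<Integer>> clusters = new ArrayList<>();
            for (int i = 0; i < sim.length; i++) {
                List<Integer> single = new ArrayList<>();
                single.add(i);
                clusters.add(single);
            }
            while (clusters.size() > 1) {
                int bestA = -1;
                int bestB = -1;
                double best = threshold; // only pairs strictly above the threshold qualify
                for (int a = 0; a < clusters.size(); a++) {
                    for (int b = a + 1; b < clusters.size(); b++) {
                        double s = averageSimilarity(sim, clusters.get(a), clusters.get(b));
                        if (s > best) {
                            best = s;
                            bestA = a;
                            bestB = b;
                        }
                    }
                }
                if (bestA < 0) {
                    break; // no remaining pair is similar enough to merge
                }
                // bestB > bestA, so removing bestB does not shift the index of bestA.
                clusters.get(bestA).addAll(clusters.remove(bestB));
            }
            return clusters;
        };
    }

    /** Mean similarity over all pairs with one member in each cluster. */
    static double averageSimilarity(double[][] sim, List<Integer> a, List<Integer> b) {
        double sum = 0.0;
        for (int i : a) {
            for (int j : b) {
                sum += sim[i][j];
            }
        }
        return sum / (a.size() * b.size());
    }

    public static void main(String[] args) {
        double[][] sim = {
            {0.0, 0.8, 0.1, 0.05},
            {0.8, 0.0, 0.1, 0.05},
            {0.1, 0.1, 0.0, 0.7},
            {0.05, 0.05, 0.7, 0.0}
        };
        System.out.println(agglomerative(0.2).cluster(sim)); // prints [[0, 1], [2, 3]]

        // A preprocessor that zeroes every similarity below 0.75 before clustering.
        UnaryOperator<double[][]> dropWeak = m -> {
            double[][] out = new double[m.length][m.length];
            for (int i = 0; i < m.length; i++) {
                for (int j = 0; j < m.length; j++) {
                    out[i][j] = m[i][j] >= 0.75 ? m[i][j] : 0.0;
                }
            }
            return out;
        };
        System.out.println(preprocessed(dropWeak, agglomerative(0.2)).cluster(sim)); // prints [[0, 1], [2], [3]]
    }
}
```

Because both the algorithm and the preprocessor see only a matrix, either one can be swapped out without touching the other, which is the design benefit of working on matrices only.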
There are integration tests for the Spectral Clustering that verify that, at least for two known sets of similarities, the groups known to be colluders are found. However, these datasets are considered sensitive, so they are not available to the public, and the tests can only be run by maintainers with access.
To run these tests, the contents of the PseudonymizedReports repository must be added to the folder jplag/src/test/resources/de/jplag/PseudonymizedReports.