dodona-edu / dolos

:detective: Source code plagiarism detection

Home Page: https://dolos.ugent.be

License: MIT License

TypeScript 44.66% JavaScript 1.07% HTML 0.11% Vue 35.14% SCSS 0.20% Dockerfile 0.28% Ruby 11.79% Nix 6.23% Shell 0.42% Python 0.10%
learn-to-code code-similarity plagiarism-checker plagiarism-detection education hacktoberfest dodona online-learning collusion-detection plagiarism

dolos's Introduction

Dolos

A plagiarism graph showing a lot of plagiarism.


Dolos is a source code plagiarism detection tool for programming exercises. It helps teachers discover students sharing solutions, even if those solutions have been modified. With its interactive visualizations, Dolos can also be used to raise students' awareness of plagiarism and prevent it.

Dolos aims to be:

  • Easy to use by offering a web app with an intuitive user interface
  • Flexible to support many programming languages
  • Powerful by using state-of-the-art algorithms to help you discover plagiarism

Dolos is a web app that analyses source code files for similarities between them. In addition, it offers a command-line interface to run an analysis locally, which launches a local web server to show the interactive user interface in your browser. The analysis results are available as machine-readable CSV files, and Dolos can be integrated as a JavaScript library in other applications, empowering users to add plagiarism detection to their own workflow.

You can use our free instance of Dolos at https://dolos.ugent.be.

Self-hosting Dolos

As Dolos is open source, you can also host the Dolos web app yourself.

Follow our instructions on https://dolos.ugent.be/docs.

Local installation with Dolos CLI

If you want to run the Dolos CLI instead of using the web app, you can install it on your system using npm:

npm install -g @dodona/dolos

See the installation instructions on our website for more details.

Usage

Dolos is launched from the command-line interface, but it shows its results in your browser.

Launch Dolos using the following command in your terminal:

dolos run -f web path/to/your/files/*

This will launch a web interface with the analysis results at http://localhost:3000.

More elaborate usage instructions are available in the documentation.

Documentation

Visit our web page at https://dolos.ugent.be/docs.

Building and developing

You only need to install the dependencies once, in the repository root, by running npm install. This installs all dependencies and links them in each project's node_modules. You should not run npm install in each project's directory separately.

This will also link the dist folder of the core, lib and web projects, as their versions match in the package.json file. This allows you to develop the CLI, the lib and the web project simultaneously.

Each component has its own build instructions in its own directory.

Components

  • CLI: the command-line interface
  • Core: the JavaScript library with only the core algorithms
  • Parsers: the tree-sitter parsers vendored by Dolos
  • Lib: the Node.js library which can parse and analyze files
  • Web: the graphical user interface in your browser which can be launched using the CLI
  • Docs: the source code of https://dolos.ugent.be
  • API: the API server running the Dolos web app at https://dolos.ugent.be/server

Who made this software?

Dolos is an active research project by Team Dodona at Ghent University. If you use this software for your research, please cite:

dolos's People

Contributors

arnecjacobs, aurisaudentis, baconandchips, bmesuere, chvp, dependabot-preview[bot], dependabot[bot], klassiker, maartenvn, pdawyndt, radim-kliment, renovate[bot], rien, toonijn


dolos's Issues

benchmark datasets

tokenizer test sometimes fails with "Illegal invocation"

 FAIL  src/lib/__tests__/tokenizer.test.ts
  ● Test suite failed to run

    TypeError: Illegal invocation



      at Object.get [as rootNode] (node_modules/tree-sitter/index.js:20:35)
      at Object.<anonymous> (node_modules/tree-sitter/index.js:16:26)
      at Object.<anonymous> (src/lib/codeTokenizer.ts:740:40)

clustering algorithm not ideal

The current clustering algorithm, which is based on the union-find algorithm, groups matches together even when the matches do not meet the criteria to belong to the cluster. For example, suppose we have four files A, B, C and D, with the following matches between them:

  • A with B
  • B with C
  • C with D
  • A with D

Say the first three matches have a high enough score to be grouped together: we then have a single equivalence group containing all four files. The fourth and final match does not have a high enough score for its two files to be considered equal, so nothing changes in the equivalence groups.
This is all as expected within union-find. However, this is a problem for us, as the fourth match ends up grouped together with the first three despite not having a high enough score.
This causes problems when one student copies many code fragments from many different people.
TL;DR: the algorithm assumes an equivalence relation where there isn't one.
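The transitive merging described above can be sketched with a minimal union-find (hypothetical code, not Dolos' actual implementation):

```typescript
// Minimal union-find illustrating the transitivity problem:
// A and D end up in one cluster even though their direct match is weak.
class UnionFind {
  private parent: Map<string, string> = new Map();

  find(x: string): string {
    if (!this.parent.has(x)) this.parent.set(x, x);
    let root = x;
    while (this.parent.get(root) !== root) root = this.parent.get(root)!;
    return root;
  }

  union(a: string, b: string): void {
    this.parent.set(this.find(a), this.find(b));
  }
}

const THRESHOLD = 0.75;
const matches = [
  { left: "A", right: "B", score: 0.9 },
  { left: "B", right: "C", score: 0.8 },
  { left: "C", right: "D", score: 0.85 },
  { left: "A", right: "D", score: 0.3 }, // below threshold, never merged directly
];

const uf = new UnionFind();
for (const m of matches) {
  if (m.score >= THRESHOLD) uf.union(m.left, m.right);
}

// A and D are in the same cluster anyway, transitively through B and C:
console.log(uf.find("A") === uf.find("D")); // true
```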

filter the comparison output

The output should be filtered in at least three ways:

  • exclude matches with the file itself (at post-processor)
  • matches with given boilerplate code (add a blacklist to Comparison)
  • common hashes matching many files. These are probably common constructs. The threshold for this could be user-defined. (at post-processor or when comparing)
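The last two filters could be sketched as a post-processing step like the following (hypothetical names and structures, not Dolos' actual API):

```typescript
// Sketch of post-processing filters: drop self-matches and drop hashes
// that occur in many files (probably common language constructs).
interface RawMatch {
  hash: number;
  left: string;  // file path of the first file
  right: string; // file path of the second file
}

function filterMatches(
  matches: RawMatch[],
  fileCountPerHash: Map<number, number>,
  maxFiles = 10, // user-defined threshold for "common constructs"
): RawMatch[] {
  return matches.filter(m =>
    m.left !== m.right && // exclude matches of a file with itself
    (fileCountPerHash.get(m.hash) ?? 0) <= maxFiles
  );
}
```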

Subtree matching algorithm

PR #78 implements a subtree matching algorithm as an interesting alternative to hash matching. But we need to compare these two techniques to pick which one to use.

@ISteampowered it would be useful to have a link to a description of this algorithm.

Allow AST tokens to match multiple areas in a file

We currently use a Selection class to indicate which area in a file corresponds to an AST token. A selection has a start and a stop location, but the selections of its child tokens overlap with it. It would be ideal to keep track of which parts of the selection belong to that node of the AST itself and which parts belong to its children.

The main reason why we want this: if only the function declaration matches between two files, we currently highlight the full function. This gives the wrong impression that the whole file is matched. In that case, only the declaration and the closing bracket should be highlighted.

We should think about how we'll model this. I see a few options:

  • Model the tokens as a tree (which shouldn't be difficult, the AST already is one) from which we can extract this information.
  • Keep track of which selection(s) belong to the children. Is one inner selection enough, or should this be a list?
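The first option could be sketched as follows (hypothetical types; counting whole rows only, for simplicity):

```typescript
// Sketch of modeling tokens as a tree: a node knows its own selection
// and the selections of its children, so the node's "own" region is
// its selection minus what its children cover.
interface Selection {
  startRow: number;
  endRow: number;
}

interface TokenNode {
  selection: Selection;
  children: TokenNode[];
}

// Number of rows belonging to the node itself (e.g. for a matched
// function declaration: the signature line and the closing bracket).
function ownRows(node: TokenNode): number {
  const total = node.selection.endRow - node.selection.startRow + 1;
  const childRows = node.children.reduce(
    (sum, c) => sum + (c.selection.endRow - c.selection.startRow + 1), 0);
  return Math.max(total - childRows, 0);
}
```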

suggestion: abstract tokenizer

Currently, only a tokenizer for code is implemented. I suggest adding a tokenizer interface (like HashFilter) that can be extended for other tokenizers:

  • Tokenizer for text (or LaTeX) where whitespace and newlines are removed
  • Tokenizer for PDF or docx files that extracts text

I propose something along the lines of:

abstract class Tokenizer<Position> {
  public abstract tokenize(text: string): string;
  // `abstract` and `async` cannot be combined in TypeScript;
  // declaring the Promise return type is sufficient.
  public abstract tokenizeWithMapping(text: string): Promise<[string, Position[]]>;
  // ...
}

Here, Position is a generic type that refers to a mapped position (e.g. a line number for code, or a page and paragraph [number, number] for docx files).

The tokenizer for code would implement this as:

class CodeTokenizer extends Tokenizer<number> {
  ...
}
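As an illustration, a hypothetical plain-text tokenizer implementing this interface could look like this (a sketch only, not a proposed implementation):

```typescript
// The proposed abstract interface, repeated here so the sketch is
// self-contained (without `async` on the abstract declaration).
abstract class Tokenizer<Position> {
  public abstract tokenize(text: string): string;
  public abstract tokenizeWithMapping(text: string): Promise<[string, Position[]]>;
}

// Hypothetical text tokenizer: strips whitespace and newlines and
// maps each token back to its source line number.
class TextTokenizer extends Tokenizer<number> {
  public tokenize(text: string): string {
    return text.split(/\s+/).filter(t => t.length > 0).join(" ");
  }

  public async tokenizeWithMapping(text: string): Promise<[string, number[]]> {
    const tokens: string[] = [];
    const lines: number[] = [];
    text.split("\n").forEach((line, lineNumber) => {
      for (const token of line.split(/\s+/).filter(t => t.length > 0)) {
        tokens.push(token);
        lines.push(lineNumber);
      }
    });
    return [tokens.join(" "), lines];
  }
}
```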

Drop support for Haskell

Unfortunately, the tree-sitter team has mentioned that the Haskell tree-sitter grammar is too complex, and they plan to rewrite it. In the meantime, it is no longer supported.

The Haskell grammar is currently stopping us from migrating to tree-sitter 0.16, which will allow us to use Node 12 again. So I propose that we (temporarily) drop support for Haskell until we have a working tree-sitter plugin for it again.

Web UI: Enhance analysis overview

The overview should give users a good impression, for each intersection, of whether there is a possibility of plagiarism. There will possibly be many entries in this overview, so we should make it easy for users to save their progress.

Possible features:

  • Mark intersections as "certain plagiarism", "suspicious" and "innocent" (better tag names are possible).
  • Add notes to files or intersections.
  • Filter on these intersections.
  • Extension: Give a quick impression of which parts of the intersections are shared/unique (best modeled after the barcode chart of #230)

suffix tree-based methods for clone/plagiarism detection

Rename some classes

One of the hardest things in computer science.

We have the following classes (from bottom to top). I think names in bold need another name, but suggestions for the other classes are welcome as well.

  • File: a submission with source code belonging to a student.
  • SharedKmer: a substring of the AST which is shared by at least two files (which keeps a reference to those files).
  • Match: a connector between two matching files. It keeps a reference to the SharedKmer and the location within those two files.
  • Selection: a part of a file which has a match.
  • Fragment: a collection of one or more subsequent Matches. These will be visualized as one contiguous "block" of code which is shared between two files.
  • Intersection: a comparison between two files, collecting all the Fragments shared between them. The name is chosen because it contains the "overlap" in code between the files.
  • Analysis: which processes and manages the result of analyzing a collection of files (submissions), essentially a list of Intersections.
  • Comparison: the class which actually compares one file with the index of kmers. It takes a list of Files and returns an Analysis.

Authorship attribution

This came from a discussion with Annick, Rien and Pieter:

With plagiarism detection you can only detect similarity between code (either inside Dodona or available on the Internet). In fraud detection, this still doesn't exclude that a user received external help (e.g. during an online test/exam). Therefore, it would be interesting if we could extract an author signature from a user's submission history and then detect deviations from that signature: within the timeline of a user's submissions for a single exercise (e.g. at some point the student got a solution from someone else) or within the timeline of all their submissions.

This issue collects some ideas and literature found around this topic.

filter output

filter output based on two things:

  • language boilerplate
  • exercise boilerplate

routing in html output

Implement something like this in the html generated by the html formatters so that it behaves more like people would expect from a website.

Visualize diff

Comparing two files is the main functionality of Dolos, so this should be a clear and easy-to-use component of our web UI.

Requirements:

  • code highlighting of a single file
    • the highlighter should be flexible enough so we can inject/extend the highlighting with our own styling
  • highlighting which parts are unique and which parts are shared
  • easy and intuitive navigation between two files
    • barcode chart: show a mini map of the file (fitting on one page) showing the parts that are shared (red) and unique (green); in order not to clutter the chart, we might only show the shared regions (hunks)
    • clicking on a shared part should scroll the common part of the other file into view; apart from aligning the two regions in the two files, we could also show a "diff" (in this case a diff between the original texts) between the two shared regions, highlighting their differences
    • if there are multiple matching parts in the other (or the current) file, they should be shown on the mini map when selected; there should be a way to toggle between these multiple parts
    • synchronous scrolling between both files (e.g. after aligning them on two hunks); Atom uses keyboard shortcut Ctrl-Alt-S to toggle scroll-sync on/off
  • an overview of the matches between two files with some basic information (how long, how many kmers); where matches can be selected (showing them on the barcode chart and code viewer) and ignored

fix test “tokenizer with or without location is equal”

Due to a (minor) update of tree-sitter-javascript, one of our tests fails and is disabled for now.
The PR that caused this change (and adds named fields) can be found at tree-sitter/tree-sitter-javascript#96.

We can change our custom output to also include these named fields (apparently a tree-sitter feature), or remove the test. In the latter case, we could add an additional test comparing the output of a fixed sample file with a snapshot.

benchmark results

An issue to aggregate the results and conclusions from the benchmarks.

HTMLFormatter: separate code, data and markup

When creating the HTMLFormatter, we didn't want to depend on a templating engine, since we don't need such a large dependency for what we want to achieve. However, the HTML file is currently generated inline, which makes layout changes unwieldy.

I propose to create an HTML template, which we populate using simple substitutions or a lightweight templating engine like JavaScript-Templates.

We should also look into adding JavaScript or CSS dependencies to the generated HTML page, since we'll want to add visualizations and maybe a CSS framework in the future.

Optimize compare-view rendering

For larger files, rendering a compare-view can get quite slow; for example, on-hover highlighting can take about a second. There are two ways to optimize this:

  • only highlight the lines that are currently visible instead of all the lines in the file
  • optimize the spans so that not each span has to be highlighted individually, but a parent span is used for highlighting instead.

The first option should be tested first, to see whether it alone solves the problem.

Plagiarism graph

Each submission is a node with links to the nodes it has a similarity with above a certain threshold, creating a graph. If we have the moment of submission, we can create a directed graph, giving an indication of who plagiarized from whom.
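A sketch of this construction (hypothetical structures, not Dolos' actual data model): edges are undirected by default, and directed from the later towards the earlier submission when timestamps are known.

```typescript
// Build a similarity graph from pairwise results above a threshold.
interface SimilarPair {
  left: string;
  right: string;
  similarity: number;
  submittedLeft?: Date;  // optional submission timestamps
  submittedRight?: Date;
}

function buildGraph(pairs: SimilarPair[], threshold: number): Map<string, Set<string>> {
  const edges = new Map<string, Set<string>>();
  const addEdge = (from: string, to: string) => {
    if (!edges.has(from)) edges.set(from, new Set());
    edges.get(from)!.add(to);
  };
  for (const p of pairs) {
    if (p.similarity < threshold) continue;
    if (p.submittedLeft && p.submittedRight) {
      // Directed: the later submission likely copied from the earlier one.
      if (p.submittedLeft < p.submittedRight) addEdge(p.right, p.left);
      else addEdge(p.left, p.right);
    } else {
      // No timestamps: undirected edge in both directions.
      addEdge(p.left, p.right);
      addEdge(p.right, p.left);
    }
  }
  return edges;
}
```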

split up app.ts into Dolos.ts (API) and cli.ts (CLI)

  • We should split up our app.ts into a CLI part (cli.ts) and a library part (dodona.ts). The CLI part is then responsible for creating a configuration to call the library with.

Other things that need to be done in app.ts:

  • The option descriptions are currently quite elaborate, which causes them to span multiple lines. We could look into moving these explanations to a document (e.g. the README) and making the option descriptions themselves a bit shorter.
  • Update examples to reflect the current functionality

Allow for plain text matching

Add a setting so the user can choose whether to use the code tree or just the plain text of the code itself for matching.

block list and information view

This component will be shown in the compare-view and will have several functionalities:

  • show a list of blocks that are currently present in the compare-view
  • allow the user to show/hide individual entries in this list
  • allow the user to cycle through this list
  • show information about the currently selected block, such as the matched kmer

List of all submissions that show similarity score above threshold

In addition to having a list of all pairs of submissions with a similarity score above a given threshold, we could also show a list of all submissions that are part of at least one such pair, together with some stats:

  • number of pairs the submission is involved in (number of edges in the graph)
  • highest similarity score with any other submission (maybe with the option to show a diff with the closest submission or with any of the others); show this both as a visual bar and as a percentage; use it as a sorting criterion

Each submission occurs at most once in this list, whereas it may occur multiple times in a list of all pairs. A shorter list might be easier to deal with than a list of pairs.
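The aggregation described above could be sketched as follows (hypothetical structures, not Dolos' actual output format):

```typescript
// Collapse a list of above-threshold pairs into one row per submission,
// with the pair count and the highest similarity as a sorting criterion.
interface ScoredPair { left: string; right: string; similarity: number; }
interface SubmissionRow { submission: string; pairCount: number; maxSimilarity: number; }

function summarize(pairs: ScoredPair[]): SubmissionRow[] {
  const rows = new Map<string, SubmissionRow>();
  const touch = (submission: string, similarity: number) => {
    const row = rows.get(submission)
      ?? { submission, pairCount: 0, maxSimilarity: 0 };
    row.pairCount += 1; // number of edges in the graph
    row.maxSimilarity = Math.max(row.maxSimilarity, similarity);
    rows.set(submission, row);
  };
  for (const p of pairs) {
    touch(p.left, p.similarity);
    touch(p.right, p.similarity);
  }
  // Sort by highest similarity, descending.
  return [...rows.values()].sort((a, b) => b.maxSimilarity - a.maxSimilarity);
}
```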

summarize the raw results

When performing a comparison, for each file, a list of matching files is returned. For each of those matching files, a list of matching lines is returned.

  • At the lowest level, the list of matches should be converted to a list of ranges
  • A similarity score per pair of files should be calculated (based on the number of matching lines or the number of matching k-mers)
  • When reporting, there are 2 options:
    • Pick a file and get the list of matching files, sorted from highest to lowest similarity
    • Show a list of the top pairwise comparisons (this is what MOSS does). While OK, I'm not a great fan of this, because if there are bigger groups of cooperating students, they will take up a large portion at the top of the result table. Such groups should actually be grouped and not listed pair-wise.

Change output format from dolos analysis

The current output format puts all the fragment data into the intersections.csv file. While this is fine for smaller files, it can get quite big for more realistic examples. Putting all the fragment data into the CSV file also has a considerable effect on the web UI, as it is forced to load all the fragments at once. Putting this data into separate files would help resolve these issues.

show file errors

In the Comparison class, unreadable files are ignored and a warning is written to console.error:

console.error(`There was a problem parsing ${file}. ${error}`);

Once we can use Promise.allSettled(), we should use it to keep track of which files we could not read and report them to the user.
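A sketch of what this could look like (the function name readAll and the return shape are hypothetical; Promise.allSettled has been available since Node 12.9):

```typescript
// Read a set of files, collecting failures instead of silently ignoring them.
import { promises as fs } from "fs";

async function readAll(
  paths: string[],
): Promise<{ contents: string[]; failed: string[] }> {
  const results = await Promise.allSettled(
    paths.map(p => fs.readFile(p, "utf8")),
  );
  const contents: string[] = [];
  const failed: string[] = [];
  results.forEach((result, i) => {
    if (result.status === "fulfilled") contents.push(result.value);
    else failed.push(paths[i]); // report these to the user afterwards
  });
  return { contents, failed };
}
```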
