dodona-edu / dolos

:detective: Source code plagiarism detection

Home Page: https://dolos.ugent.be

License: MIT License

TypeScript 44.66% JavaScript 1.07% HTML 0.11% Vue 35.14% SCSS 0.20% Dockerfile 0.28% Ruby 11.79% Nix 6.23% Shell 0.42% Python 0.10%
learn-to-code code-similarity plagiarism-checker plagiarism-detection education hacktoberfest dodona online-learning collusion-detection plagiarism

dolos's Introduction

Dolos

A plagiarism graph showing a lot of plagiarism.


Dolos is a source code plagiarism detection tool for programming exercises. It helps teachers discover students sharing solutions, even if those solutions have been modified. With its interactive visualizations, Dolos can also be used to raise students' awareness of plagiarism and prevent it.

Dolos aims to be:

  • Easy to use by offering a web app with an intuitive user interface
  • Flexible to support many programming languages
  • Powerful by using state-of-the-art algorithms to help you discover plagiarism

Dolos is a web app that analyses source code files for similarities between them. In addition, it offers a command-line interface to run an analysis locally, which launches a local web server to show the interactive user interface in your browser. The analysis results are available as machine-readable CSV files, and Dolos can be integrated as a JavaScript library in other applications, empowering users to add plagiarism detection to their own workflow.

You can use our free instance of Dolos at https://dolos.ugent.be.

Self-hosting Dolos

As Dolos is open source, you can also host the Dolos web app yourself.

Follow our instructions on https://dolos.ugent.be/docs.

Local installation with Dolos CLI

If you want to run the Dolos CLI instead of using the web app, you can install it on your system using npm:

npm install -g @dodona/dolos

See the installation instructions on our website for more details.

Usage

Dolos is launched from the command-line interface, but it shows its results in your browser.

Launch Dolos using the following command in your terminal:

dolos run -f web path/to/your/files/*

This will launch a web interface with the analysis results at http://localhost:3000.

More elaborate usage instructions are available in the documentation.

Documentation

Visit our web page at https://dolos.ugent.be/docs.

Building and developing

You only need to install the dependencies once, in the repository root, by running npm install. This installs all dependencies and links them in each project's node_modules. You should not run npm install in each project's directory separately.

This will also link the dist folder of the core, lib and web projects, as their versions match in the package.json file. This allows you to develop the CLI, the lib and the web project simultaneously.

Each component has its own build instructions in its own directory.

Components

  • CLI: the command-line interface
  • Core: the JavaScript library with only the core algorithms
  • Parsers: the tree-sitter parsers vendored by Dolos
  • Lib: the Node.js library which can parse and analyze files
  • Web: the graphical user interface in your browser which can be launched using the CLI
  • Docs: the source code of https://dolos.ugent.be
  • API: the API server running the Dolos web app at https://dolos.ugent.be/server

Who made this software?

Dolos is an active research project by Team Dodona at Ghent University. If you use this software for your research, please cite:

dolos's People

Contributors

arnecjacobs, aurisaudentis, baconandchips, bmesuere, chvp, dependabot-preview[bot], dependabot[bot], klassiker, maartenvn, pdawyndt, radim-kliment, renovate[bot], rien, toonijn


dolos's Issues

benchmark datasets

tokenizer test sometimes fails with "Illegal invocation"

 FAIL  src/lib/__tests__/tokenizer.test.ts
  ● Test suite failed to run

    TypeError: Illegal invocation



      at Object.get [as rootNode] (node_modules/tree-sitter/index.js:20:35)
      at Object.<anonymous> (node_modules/tree-sitter/index.js:16:26)
      at Object.<anonymous> (src/lib/codeTokenizer.ts:740:40)

clustering algorithm not ideal

The current clustering algorithm, which is based on the union-find algorithm, groups matches together even when the matches do not meet the criteria to belong to the cluster. For example, suppose we have four files A, B, C and D, with the following matches between them:

  • A with B
  • B with C
  • C with D
  • A with D

Say the first three matches have a high enough score to be grouped together: we then have a single equivalence group containing all four files. The fourth and final match does not have a high enough score for its two files to be considered equal, so nothing changes in the equivalence groups.
This is all as expected within union-find. However, this is a problem for us, as the fourth match ends up grouped together with the first three despite not having a high enough score.
This causes problems when one student copies many code fragments from many different people.
TL;DR: the algorithm assumes an equivalence relation where there isn't one.
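The transitive merging described above can be sketched with a minimal union-find (hypothetical code, not Dolos' actual implementation):

```typescript
// Minimal union-find illustrating the transitivity problem:
// A and D end up in one cluster even though their direct match is weak.
class UnionFind {
  private parent: Map<string, string> = new Map();

  find(x: string): string {
    if (!this.parent.has(x)) this.parent.set(x, x);
    let root = x;
    while (this.parent.get(root) !== root) root = this.parent.get(root)!;
    return root;
  }

  union(a: string, b: string): void {
    this.parent.set(this.find(a), this.find(b));
  }
}

const THRESHOLD = 0.75;
const matches = [
  { left: "A", right: "B", score: 0.9 },
  { left: "B", right: "C", score: 0.8 },
  { left: "C", right: "D", score: 0.85 },
  { left: "A", right: "D", score: 0.3 }, // below threshold, never merged directly
];

const uf = new UnionFind();
for (const m of matches) {
  if (m.score >= THRESHOLD) uf.union(m.left, m.right);
}

// A and D are in the same cluster anyway, transitively through B and C:
console.log(uf.find("A") === uf.find("D")); // true
```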

filter the comparison output

The output should be filtered in at least three ways:

  • exclude matches with the file itself (at post-processor)
  • matches with given boilerplate code (add a blacklist to Comparison)
  • common hashes matching many files. These are probably common constructs. The threshold for this could be user-defined. (at post-processor or when comparing)
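The last two filters could be sketched as a post-processing step like the following (hypothetical names and structures, not Dolos' actual API):

```typescript
// Sketch of post-processing filters: drop self-matches and drop hashes
// that occur in many files (probably common language constructs).
interface RawMatch {
  hash: number;
  left: string;  // file path of the first file
  right: string; // file path of the second file
}

function filterMatches(
  matches: RawMatch[],
  fileCountPerHash: Map<number, number>,
  maxFiles = 10, // user-defined threshold for "common constructs"
): RawMatch[] {
  return matches.filter(m =>
    m.left !== m.right && // exclude matches of a file with itself
    (fileCountPerHash.get(m.hash) ?? 0) <= maxFiles
  );
}
```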

Subtree matching algorithm

PR #78 implements a subtree matching algorithm as an interesting alternative to hash matching. But we need to compare these two techniques to pick which one to use.

@ISteampowered it would be useful to have a link to a description of this algorithm.

Allow AST tokens to match multiple areas in a file

We currently use a Selection class to indicate which area in a file corresponds to an AST token. A selection has a start and a stop location, but the selections of its child tokens overlap with it. It would be ideal to keep track of which parts of the selection belong to that node of the AST itself and which parts belong to its children.

The main reason why we want this: if only the function declaration matches between two files, we currently highlight the full function. This gives the wrong impression that the whole file is matched. In that case, only the declaration and the closing bracket should be highlighted.

We should think about how we'll model this. I see a few options:

  • Model the tokens as a tree (which shouldn't be difficult, the AST already is one) from which we can extract this information.
  • Keep track of which selection(s) belong to the children. Is one inner selection enough, or should this be a list?
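The first option could be sketched as follows (hypothetical types; counting whole rows only, for simplicity):

```typescript
// Sketch of modeling tokens as a tree: a node knows its own selection
// and the selections of its children, so the node's "own" region is
// its selection minus what its children cover.
interface Selection {
  startRow: number;
  endRow: number;
}

interface TokenNode {
  selection: Selection;
  children: TokenNode[];
}

// Number of rows belonging to the node itself (e.g. for a matched
// function declaration: the signature line and the closing bracket).
function ownRows(node: TokenNode): number {
  const total = node.selection.endRow - node.selection.startRow + 1;
  const childRows = node.children.reduce(
    (sum, c) => sum + (c.selection.endRow - c.selection.startRow + 1), 0);
  return Math.max(total - childRows, 0);
}
```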

suggestion: abstract tokenizer

Currently, only a tokenizer for code is implemented. I suggest adding a tokenizer interface (like HashFilter) that can be extended for other tokenizers:

  • Tokenizer for text (or LaTeX) where whitespace and newlines are removed
  • Tokenizer for PDF or docx files that extracts text

I propose something along the lines of:

abstract class Tokenizer<Position> {
  public abstract tokenize(text: string): string;
  // `abstract` and `async` cannot be combined in TypeScript;
  // declaring the Promise return type is sufficient.
  public abstract tokenizeWithMapping(text: string): Promise<[string, Position[]]>;
  // ...
}

Here, Position is a generic type that refers to a mapped position (e.g. a line number for code, or a page and paragraph [number, number] for docx files).

The tokenizer for code would implement this as:

class CodeTokenizer extends Tokenizer<number> {
  ...
}
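As an illustration, a hypothetical plain-text tokenizer implementing this interface could look like this (a sketch only, not a proposed implementation):

```typescript
// The proposed abstract interface, repeated here so the sketch is
// self-contained (without `async` on the abstract declaration).
abstract class Tokenizer<Position> {
  public abstract tokenize(text: string): string;
  public abstract tokenizeWithMapping(text: string): Promise<[string, Position[]]>;
}

// Hypothetical text tokenizer: strips whitespace and newlines and
// maps each token back to its source line number.
class TextTokenizer extends Tokenizer<number> {
  public tokenize(text: string): string {
    return text.split(/\s+/).filter(t => t.length > 0).join(" ");
  }

  public async tokenizeWithMapping(text: string): Promise<[string, number[]]> {
    const tokens: string[] = [];
    const lines: number[] = [];
    text.split("\n").forEach((line, lineNumber) => {
      for (const token of line.split(/\s+/).filter(t => t.length > 0)) {
        tokens.push(token);
        lines.push(lineNumber);
      }
    });
    return [tokens.join(" "), lines];
  }
}
```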

Drop support for Haskell

Unfortunately, the tree-sitter team has mentioned that the Haskell tree-sitter grammar is too complex, and they plan to rewrite it. In the meantime, it is no longer supported.

The Haskell grammar is currently stopping us from migrating to tree-sitter 0.16, which will allow us to use Node 12 again. So I propose that we (temporarily) drop support for Haskell until we have a working tree-sitter plugin for it again.

Web UI: Enhance analysis overview

The overview should give users a good impression, for each intersection, of whether there is a possibility of plagiarism. There will possibly be many entries in this overview, so we should make it easy for users to save their progress.

Possible features:

  • Mark intersections as "certain plagiarism", "suspicious" and "innocent" (better tag names are possible).
  • Add notes to files or intersections.
  • Filter on these intersections.
  • Extension: Give a quick impression of which parts of the intersections are shared/unique (best modeled after the barcode chart of #230)

suffix tree-based methods for clone/plagiarism detection

Rename some classes

One of the hardest things in computer science.

We have the following classes (from bottom to top). I think names in bold need another name, but suggestions for the other classes are welcome as well.

  • File: a submission with source code belonging to a student.
  • SharedKmer: a substring of the AST which is shared by at least two files (which keeps a reference to those files).
  • Match: a connector between two matching files. It keeps a reference to the SharedKmer and the location within those two files.
  • Selection: a part of a file which has a match.
  • Fragment: a collection of one or more subsequent Matches. These will be visualized as one contiguous "block" of code which is shared between two files.
  • Intersection: a comparison between two files, collecting all the Fragments shared between them. The name is chosen because it contains the "overlap" in code between the files.
  • Analysis: which processes and manages the result of analyzing a collection of files (submissions), essentially a list of Intersections.
  • Comparison: the class which actually compares one file with the index of kmers. It takes a list of Files and returns an Analysis.

Authorship attribution

This came from a discussion with Annick, Rien and Pieter:

With plagiarism detection you can only detect similarity between code (either inside Dodona or available on the Internet). In fraud detection, this still doesn't exclude that a user received external help (e.g. during an online test/exam). Therefore, it would be interesting if we could extract an author signature from a user's submission history and then detect deviations from that signature: within the timeline of a user's submissions for a single exercise (e.g. at some point the student got a solution from someone else) or within the timeline of all their submissions.

This issue collects some ideas and literature found around this topic.

filter output

filter output based on two things:

  • language boilerplate
  • exercise boilerplate

routing in html output

Implement something like this in the html generated by the html formatters so that it behaves more like people would expect from a website.

Visualize diff

Comparing two files is the main functionality of Dolos, so this should be a clear and easy-to-use component of our web UI.

Requirements:

  • code highlighting of a single file
    • the highlighter should be flexible enough so we can inject/extend the highlighting with our own styling
  • highlighting which parts are unique and which parts are shared
  • easy and intuitive navigation between two files
    • barcode chart: show a mini map of the file (fitting on one page) showing the parts that are shared (red) and unique (green); in order not to clutter the chart, we might only show the shared regions (hunks)
    • clicking on a shared part should scroll the common part of the other file into view; apart from aligning the two regions in the two files, we could also show a "diff" (in this case a diff between the original texts) between the two shared regions, highlighting their differences
    • if there are multiple matching parts in the other (or the current) file, they should be shown on the mini map when selected; there should be a way to toggle between these multiple parts
    • synchronous scrolling between both files (e.g. after aligning them on two hunks); Atom uses keyboard shortcut Ctrl-Alt-S to toggle scroll-sync on/off
  • an overview of the matches between two files with some basic information (how long, how many kmers); where matches can be selected (showing them on the barcode chart and code viewer) and ignored

fix test “tokenizer with or without location is equal”

Due to a (minor) update of tree-sitter-javascript, one of our tests fails and is disabled for now.
The PR that caused this change (and adds named fields) can be found at tree-sitter/tree-sitter-javascript#96.

We can change our custom output to also include these named fields (apparently a tree-sitter feature), or remove the test. In the latter case, we could add an additional test comparing the output of a fixed sample file with a snapshot.

benchmark results

An issue to aggregate the results and conclusions from the benchmarks.

HTMLFormatter: separate code, data and markup

When creating the HTMLFormatter, we didn't want to depend on a templating engine, since we don't need such a large dependency for what we want to achieve. However, the HTML file is currently generated inline, which makes layout changes unwieldy.

I propose to create an HTML template, which we populate using simple substitutions or a lightweight templating engine like JavaScript-Templates.

We should also look into adding JavaScript or CSS dependencies to the generated HTML page, since we'll want to add visualizations and maybe a CSS framework in the future.

Optimize compare-view rendering

For larger files, rendering a compare-view can get quite slow; for example, on-hover highlighting can take about a second. There are two ways to optimize this:

  • only highlight the lines that are currently visible instead of all the lines in the file
  • optimize the spans so that not each span has to be highlighted individually, but a parent span is used for highlighting instead.

The first option should be tested first, to see whether it alone solves the problem.

Plagiarism graph

Each submission is a node with links to the nodes it has a similarity with above a certain threshold, creating a graph. If we have the moment of submission, we can create a directed graph, giving an indication of who plagiarized from whom.
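A sketch of this construction (hypothetical structures, not Dolos' actual data model): edges are undirected by default, and directed from the later towards the earlier submission when timestamps are known.

```typescript
// Build a similarity graph from pairwise results above a threshold.
interface SimilarPair {
  left: string;
  right: string;
  similarity: number;
  submittedLeft?: Date;  // optional submission timestamps
  submittedRight?: Date;
}

function buildGraph(pairs: SimilarPair[], threshold: number): Map<string, Set<string>> {
  const edges = new Map<string, Set<string>>();
  const addEdge = (from: string, to: string) => {
    if (!edges.has(from)) edges.set(from, new Set());
    edges.get(from)!.add(to);
  };
  for (const p of pairs) {
    if (p.similarity < threshold) continue;
    if (p.submittedLeft && p.submittedRight) {
      // Directed: the later submission likely copied from the earlier one.
      if (p.submittedLeft < p.submittedRight) addEdge(p.right, p.left);
      else addEdge(p.left, p.right);
    } else {
      // No timestamps: undirected edge in both directions.
      addEdge(p.left, p.right);
      addEdge(p.right, p.left);
    }
  }
  return edges;
}
```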

split up app.ts into Dolos.ts (API) and cli.ts (CLI)

  • We should split up our app.ts into a CLI part (cli.ts) and a library part (dodona.ts). The CLI part is then responsible for creating a configuration to call the library with.

Other things that need to be done in app.ts:

  • The option descriptions are currently quite elaborate, which causes them to span multiple lines. We could look into moving these explanations to a document (e.g. the README) and making the option descriptions themselves a bit shorter.
  • Update examples to reflect the current functionality

Allow for plain text matching

Add a setting so the user can choose whether to use the code tree or just the plain text of the code itself for matching.

block list and information view

This component will be shown in the compare-view and will have several functionalities:

  • show a list of blocks that are currently present in the compare-view
  • allow the user to show/hide individual entries in this list
  • allow the user to cycle through this list
  • show information about the currently selected block, such as the matched kmer

List of all submissions that show similarity score above threshold

In addition to having a list of all pairs of submissions with a similarity score above a given threshold, we could also show a list of all submissions that are part of at least one such pair, together with some stats:

  • number of pairs the submission is involved in (number of edges in the graph)
  • highest similarity score with any other submission (maybe with the option to show a diff with the closest submission or with any of the others); show this both as a visual bar and as a percentage; use it as a sorting criterion

Each submission occurs at most once in this list, whereas it may occur multiple times in a list of all pairs. A shorter list might be easier to deal with than a list of pairs.
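The aggregation described above could be sketched as follows (hypothetical structures, not Dolos' actual output format):

```typescript
// Collapse a list of above-threshold pairs into one row per submission,
// with the pair count and the highest similarity as a sorting criterion.
interface ScoredPair { left: string; right: string; similarity: number; }
interface SubmissionRow { submission: string; pairCount: number; maxSimilarity: number; }

function summarize(pairs: ScoredPair[]): SubmissionRow[] {
  const rows = new Map<string, SubmissionRow>();
  const touch = (submission: string, similarity: number) => {
    const row = rows.get(submission)
      ?? { submission, pairCount: 0, maxSimilarity: 0 };
    row.pairCount += 1; // number of edges in the graph
    row.maxSimilarity = Math.max(row.maxSimilarity, similarity);
    rows.set(submission, row);
  };
  for (const p of pairs) {
    touch(p.left, p.similarity);
    touch(p.right, p.similarity);
  }
  // Sort by highest similarity, descending.
  return [...rows.values()].sort((a, b) => b.maxSimilarity - a.maxSimilarity);
}
```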

summarize the raw results

When performing a comparison, for each file, a list of matching files is returned. For each of those matching files, a list of matching lines is returned.

  • At the lowest level, the list of matches should be converted to a list of ranges
  • A similarity score per pair of files should be calculated (based on the number of matching lines or the number of matching k-mers)
  • When reporting, there are 2 options:
    • Pick a file and get the list of matching files, sorted from highest to lowest similarity
    • Show a list of the top pairwise comparisons (this is what MOSS does). While OK, I'm not a great fan of this, because if there are bigger groups of cooperating students, they will take up a large portion at the top of the result table. Such groups should actually be grouped and not listed pair-wise.

Change output format from dolos analysis

The current output format puts all the fragment data into the intersections.csv file. While this is fine for smaller files, it can get quite big for more realistic examples. Putting all the fragment data into the CSV file also has a considerable effect on the web UI, as it is forced to load all the fragments at once. Putting this data into separate files would help resolve these issues.

show file errors

In the Comparison class, unreadable files are ignored and a warning is written to console.error:

console.error(`There was a problem parsing ${file}. ${error}`);

Once we can use Promise.allSettled(), we should use it to keep track of which files we could not read and report them to the user.
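A sketch of what this could look like (the function name readAll and the return shape are hypothetical; Promise.allSettled has been available since Node 12.9):

```typescript
// Read a set of files, collecting failures instead of silently ignoring them.
import { promises as fs } from "fs";

async function readAll(
  paths: string[],
): Promise<{ contents: string[]; failed: string[] }> {
  const results = await Promise.allSettled(
    paths.map(p => fs.readFile(p, "utf8")),
  );
  const contents: string[] = [];
  const failed: string[] = [];
  results.forEach((result, i) => {
    if (result.status === "fulfilled") contents.push(result.value);
    else failed.push(paths[i]); // report these to the user afterwards
  });
  return { contents, failed };
}
```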
