
robyang1024 / ocamoss


Final project code for 3110, MOSS implementation using OCaml

License: MIT License

OCaml 50.44% Makefile 0.11% Standard ML 3.65% Java 22.16% C 5.35% Python 17.33% C++ 0.96%

ocamoss's Introduction

OCaMoss User's Guide

Plagiarism detection software, inspired by MOSS and implemented in OCaml. Runs using a command-line interface.

This is NOT an OCaml client for MOSS; it is a completely separate program. For more details about the system or how MOSS works in general, read the PDF report in this repository or this blog post I wrote.

Note: this was originally written as the final project for a course - it has since been updated by me, so some aspects of the PDF report may not be accurate. In particular, the latest version of OCaMOSS no longer uses a 2-3 tree.


  • To build - make
  • To build & run the REPL - make run
  • To build & run unit tests - make test

Required Dependencies:

  • Yojson
  • ANSITerminal
  • OUnit (for unit tests)

Commands:

(note - commands are case-sensitive)

  • run [threshold] - runs OCaMoss on the working directory. The threshold argument is the fraction of a file that must match another file for it to be flagged as plagiarised. It must be at least 0.4 and at most 1; if omitted, 0.5 is used
  • dir - lists the working directory and the files that it contains
  • setdir [dir] - sets the relative directory to look for files and resets any results
  • results - lists the file names for which there are results
  • results [filename] - lists the detailed results of overlap for that file (Make sure to include the extension of the file)
  • resultpairs - lists all the pairs of files for which there are positive results
  • compare [fileA] [fileB] - prints out specific overlaps of fileA and fileB (Make sure to include the extension of the files)
  • quit - exits the REPL
  • help - display the available commands
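
The constraints on the run argument can be sketched in a few lines of OCaml. This is only an illustration and is not taken from the OCaMoss source: the names default_threshold and parse_threshold are hypothetical, and the default of 0.5 is inferred from the tutorial's note that run 0.5 is the same as run.

```ocaml
(* Hypothetical sketch of validating the argument to `run`.
   These names do not come from the OCaMoss source. *)

(* Assumed default, since `run 0.5` is documented as equivalent to `run`. *)
let default_threshold = 0.5

(* [parse_threshold arg] returns the threshold to use, or an error message
   if the argument is not a number in the documented range [0.4, 1]. *)
let parse_threshold (arg : string option) : (float, string) result =
  match arg with
  | None -> Ok default_threshold
  | Some s -> (
      match float_of_string_opt (String.trim s) with
      | Some t when t >= 0.4 && t <= 1.0 -> Ok t
      | Some _ -> Error "threshold must be at least 0.4 and at most 1"
      | None -> Error "threshold must be a number")
```

With this sketch, parse_threshold None and parse_threshold (Some "0.5") both yield Ok 0.5, while out-of-range or non-numeric input yields an Error.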

Usage instructions/tutorial:

  1. setdir to the folder you want to test. Requirements: file names have no spaces and all files have the same extension (example: setdir tests/test1)

  2. run with the desired threshold (example: run 0.5, which is the same as run with no argument)

  3. results to view the list of results, results [filename] to view the results for a specific file, and compare [fileA] [fileB] to compare matching patterns for two files (example: results Camel.txt)

    Example for running test case 1 and inspecting results:

    1. setdir tests/test1
    2. run
    3. any of: results, results intset.ml, compare intset1.ml intset.ml, resultpairs

Other information:

Similarity score:

  • used as a measure of how likely file A plagiarized from file B
  • computed as (number of matching hashes between A and B) / (number of hashes in the fingerprint for A)
  • overall similarity score for A is the average of all similarity scores for file A that are > 0.5
  • threshold score for detecting possible plagiarism varies with the file type, but experimentally we determined it to be around 0.5
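
A minimal sketch of the score described above, written in OCaml. It is an illustration under stated assumptions, not the project's implementation: fingerprints are modeled as lists of integer hashes, duplicate hashes are collapsed by using a set, and the names similarity and overall_score are hypothetical.

```ocaml
(* Hypothetical sketch: a fingerprint is modeled as a list of integer hashes.
   None of these names come from the OCaMoss source. *)
module IntSet = Set.Make (Int)

(* [similarity fp_a fp_b] is the number of hashes of A that also occur in B,
   divided by the number of hashes in A. An empty fingerprint scores 0. *)
let similarity (fp_a : int list) (fp_b : int list) : float =
  let a = IntSet.of_list fp_a in
  let b = IntSet.of_list fp_b in
  if IntSet.is_empty a then 0.0
  else
    float_of_int (IntSet.cardinal (IntSet.inter a b))
    /. float_of_int (IntSet.cardinal a)

(* [overall_score scores] averages the pairwise scores for one file that are
   strictly greater than [cutoff] (0.5 by default), or 0 if none qualify. *)
let overall_score ?(cutoff = 0.5) (scores : float list) : float =
  match List.filter (fun s -> s > cutoff) scores with
  | [] -> 0.0
  | kept -> List.fold_left ( +. ) 0.0 kept /. float_of_int (List.length kept)
```

Because the denominator is the size of A's fingerprint only, the score is not symmetric: a file that is a strict subset of another scores high against it, while the larger file scores lower in the other direction. This is consistent with test case 4 below, where functions1.ml is flagged but functions.ml is not.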

Supported languages/file formats:

  • OCaml - .ml
  • Java - .java
  • C - .c
  • Python - .py
  • English - .txt (note: English comparison does NOT account for semantics)
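
The extension-to-language mapping above could be expressed as follows. This is a hedged sketch only: the language type and the language_of_file function are hypothetical names, and how OCaMoss actually selects a language is not described here.

```ocaml
(* Hypothetical sketch of mapping a file name to one of the supported
   formats by extension. Names are illustrative, not from OCaMoss. *)
type language = OCaml | Java | C | Python | English

(* [language_of_file name] is [Some lang] for a supported extension,
   and [None] for anything else (including files with no extension). *)
let language_of_file (name : string) : language option =
  match Filename.extension name with
  | ".ml" -> Some OCaml
  | ".java" -> Some Java
  | ".c" -> Some C
  | ".py" -> Some Python
  | ".txt" -> Some English
  | _ -> None
```

For example, language_of_file "intset.ml" gives Some OCaml, and an unsupported name such as "main.cpp" gives None.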

Self-generated test case descriptions (test case N is in directory tests/testN):

NOTE: to replicate results, run using threshold = 0.4

  1. exact duplicates - should return positive result
  2. variable names changed - should return positive result
  3. functions/comments reordered - should return positive result
  4. functions1.ml is a copy of functions.ml but with large sections deleted - should return positive result for functions1 but not functions
  5. different implementations of the same algorithm - should NOT return positive result
  6. completely different files - should NOT return positive result
  7. functions/comments reordered - should return positive result
  8. more than 2 files - files changed respectively as follows: function/variable names changed; random spaces/new lines added; rec declarations/match statement lines changed - should return positive result for all files except for lab034.ml, which is a dummy.
  9. more than 2 files - files changed respectively as follows: same comments but different code; comments deleted and same code with variable/function names changed - should NOT return positive result for first and should return positive result for second
  10. large group of all different files - should NOT return positive result
  11. txt files check - files are changed respectively as follows: exact Wikipedia article; edited but very similar Wikipedia article; sentences of the original shifted around; exact same; a file that says “camel” five times; a more hazy edit of the original - should return positive result for all except the last two: “Camels.txt” and “CamelMaybeCopy.txt”
  12. Java check - test for a Java file, where one file has all comments removed
  13. C check - test for a C file where one file has all comments removed
  14. Python check - test for a Python file, where one file has comments removed and variable names changed.

ocamoss's People

Contributors

aniroodh-ravikumar, gjain234, mingboiz, robyang1024, yangdanny97


ocamoss's Issues

Extend to other languages - C++

Hey there, love the work! I would like to fork - extending to C++ specifically and I was wondering if it's possible to give me some general directions on how I may do that? Thank you!
