Compare with MOSS results (dolos, closed)

dodona-edu commented on June 30, 2024
Compare with MOSS results

from dolos.

Comments (15)

ArneCJacobs commented on June 30, 2024

At the time of commit 7a6fd96, all the Dolos flags match the Moss flags, so from now on both applications are run with the same flags.

ArneCJacobs commented on June 30, 2024

Yeah sure, at the time of commit e7c8b14, running Dolos on the same files as in my previous comment takes roughly 2s.

bmesuere commented on June 30, 2024

It seems that our algorithm can't handle the following code segment very well:

You should try to find out why. Start by looking at the raw results instead of the summary and see if we have matches for each of the lines. If not, try to play with the kmer length and filter strength.
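
To make the two knobs mentioned above concrete, here is a minimal sketch, assuming a scheme in which every window of k consecutive tokens is treated as one k-mer and k-mers that occur too often are filtered out. The helper names, the token names, and the maxCount parameter are hypothetical and are not taken from the Dolos code base.

```typescript
// Hypothetical helpers for illustration; not the Dolos implementation.

/** Join every window of k consecutive tokens into one k-mer string. */
function kmers(tokens: string[], k: number): string[] {
  const result: string[] = [];
  for (let i = 0; i + k <= tokens.length; i++) {
    result.push(tokens.slice(i, i + k).join(" "));
  }
  return result;
}

/** Count how often each k-mer occurs across all tokenized files. */
function countKmers(files: string[][], k: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const tokens of files) {
    for (const kmer of kmers(tokens, k)) {
      counts.set(kmer, (counts.get(kmer) ?? 0) + 1);
    }
  }
  return counts;
}

/** The "filter": keep only k-mers seen at most maxCount times (assumed knob). */
function keepRare(counts: Map<string, number>, maxCount: number): string[] {
  const kept: string[] = [];
  counts.forEach((count, kmer) => {
    if (count <= maxCount) kept.push(kmer);
  });
  return kept;
}

// A repetitive literal such as { "01": "EEN", "02": "TWEE", ... } might
// tokenize to the same short pattern over and over (assumed token names).
const entry = ["string", ":", "string", ","];
const literal: string[] = [];
for (let i = 0; i < 6; i++) literal.push(...entry);

const counts = countKmers([literal, literal], 4);
console.log(counts.size, keepRare(counts, 4).length); // 4 0
```

In this toy input only four distinct k-mers exist and each occurs at least ten times, so a strict filter removes all of them. A longer k or a looser filter changes which repetitive fragments survive.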

ArneCJacobs commented on June 30, 2024

At the time of commit 9661fc5, the highlights mostly match, except for two notable differences:

  • the scores given by Moss do not match the scores from Dolos.
  • when comparing copied_function.js and sample.js located in samples/js/, both Moss and Dolos match the same lines, except that Dolos does not include the return statement.

Moss result. The minimumLines constant in summary.ts was set to 2.

ArneCJacobs commented on June 30, 2024

This compares the -d flag.
Both applications are run with -l javascript -m 10 -d $(find ./samples/js/assignment1/ -type f). The structure of the dummy files is:

assignment1/
├── student1
│   ├── main.js
│   ├── sample.js
│   └── subDirectory
│       └── childClass.js
├── student2
│   ├── copied_function.js
│   ├── helperClasses
│   │   └── childClass.js
│   └── main.js
└── student3
    ├── another_copied_function.js
    ├── main.js
    └── tempName
        ├── childClass.js
        ├── hello.js
        └── subDir
            └── subsubClass.js

Everything except sample.js, copied_function.js and another_copied_function.js is empty. Both copied_function.js and another_copied_function.js contain lines from sample.js.
Here the first three results from Moss have a corresponding match within the Dolos results. After that, the Moss results become very odd, which I can only assume is a bug of sorts. Another noticeable difference lies in the sorting, but this is intentional (described in #12).

ArneCJacobs commented on June 30, 2024

Both programs were run with -l javascript -m 10. When comparing 51 real exercises from students, Moss takes ~11s while Dolos takes ~26s. The output of Dolos can't really be compared, as it contains 31875 lines.

bmesuere commented on June 30, 2024

Can you post an updated runtime with the new code? Be sure to only run the code and not to include the tsc compile step.
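
One way to keep the compile step out of the measurement is to time only the call that does the work inside the already compiled program. The sketch below assumes Node's perf_hooks module; the timed helper is hypothetical and not part of Dolos, and it measures wall-clock time for a single run without including Node startup.

```typescript
import { performance } from "perf_hooks";

/** Run fn once and return its result together with the elapsed time in ms. */
async function timed<T>(fn: () => Promise<T> | T): Promise<{ result: T; ms: number }> {
  const start = performance.now();
  const result = await fn();
  return { result, ms: performance.now() - start };
}

// Usage with a stand-in workload; replace the callback with the real analysis call.
timed(() => {
  let sum = 0;
  for (let i = 0; i < 1_000_000; i++) sum += i;
  return sum;
}).then(({ ms }) => console.log(`took ${ms.toFixed(1)} ms`));
```

Timing the compiled entry point from the shell (for example the node ./dist/app.js invocation used elsewhere in this thread) would be an alternative that also includes startup time.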

ArneCJacobs commented on June 30, 2024

It seems that our algorithm can't handle the following code segment very well:

code block

        let minuten = {
            "00": "HET IS",
            "05": "HET IS VIJF OVER",
            "10": "HET IS TIEN OVER",
            "15": "HET IS KWART OVER",
            "20": "HET IS TIEN VOOR HALF",
            "25": "HET IS VIJF VOOR HALF",
            "30": "HET IS HALF",
            "35": "HET IS VIJF OVER HALF",
            "40": "HET IS TIEN OVER HALF",
            "45": "HET IS KWART VOOR",
            "50": "HET IS TIEN VOOR",
            "55": "HET IS VIJF VOOR"
        };

        let uren = {
            "01": "EEN",
            "02": "TWEE",
            "03": "DRIE",
            "04": "VIER",
            "05": "VIJF",
            "06": "ZES",
            "07": "ZEVEN",
            "08": "ACHT",
            "09": "NEGEN",
            "10": "TIEN",
            "11": "ELF",
            "12": "TWAALF",
            "13": "EEN",
            "14": "TWEE", 
            "15": "DRIE",
            "16": "VIER",
            "17": "VIJF",
            "18": "ZES",
            "19": "ZEVEN",
            "20": "ACHT",
            "21": "NEGEN",
            "22": "TIEN",
            "23": "ELF",
            "00": "TWAALF"
        };

Dolos gets confused and splits it up in many small parts while Moss seems to handle it mostly fine.

Dolos results: (image: DolosResults)

Moss results: (image: MossResults)

ArneCJacobs commented on June 30, 2024

Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.
(image: DolosDoubles)
The options used were: -l javascript -m 10 -M 0.9 -c 'a happy comment' -s 1 -g 7 -o html -v 0, on a series of about 50 files.

bmesuere commented on June 30, 2024

Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.

Would it make sense to only report on the order with the highest value (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the doubles and the results would become symmetric (which was an issue for the clustering?)
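
A minimal sketch of that suggestion, assuming each result is a scored entry for an ordered pair of files. The ScoredPair shape and the function names are hypothetical and do not reflect the actual Dolos result types.

```typescript
// Hypothetical result shape: one scored entry per ordered file pair.
interface ScoredPair {
  fileA: string;
  fileB: string;
  score: number;
}

/** Order-independent key for a pair of files. */
function pairKey(a: string, b: string): string {
  // Serializing the sorted pair avoids collisions from plain concatenation.
  return JSON.stringify(a < b ? [a, b] : [b, a]);
}

/** Keep, for every unordered pair, only the entry with the highest score. */
function keepBestDirection(pairs: ScoredPair[]): ScoredPair[] {
  const best = new Map<string, ScoredPair>();
  for (const pair of pairs) {
    const key = pairKey(pair.fileA, pair.fileB);
    const current = best.get(key);
    if (current === undefined || pair.score > current.score) {
      best.set(key, pair);
    }
  }
  return Array.from(best.values());
}

const deduped = keepBestDirection([
  { fileA: "a.js", fileB: "b.js", score: 0.4 },
  { fileA: "b.js", fileB: "a.js", score: 0.7 },
  { fileA: "a.js", fileB: "c.js", score: 0.2 },
]);
console.log(deduped.length); // 2
```

The reported score no longer depends on argument order, which is the symmetry property the clustering would need.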

ArneCJacobs commented on June 30, 2024

Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.

Would it make sense to only report on the order with the highest value (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the doubles and the results would become symmetric (which was an issue for the clustering?)

I don't really think that is the best option here, because as far as I can see they are perfectly symmetrical in which case we can just filter them out. If they aren't symmetrical it would make more sense to join both matches into one, so it contains everything from both so no information is lost. I'll test if they are symmetrical and reply back here.
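
As a rough illustration of the "join both matches into one" idea, the sketch below flips the matches reported for (b, a) into the (a, b) orientation and adds only those not already present. The Range and RangesTuple definitions here are simplified stand-ins, and mergeMirrored is a hypothetical helper.

```typescript
// Simplified stand-ins for the Range and RangesTuple types used in this thread.
interface Range {
  from: number;
  to: number;
}
type RangesTuple = [Range, Range];

function sameRange(a: Range, b: Range): boolean {
  return a.from === b.from && a.to === b.to;
}

/** Union of the (a, b) matches and the flipped (b, a) matches, without duplicates. */
function mergeMirrored(forward: RangesTuple[], backward: RangesTuple[]): RangesTuple[] {
  const merged: RangesTuple[] = [...forward];
  for (const [left, right] of backward) {
    // Express the (b, a) match in (a, b) orientation before comparing.
    const flipped: RangesTuple = [right, left];
    const known = merged.some(
      ([l, r]) => sameRange(l, flipped[0]) && sameRange(r, flipped[1]),
    );
    if (!known) merged.push(flipped);
  }
  return merged;
}

const merged = mergeMirrored(
  [[{ from: 1, to: 3 }, { from: 10, to: 12 }]],
  [
    [{ from: 10, to: 12 }, { from: 1, to: 3 }], // mirror of the forward match
    [{ from: 20, to: 22 }, { from: 5, to: 7 }], // only reported in one direction
  ],
);
console.log(merged.length); // 2
```

If the two directions are perfectly symmetrical, the merged list equals the forward list, so this covers both cases discussed above.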

ArneCJacobs commented on June 30, 2024

Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.

Would it make sense to only report on the order with the highest value (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the doubles and the results would become symmetric (which was an issue for the clustering?)

I don't really think that is the best option here, because as far as I can see they are perfectly symmetrical in which case we can just filter them out. If they aren't symmetrical it would make more sense to join both matches into one, so it contains everything from both so no information is lost. I'll test if they are symmetrical and reply back here.

The results seem to be symmetric, as far as I can tell from the test results.
I used node ./dist/app.js -l javascript -s 1 -o json samples/js/qlocktwo/*.js -g 0 -v 0 > temp.json to generate the results and the following code to test the symmetry.

code

import fs from "fs";
import path from "path";
import { JSONFormatter, JSONSummaryFormat } from "./lib/jsonFormatter";
import { RangesTuple } from "./lib/summary";
import { Range } from "./lib/range";

(async () => {
  const jsonResults: JSONSummaryFormat = JSON.parse(
    fs.readFileSync(path.resolve("temp.json"), "utf8"),
    JSONFormatter.JSONReviverFunction,
  );
  let reversed: number = 0;
  let normal: number = 0;
  const resultsMap: Map<string, RangesTuple[]> = new Map();
  for (const group of jsonResults.results) {
    for (let [file1, file2, matches] of group) {
      if (file1 < file2) {
        [file1, file2] = [file2, file1];
      }
      if (!resultsMap.has(file1 + file2)) {
        normal += 1;
        resultsMap.set(file1 + file2, matches);
      } else {
        reversed += 1;
        const otherMatches: RangesTuple[] = resultsMap.get(file1 + file2) as RangesTuple[];
        if (matches.length !== otherMatches.length || !areMatchesEqual(matches, otherMatches)) {
          console.log(`${file1}'s and ${file2}'s results aren't symmetrical`);
        }
      }
    }
  }
  console.log(normal, reversed);
})();

function areMatchesEqual(matches: RangesTuple[], mirroredMatches: RangesTuple[]): boolean {
  for (const match of matches.values()) {
    if (mirroredMatches.findIndex((potentialMatch) => areRangesTuplesMirroredEqual(match, potentialMatch)) === -1) {
      console.log(match);
      return false;
    }
  }
  return true;
}

function areRangesTuplesMirroredEqual([r11, r12]: RangesTuple, [r21, r22]: RangesTuple): boolean {
  return areRangesEqual(r11, r22) && areRangesEqual(r12, r21);
}

function areRangesEqual(range1: Range, range2: Range): boolean {
  return range1.from === range2.from && range1.to === range2.to;
}

ArneCJacobs commented on June 30, 2024

It seems that our algorithm can't handle the following code segment very well:

You should try to find out why. Start by looking at the raw results instead of the summary and see if we have matches for each of the lines. If not, try to play with the kmer length and filter strength.

After looking at the raw results, it seems that each entry in the array is matched against every other entry, causing them to exceed the maximum hash count or maximum hash percentage.
These are the results. I used -l javascript samples/js/qlocktwo/5348.js -m 10000 samples/js/qlocktwo/72.js -g 3 as options. I counted the occurrences of each line, so the first number is the line and the second is the number of occurrences. We are interested in lines 64 to 103.

results
[0, 5]
[2, 33]
[3, 8]
[4, 63]
[6, 20]
[8, 15]
[9, 51]
[10, 5]
[12, 39]
[13, 40]
[14, 8]
[15, 43]
[16, 26]
[21, 16]
[22, 2]
[23, 8]
[24, 124]
[26, 55]
[29, 38]
[30, 79]
[31, 2]
[33, 9]
[34, 6]
[35, 22]
[36, 22]
[37, 4]
[38, 162]
[39, 31]
[40, 57]
[41, 25]
[42, 67]
[46, 22]
[47, 4]
[48, 162]
[49, 37]
[50, 5]
[51, 59]
[52, 70]
[53, 36]
[57, 3]
[59, 3]
[62, 15]
[63, 8]
[65, 30]
[66, 30]
[67, 30]
[68, 30]
[69, 30]
[70, 30]
[71, 30]
[72, 30]
[73, 30]
[74, 2]
[75, 3]
[78, 8]
[80, 30]
[81, 30]
[82, 30]
[83, 30]
[84, 30]
[85, 30]
[86, 30]
[87, 30]
[88, 30]
[89, 30]
[90, 30]
[91, 30]
[92, 30]
[93, 30]
[94, 30]
[95, 30]
[96, 30]
[97, 30]
[98, 30]
[99, 30]
[100, 30]
[101, 2]
[102, 2]
[104, 2]
[105, 6]
[106, 126]
[107, 4]
[108, 550]
[109, 4]
[110, 159]
[111, 4]
[112, 256]
[113, 4]
[114, 254]
[115, 1]
[117, 133]
[118, 144]
[119, 12]
[120, 11]
[123, 121]
[125, 101]
[126, 19]
[128, 54]
[130, 126]
[131, 101]
[133, 105]
[135, 14]
[136, 28]
[137, 2]
[138, 26]
[142, 17]
[143, 10]
[144, 12]
[145, 73]
[146, 75]
[147, 77]
[148, 60]
[150, 6]
[151, 6]
[152, 4]
[153, 3]
[154, 16]
[155, 45]
[156, 78]
[157, 30]
[158, 55]
[159, 53]
[160, 1]
[161, 58]
[162, 39]
[163, 78]
[164, 30]
[165, 55]
[166, 52]
[167, 1]
[172, 8]
[173, 8]
[174, 4]
[175, 5]
[176, 37]
[177, 2]
[179, 35]
[180, 24]
[181, 53]
[182, 66]
[183, 5]
[184, 6]
[185, 54]
[186, 68]
[187, 7]
[188, 81]
[189, 14]
[190, 14]
[191, 34]
[192, 29]
[193, 13]
[194, 11]
[195, 9]
[199, 30]
[200, 67]
[201, 2]
[203, 29]
[204, 13]
[205, 11]
[206, 12]
[208, 7]
[209, 81]
[210, 14]
[211, 14]
[212, 34]
[213, 29]
[214, 13]
[215, 11]
[216, 10]
[220, 53]
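
A listing like the one above could be produced along the following lines. This is a sketch that assumes the raw matches are available as [line in first file, line in second file] pairs, as in the Array<[number, number]> type used elsewhere in this thread; countLineOccurrences is a hypothetical helper, not the script that was actually used.

```typescript
/**
 * Count how often each line of the first file appears in a list of
 * [lineInFirstFile, lineInSecondFile] matches, sorted by line number.
 */
function countLineOccurrences(
  matchingLines: Array<[number, number]>,
): Array<[number, number]> {
  const counts = new Map<number, number>();
  for (const [line] of matchingLines) {
    counts.set(line, (counts.get(line) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort(([a], [b]) => a - b);
}

// Toy input: line 65 matches two lines in the other file, line 66 matches one.
const occurrences = countLineOccurrences([
  [66, 80],
  [65, 80],
  [65, 81],
]);
console.log(occurrences); // [ [ 65, 2 ], [ 66, 1 ] ]
```

A line whose count equals the number of similar entries in the other file, like the repeated value of 30 above, is a sign that it matches every one of them.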

ArneCJacobs commented on June 30, 2024

Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.

Would it make sense to only report on the order with the highest value (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the doubles and the results would become symmetric (which was an issue for the clustering?)

I don't really think that is the best option here, because as far as I can see they are perfectly symmetrical in which case we can just filter them out. If they aren't symmetrical it would make more sense to join both matches into one, so it contains everything from both so no information is lost. I'll test if they are symmetrical and reply back here.

The results seem to be symmetric, as far as I can tell from the test results.
I used node ./dist/app.js -l javascript -s 1 -o json samples/js/qlocktwo/*.js -g 0 -v 0 > temp.json to generate the results and the following code to test the symmetry.
code

After some more testing, I found that the results are not at all symmetrical. My last test worked with filtered output, which caused the asymmetrical results to be removed. That said, most of the asymmetrical results seem to be bad matches.

new code

import fs from "fs";
import path from "path";
import { CodeTokenizer } from "./lib/codeTokenizer";
import { Comparison, Matches } from "./lib/comparison";
import { JSONFormatter, JSONSummaryFormat } from "./lib/jsonFormatter";
import { Range } from "./lib/range";
import { FilterOptions, RangesTuple, Summary } from "./lib/summary";

(async () => {
  const mapLocation: string = path.resolve("./samples/js/qlocktwo/");
  const locations: string[] = fs
    .readdirSync(mapLocation, "utf8")
    .map(location => `${mapLocation}/${location}`);

  const tokenizer = new CodeTokenizer("javascript");

  const comparison = new Comparison(tokenizer, {
    filterHashByPercentage: undefined,
    maxHash: 200,
  });
  comparison.addFiles(locations);

  const matchesPerFile: Map<string, Matches<number>> = await comparison.compareFiles(locations);

  const filterOptions: FilterOptions = {
    minimumFragmentLength: 1,
  };
  const summary = new Summary(matchesPerFile, 0, filterOptions, 0);
  () => summary;

  const jsonResults: JSONSummaryFormat = JSON.parse(
    fs.readFileSync(path.resolve("temp.json"), "utf8"),
    JSONFormatter.JSONReviverFunction,
  );

  testRawResults(matchesPerFile);
  () => testSummaryResults(jsonResults.results); //TODO
})();

function testRawResults(results: Map<string, Matches<number>>): void {
  let reversed: number = 0;
  let normal: number = 0;
  const resultsMap: Map<string, Array<[number, number]>> = new Map();
  for (const [file1, matches] of results.entries()) {
    for (const [file2, matchingLines] of matches.entries()) {
      // Order the pair in local variables so the outer loop's file1 is not
      // overwritten by the swap for later iterations of the inner loop.
      const [first, second] = file1 < file2 ? [file2, file1] : [file1, file2];
      const key = first + second;
      if (!resultsMap.has(key)) {
        normal += 1;
        resultsMap.set(key, matchingLines);
      } else {
        reversed += 1;
        const otherMatches = resultsMap.get(key) as Array<[number, number]>;
        if (!areMatchingLinesEqual(matchingLines, otherMatches)) {
          console.log(
            `${first}'s (${matchingLines.length}) and ${second}'s (${otherMatches.length}) results aren't symmetrical`,
          );
        }
      }
    }
  }
  // Print the totals once, after all pairs have been visited.
  console.log(normal, reversed);
}

function areMatchingLinesEqual(
  lines1: Array<[number, number]>,
  lines2: Array<[number, number]>,
): boolean {
  for (const [l11, l12] of lines1.values()) {
    if (lines2.findIndex(([l21, l22]) => l11 === l22 && l12 === l21) === -1) {
      console.log(lines1);
      return false;
    }
  }
  return true;
}

function testSummaryResults(results: Array<Array<[string, string, RangesTuple[]]>>): void {
  let reversed: number = 0;
  let normal: number = 0;
  const resultsMap: Map<string, RangesTuple[]> = new Map();
  for (const group of results) {
    for (let [file1, file2, matches] of group) {
      if (file1 < file2) {
        [file1, file2] = [file2, file1];
      }
      if (!resultsMap.has(file1 + file2)) {
        normal += 1;
        resultsMap.set(file1 + file2, matches);
      } else {
        reversed += 1;
        const otherMatches: RangesTuple[] = resultsMap.get(file1 + file2) as RangesTuple[];
        if (!areMatchesEqual(matches, otherMatches)) {
          console.log(
            `${file1}'s (${matches.length}) and ${file2}'s (${otherMatches.length}) results aren't symmetrical`,
          );
        }
      }
    }
  }
  console.log(normal, reversed);
}

function areMatchesEqual(matches: RangesTuple[], mirroredMatches: RangesTuple[]): boolean {
  for (const match of matches.values()) {
    if (
      mirroredMatches.findIndex(potentialMatch =>
        areRangesTuplesMirroredEqual(match, potentialMatch),
      ) === -1
    ) {
      console.log(match);
      return false;
    }
  }
  return true;
}

function areRangesTuplesMirroredEqual([r11, r12]: RangesTuple, [r21, r22]: RangesTuple): boolean {
  return areRangesEqual(r11, r22) && areRangesEqual(r12, r21);
}

function areRangesEqual(range1: Range, range2: Range): boolean {
  return range1.from === range2.from && range1.to === range2.to;
}

rien commented on June 30, 2024

This has been done in the upcoming publication.
