Compare with MOSS results about dolos HOT 15 CLOSED

dodona-edu commented on June 30, 2024

Compare with MOSS results


Comments (15)

ArneCJacobs commented on June 30, 2024 1

At the time of commit 7a6fd96, all the dolos flags match the moss flags, so from now on both applications are run with the same flags.


ArneCJacobs commented on June 30, 2024 1

Yeah sure. At the time of commit e7c8b14, running dolos with the same files as in my previous comment takes roughly 2s.


bmesuere commented on June 30, 2024 1

It seems that our algorithm can't handle the following code segment very well:

You should try to find out why. Start by looking at the raw results instead of the summary and see if we have matches for each of the lines. If not, try to play with the kmer length and filter strength.
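
A toy illustration of how the k-mer length interacts with repetitive code, in case it helps when experimenting. This is only a sketch: the token names and the helpers `kgramCounts` and `repeatedPositions` are hypothetical and are not taken from the Dolos code base. It counts how many k-mer positions in a token stream hold a k-mer that also occurs elsewhere in the same stream.

```typescript
// Count how often each k-mer (k consecutive tokens) occurs in a token stream.
function kgramCounts(tokens: string[], k: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + k <= tokens.length; i++) {
    const key = tokens.slice(i, i + k).join(" ");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Number of positions whose k-mer occurs more than once in the stream.
function repeatedPositions(tokens: string[], k: number): number {
  let repeated = 0;
  kgramCounts(tokens, k).forEach((count) => {
    if (count > 1) {
      repeated += count;
    }
  });
  return repeated;
}

// Hypothetical token stream: a declaration followed by 12 structurally
// identical entries, similar to an object literal with 12 string properties.
const entry: string[] = ["string", ":", "string", ","];
const tokens: string[] = ["let", "identifier", "=", "{"];
for (let i = 0; i < 12; i++) {
  tokens.push(...entry);
}
tokens.push("}", ";");

// Print the number of repeated positions for a few k-mer lengths.
for (const k of [4, 8, 16, 32]) {
  console.log(k, repeatedPositions(tokens, k));
}
```

In this toy input, longer k-mers leave fewer positions that repeat, which is the kind of effect to look for in the raw results when varying the k-mer length.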


ArneCJacobs commented on June 30, 2024

At the time of commit 9661fc5 the highlights mostly match, except for two notable differences:

The scores given by moss do not match the scores from dolos.
When comparing copied_function.js and sample.js located in samples/js/, both moss and dolos match the same lines, except that dolos does not include the return statement.

Moss result. The minimumLines constant in summary.ts was set to 2.


ArneCJacobs commented on June 30, 2024

This compares the -d flag.
Both applications were run with -l javascript -m 10 -d $(find ./samples/js/assignment1/ -type f). The structure of the dummy files is:

assignment1/
├── student1
│   ├── main.js
│   ├── sample.js
│   └── subDirectory
│       └── childClass.js
├── student2
│   ├── copied_function.js
│   ├── helperClasses
│   │   └── childClass.js
│   └── main.js
└── student3
    ├── another_copied_function.js
    ├── main.js
    └── tempName
        ├── childClass.js
        ├── hello.js
        └── subDir
            └── subsubClass.js

Everything except sample.js, copied_function.js and another_copied_function.js is empty. Both copied_function.js and another_copied_function.js contain lines from sample.js.
The first three results from moss have a corresponding match within the dolos results. After that the moss results become very odd, which I can only assume is a bug of some sort. Another noticeable difference lies in the sorting, but this is intentional (described in #12).


ArneCJacobs commented on June 30, 2024

Both programs were run with -l javascript -m 10. When comparing 51 real exercises from students, moss takes ~11s while dolos takes ~26s. The output of dolos can't really be compared, as it contains 31875 lines.


bmesuere commented on June 30, 2024

Can you post an updated runtime with the new code? Be sure to only run the code and not to include the tsc compile step.


ArneCJacobs commented on June 30, 2024

It seems that our algorithm can't handle the following code segment very well:

code block

        let minuten = {
            "00": "HET IS",
            "05": "HET IS VIJF OVER",
            "10": "HET IS TIEN OVER",
            "15": "HET IS KWART OVER",
            "20": "HET IS TIEN VOOR HALF",
            "25": "HET IS VIJF VOOR HALF",
            "30": "HET IS HALF",
            "35": "HET IS VIJF OVER HALF",
            "40": "HET IS TIEN OVER HALF",
            "45": "HET IS KWART VOOR",
            "50": "HET IS TIEN VOOR",
            "55": "HET IS VIJF VOOR"
        };

        let uren = {
            "01": "EEN",
            "02": "TWEE",
            "03": "DRIE",
            "04": "VIER",
            "05": "VIJF",
            "06": "ZES",
            "07": "ZEVEN",
            "08": "ACHT",
            "09": "NEGEN",
            "10": "TIEN",
            "11": "ELF",
            "12": "TWAALF",
            "13": "EEN",
            "14": "TWEE", 
            "15": "DRIE",
            "16": "VIER",
            "17": "VIJF",
            "18": "ZES",
            "19": "ZEVEN",
            "20": "ACHT",
            "21": "NEGEN",
            "22": "TIEN",
            "23": "ELF",
            "00": "TWAALF"
        };

Dolos gets confused and splits it up in many small parts while Moss seems to handle it mostly fine.

Dolos Results

Moss results.


ArneCJacobs commented on June 30, 2024

Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.

The options used were: -l javascript -m 10 -M 0.9 -c 'a happy comment' -s 1 -g 7 -o html -v 0, on a series of about 50 files.


bmesuere commented on June 30, 2024

Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.

Would it make sense to only report on the order with the highest value (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the doubles and the results would become symmetric (which was an issue for the clustering?)
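
A minimal sketch of that idea, assuming each comparison is available as a scored pair. The `ScoredPair` shape and the helper names are hypothetical and do not reflect the Dolos data model. For every unordered pair of files, only the ordering with the highest score is kept.

```typescript
// Hypothetical shape of one scored comparison; not the Dolos data model.
interface ScoredPair {
  fileA: string;
  fileB: string;
  score: number;
}

// Order-independent key for a pair of files. JSON.stringify avoids the
// collisions that plain concatenation allows ("ab" + "c" vs "a" + "bc").
function pairKey(a: string, b: string): string {
  return JSON.stringify(a < b ? [a, b] : [b, a]);
}

// Keep, for every unordered pair, only the entry with the highest score,
// i.e. max(score(a, b), score(b, a)).
function keepHighestOrdering(pairs: ScoredPair[]): ScoredPair[] {
  const best = new Map<string, ScoredPair>();
  for (const pair of pairs) {
    const key = pairKey(pair.fileA, pair.fileB);
    const current = best.get(key);
    if (current === undefined || pair.score > current.score) {
      best.set(key, pair);
    }
  }
  return Array.from(best.values());
}
```

The result contains at most one entry per unordered pair, so it is symmetric by construction, at the cost of discarding the lower-scoring ordering.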


ArneCJacobs commented on June 30, 2024

Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.

Would it make sense to only report on the order with the highest value (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the doubles and the results would become symmetric (which was an issue for the clustering?)

I don't really think that is the best option here, because as far as I can see they are perfectly symmetrical in which case we can just filter them out. If they aren't symmetrical it would make more sense to join both matches into one, so it contains everything from both so no information is lost. I'll test if they are symmetrical and reply back here.
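
Joining both directions could look roughly like the sketch below. It assumes a match is a pair of line ranges with `from` and `to` fields, as in the test code in this thread; the type names `LineRange` and `RangePair` and the helper `mergeMirrored` are hypothetical stand-ins, not the actual Dolos types.

```typescript
// Hypothetical stand-ins for the range types used in this thread.
interface LineRange {
  from: number;
  to: number;
}
type RangePair = [LineRange, LineRange];

// Key that identifies a pair of ranges by its boundaries.
function rangeKey([a, b]: RangePair): string {
  return `${a.from}-${a.to}|${b.from}-${b.to}`;
}

// Merge the matches reported for (a, b) with those reported for (b, a).
// Reversed matches are mirrored first so that everything ends up in the
// (a, b) orientation; exact duplicates are kept only once.
function mergeMirrored(forward: RangePair[], reversed: RangePair[]): RangePair[] {
  const merged = new Map<string, RangePair>();
  for (const pair of forward) {
    merged.set(rangeKey(pair), pair);
  }
  for (const [rangeInB, rangeInA] of reversed) {
    const mirrored: RangePair = [rangeInA, rangeInB];
    const key = rangeKey(mirrored);
    if (!merged.has(key)) {
      merged.set(key, mirrored);
    }
  }
  return Array.from(merged.values());
}
```

If the two directions are perfectly symmetrical, the merged list equals the forward list; otherwise it contains the union, so no match is lost.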


ArneCJacobs commented on June 30, 2024

Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.

Would it make sense to only report on the order with the highest value (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the doubles and the results would become symmetric (which was an issue for the clustering?)

I don't really think that is the best option here, because as far as I can see they are perfectly symmetrical in which case we can just filter them out. If they aren't symmetrical it would make more sense to join both matches into one, so it contains everything from both so no information is lost. I'll test if they are symmetrical and reply back here.

The results seem to be symmetric, as far as I can tell from the test results.
I used node ./dist/app.js -l javascript -s 1 -o json samples/js/qlocktwo/*.js -g 0 -v 0 > temp.json to generate the results and the following code to test the symmetry.

code

import fs from "fs";
import path from "path";
import { JSONFormatter, JSONSummaryFormat } from "./lib/jsonFormatter";
import { RangesTuple } from "./lib/summary";
import { Range } from "./lib/range"

(async () => {
  const jsonResults: JSONSummaryFormat = JSON.parse(
    fs.readFileSync(path.resolve("temp.json"), "utf8"),
    JSONFormatter.JSONReviverFunction,
  );
  let reversed: number = 0;
  let normal: number = 0;
  const resultsMap: Map<string, RangesTuple[]> = new Map();
  for (const group of jsonResults.results) {
    for (let [file1, file2, matches] of group) {
      if (file1 < file2) {
        [file1, file2] = [file2, file1];
      }
      if (!resultsMap.has(file1 + file2)) {
        normal += 1;
        resultsMap.set(file1 + file2, matches);
      } else {
        reversed += 1;
        const otherMatches: RangesTuple[] = resultsMap.get(file1 + file2) as RangesTuple[];
        if (matches.length !== otherMatches.length || !areMatchesEqual(matches, otherMatches)) {
          console.log(`${file1}'s and ${file2}'s results aren't symmetrical`);
        }
      }
    }
  }
  console.log(normal, reversed);
})();

function areMatchesEqual(matches: RangesTuple[], mirroredMatches: RangesTuple[]): boolean {
    for(const match of matches.values()) {
        if(mirroredMatches.findIndex((potentialMatch) => areRangesTuplesMirroredEqual(match, potentialMatch)) === -1) {
            console.log(match);
            return false;
        }
    }
    return true;
}

function areRangesTuplesMirroredEqual([r11, r12]: RangesTuple, [r21, r22]: RangesTuple): boolean {
    return areRangesEqual(r11, r22) && areRangesEqual(r12, r21);
}

function areRangesEqual(range1: Range, range2: Range): boolean {
    return range1.from === range2.from && range1.to === range2.to;
}


ArneCJacobs commented on June 30, 2024

It seems that our algorithm can't handle the following code segment very well:

You should try to find out why. Start by looking at the raw results instead of the summary and see if we have matches for each of the lines. If not, try to play with the kmer length and filter strength.

After looking at the raw results, it seems that each entry in the array is matched against every other entry, causing them to exceed the maximum hash count or maximum hash percentage.
These are the results. I used -l javascript samples/js/qlocktwo/5348.js -m 10000 samples/js/qlocktwo/72.js -g 3 as options and counted the occurrences of each line: the first number is the line, the second the number of occurrences. We are interested in lines 64 to 103.

results

[0, 5]
[2, 33]
[3, 8]
[4, 63]
[6, 20]
[8, 15]
[9, 51]
[10, 5]
[12, 39]
[13, 40]
[14, 8]
[15, 43]
[16, 26]
[21, 16]
[22, 2]
[23, 8]
[24, 124]
[26, 55]
[29, 38]
[30, 79]
[31, 2]
[33, 9]
[34, 6]
[35, 22]
[36, 22]
[37, 4]
[38, 162]
[39, 31]
[40, 57]
[41, 25]
[42, 67]
[46, 22]
[47, 4]
[48, 162]
[49, 37]
[50, 5]
[51, 59]
[52, 70]
[53, 36]
[57, 3]
[59, 3]
[62, 15]
[63, 8]
[65, 30]
[66, 30]
[67, 30]
[68, 30]
[69, 30]
[70, 30]
[71, 30]
[72, 30]
[73, 30]
[74, 2]
[75, 3]
[78, 8]
[80, 30]
[81, 30]
[82, 30]
[83, 30]
[84, 30]
[85, 30]
[86, 30]
[87, 30]
[88, 30]
[89, 30]
[90, 30]
[91, 30]
[92, 30]
[93, 30]
[94, 30]
[95, 30]
[96, 30]
[97, 30]
[98, 30]
[99, 30]
[100, 30]
[101, 2]
[102, 2]
[104, 2]
[105, 6]
[106, 126]
[107, 4]
[108, 550]
[109, 4]
[110, 159]
[111, 4]
[112, 256]
[113, 4]
[114, 254]
[115, 1]
[117, 133]
[118, 144]
[119, 12]
[120, 11]
[123, 121]
[125, 101]
[126, 19]
[128, 54]
[130, 126]
[131, 101]
[133, 105]
[135, 14]
[136, 28]
[137, 2]
[138, 26]
[142, 17]
[143, 10]
[144, 12]
[145, 73]
[146, 75]
[147, 77]
[148, 60]
[150, 6]
[151, 6]
[152, 4]
[153, 3]
[154, 16]
[155, 45]
[156, 78]
[157, 30]
[158, 55]
[159, 53]
[160, 1]
[161, 58]
[162, 39]
[163, 78]
[164, 30]
[165, 55]
[166, 52]
[167, 1]
[172, 8]
[173, 8]
[174, 4]
[175, 5]
[176, 37]
[177, 2]
[179, 35]
[180, 24]
[181, 53]
[182, 66]
[183, 5]
[184, 6]
[185, 54]
[186, 68]
[187, 7]
[188, 81]
[189, 14]
[190, 14]
[191, 34]
[192, 29]
[193, 13]
[194, 11]
[195, 9]
[199, 30]
[200, 67]
[201, 2]
[203, 29]
[204, 13]
[205, 11]
[206, 12]
[208, 7]
[209, 81]
[210, 14]
[211, 14]
[212, 34]
[213, 29]
[214, 13]
[215, 11]
[216, 10]
[220, 53]
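
For reference, per-line counts like the ones above can be produced with a few lines of TypeScript. This is a sketch that assumes the raw results are available as `[lineInFirstFile, lineInSecondFile]` pairs; the function name `countLineOccurrences` is made up for this example.

```typescript
// Count how often each line of the first file appears in a list of
// [lineInFirstFile, lineInSecondFile] pairs. Returns [line, occurrences]
// entries sorted by line number.
function countLineOccurrences(pairs: Array<[number, number]>): Array<[number, number]> {
  const counts = new Map<number, number>();
  for (const [line] of pairs) {
    counts.set(line, (counts.get(line) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort(([a], [b]) => a - b);
}
```

A line that is matched against many different lines of the other file shows up with a high count, which is the pattern visible for lines 65 to 100.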


ArneCJacobs commented on June 30, 2024

Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.

Would it make sense to only report on the order with the highest value (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the doubles and the results would become symmetric (which was an issue for the clustering?)

I don't really think that is the best option here, because as far as I can see they are perfectly symmetrical in which case we can just filter them out. If they aren't symmetrical it would make more sense to join both matches into one, so it contains everything from both so no information is lost. I'll test if they are symmetrical and reply back here.

The results seem to be symmetric, as far as I can tell from the test results.
I used node ./dist/app.js -l javascript -s 1 -o json samples/js/qlocktwo/*.js -g 0 -v 0 > temp.json to generate the results and the following code to test the symmetry.
code

After some more testing I found that the results are not at all symmetrical. My previous test was flawed: it worked with filtered output, which caused the asymmetrical results to be removed. That said, most of the asymmetrical results seem to be bad matches.

new code

import fs from "fs";
import path from "path";
import { CodeTokenizer } from "./lib/codeTokenizer";
import { Comparison, Matches } from "./lib/comparison";
import { JSONFormatter, JSONSummaryFormat } from "./lib/jsonFormatter";
import { Range } from "./lib/range";
import { FilterOptions, RangesTuple, Summary } from "./lib/summary";

(async () => {
  const mapLocation: string = path.resolve("./samples/js/qlocktwo/");
  const locations: string[] = fs
    .readdirSync(mapLocation, "utf8")
    .map(location => `${mapLocation}/${location}`);

  const tokenizer = new CodeTokenizer("javascript");

  const comparison = new Comparison(tokenizer, {
    filterHashByPercentage: undefined,
    maxHash: 200,
  });
  comparison.addFiles(locations);

  const matchesPerFile: Map<string, Matches<number>> = await comparison.compareFiles(locations);

  const filterOptions: FilterOptions = {
    minimumFragmentLength: 1,
  };
  const summary = new Summary(matchesPerFile, 0, filterOptions, 0);
  // The summary is not checked yet; this no-op only keeps the unused variable from being flagged.
  () => summary;

  const jsonResults: JSONSummaryFormat = JSON.parse(
    fs.readFileSync(path.resolve("temp.json"), "utf8"),
    JSONFormatter.JSONReviverFunction,
  );

  testRawResults(matchesPerFile);
  () => testSummaryResults(jsonResults.results); //TODO
})();

function testRawResults(results: Map<string, Matches<number>>): void {
  let reversed: number = 0;
  let normal: number = 0;
  const resultsMap: Map<string, Array<[number, number]>> = new Map();
  for (let [file1, matches] of results.entries()) {
    for (let [file2, matchingLines] of matches.entries()) {
      if (file1 < file2) {
        [file1, file2] = [file2, file1];
      }
      if (!resultsMap.has(file1 + file2)) {
        normal += 1;
        resultsMap.set(file1 + file2, matchingLines);
      } else {
        reversed += 1;
        const otherMatches: Array<[number, number]> = resultsMap.get(file1 + file2) as Array<[
          number,
          number,
        ]>;
        if (!areMatchingLinesEqual(matchingLines, otherMatches)) {
          console.log(
            `${file1}'s (${matchingLines.length}) and ${file2}'s (${otherMatches.length}) results aren't symmetrical`,
          );
        }
      }
    }
  }
  // Print the totals once, after all files have been processed.
  console.log(normal, reversed);
}

function areMatchingLinesEqual(
  lines1: Array<[number, number]>,
  lines2: Array<[number, number]>,
): boolean {
  for (const [l11, l12] of lines1.values()) {
    if (lines2.findIndex(([l21, l22]) => l11 === l22 && l12 === l21) === -1) {
      console.log(lines1);
      return false;
    }
  }
  return true;
}

function testSummaryResults(results: Array<Array<[string, string, RangesTuple[]]>>): void {
  let reversed: number = 0;
  let normal: number = 0;
  const resultsMap: Map<string, RangesTuple[]> = new Map();
  for (const group of results) {
    for (let [file1, file2, matches] of group) {
      if (file1 < file2) {
        [file1, file2] = [file2, file1];
      }
      if (!resultsMap.has(file1 + file2)) {
        normal += 1;
        resultsMap.set(file1 + file2, matches);
      } else {
        reversed += 1;
        const otherMatches: RangesTuple[] = resultsMap.get(file1 + file2) as RangesTuple[];
        if (!areMatchesEqual(matches, otherMatches)) {
          console.log(
            `${file1}'s (${matches.length}) and ${file2}'s (${otherMatches.length}) results aren't symmetrical`,
          );
        }
      }
    }
  }
  console.log(normal, reversed);
}

function areMatchesEqual(matches: RangesTuple[], mirroredMatches: RangesTuple[]): boolean {
  for (const match of matches.values()) {
    if (
      mirroredMatches.findIndex(potentialMatch =>
        areRangesTuplesMirroredEqual(match, potentialMatch),
      ) === -1
    ) {
      console.log(match);
      return false;
    }
  }
  return true;
}

function areRangesTuplesMirroredEqual([r11, r12]: RangesTuple, [r21, r22]: RangesTuple): boolean {
  return areRangesEqual(r11, r22) && areRangesEqual(r12, r21);
}

function areRangesEqual(range1: Range, range2: Range): boolean {
  return range1.from === range2.from && range1.to === range2.to;
}


rien commented on June 30, 2024

This has been done in the upcoming publication.
