Comments (15)
At the time of commit 7a6fd96, all the dolos flags match the moss flags, so from now on the same flags are used for both applications.
from dolos.
Yeah sure, at the time of commit e7c8b14 running dolos with the same files as in my previous comment takes roughly 2s.
from dolos.
It seems that our algorithm can't handle the following code segment very well:
You should try to find out why. Start by looking at the raw results instead of the summary and see if we have matches for each of the lines. If not, try to play with the kmer length and filter strength.
from dolos.
At the time of commit 9661fc5 the highlights mostly match, except for two notable differences:
- The scores given by moss do not match the scores from dolos.
- When comparing copied_function.js and sample.js, located in samples/js/, both moss and dolos match the same lines, except that dolos does not include the return statement.
Moss result. The minimumLines constant in summary.ts was set to 2.
from dolos.
This compares the -d flag.
Both applications are run with -l javascript -m 10 -d $(find ./samples/js/assignment1/ -type f). The structure of the dummy files is:
assignment1/
├── student1
│   ├── main.js
│   ├── sample.js
│   └── subDirectory
│       └── childClass.js
├── student2
│   ├── copied_function.js
│   ├── helperClasses
│   │   └── childClass.js
│   └── main.js
└── student3
    ├── another_copied_function.js
    ├── main.js
    └── tempName
        ├── childClass.js
        ├── hello.js
        └── subDir
            └── subsubClass.js
Where everything except sample.js, copied_function.js and another_copied_function.js is empty. Both copied_function.js and another_copied_function.js contain lines from sample.js.
Here the first three results from moss have a corresponding match within the dolos results. After that the moss results become very odd, which I can only assume is a bug of sorts. Another noticeable difference lies in the sorting, but this is intentional (described in #12).
from dolos.
Both programs were run with -l javascript -m 10. When comparing 51 real exercises from students, moss takes ~11s while dolos takes ~26s. The output of dolos can't really be compared, as it contains 31875 lines.
from dolos.
Can you post an updated runtime with the new code? Be sure to only run the code and not to include the tsc compile step.
from dolos.
It seems that our algorithm can't handle the following code segment very well:
code block
let minuten = {
  "00": "HET IS",
  "05": "HET IS VIJF OVER",
  "10": "HET IS TIEN OVER",
  "15": "HET IS KWART OVER",
  "20": "HET IS TIEN VOOR HALF",
  "25": "HET IS VIJF VOOR HALF",
  "30": "HET IS HALF",
  "35": "HET IS VIJF OVER HALF",
  "40": "HET IS TIEN OVER HALF",
  "45": "HET IS KWART VOOR",
  "50": "HET IS TIEN VOOR",
  "55": "HET IS VIJF VOOR"
};
let uren = {
  "01": "EEN",
  "02": "TWEE",
  "03": "DRIE",
  "04": "VIER",
  "05": "VIJF",
  "06": "ZES",
  "07": "ZEVEN",
  "08": "ACHT",
  "09": "NEGEN",
  "10": "TIEN",
  "11": "ELF",
  "12": "TWAALF",
  "13": "EEN",
  "14": "TWEE",
  "15": "DRIE",
  "16": "VIER",
  "17": "VIJF",
  "18": "ZES",
  "19": "ZEVEN",
  "20": "ACHT",
  "21": "NEGEN",
  "22": "TIEN",
  "23": "ELF",
  "00": "TWAALF"
};
Dolos gets confused and splits it up into many small parts, while Moss seems to handle it mostly fine.
from dolos.
Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.
The options that were used are: -l javascript -m 10 -M 0.9 -c 'a happy comment' -s 1 -g 7 -o html -v 0 on a series of about 50 files.
from dolos.
Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.
Would it make sense to only report on the order with the highest value (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the doubles and the results would become symmetric (which was an issue for the clustering?)
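To make the max(score(a,b), score(b,a)) idea concrete, here is a minimal sketch. The PairScore shape and the symmetrize name are assumptions made for illustration; they are not the types dolos uses.

```typescript
// Hypothetical shape of one reported pair; not the actual dolos types.
interface PairScore {
  fileA: string;
  fileB: string;
  score: number;
}

// Collapse (a, b) and (b, a) into a single entry that keeps the higher score.
function symmetrize(pairs: PairScore[]): PairScore[] {
  const best = new Map<string, PairScore>();
  for (const pair of pairs) {
    // Order the two names so both directions share one key.
    const ordered =
      pair.fileA <= pair.fileB ? [pair.fileA, pair.fileB] : [pair.fileB, pair.fileA];
    // JSON.stringify avoids the collisions that plain string concatenation can cause.
    const key = JSON.stringify(ordered);
    const seen = best.get(key);
    if (seen === undefined || pair.score > seen.score) {
      best.set(key, { fileA: ordered[0], fileB: ordered[1], score: pair.score });
    }
  }
  return Array.from(best.values());
}
```

Each unordered pair then appears once, so the result is symmetric by construction. The cost is that the lower-scoring direction is discarded.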
from dolos.
Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.
Would it make sense to only report on the order with the highest value (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the doubles and the results would become symmetric (which was an issue for the clustering?)
I don't really think that is the best option here, because as far as I can see they are perfectly symmetrical in which case we can just filter them out. If they aren't symmetrical it would make more sense to join both matches into one, so it contains everything from both so no information is lost. I'll test if they are symmetrical and reply back here.
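A rough sketch of what joining both directions could look like, assuming each match is a pair of line ranges. LineRange, RangePair and mergeMirrored are stand-in names for this example, not the real dolos types.

```typescript
// Stand-in types for this sketch only.
interface LineRange {
  from: number;
  to: number;
}
type RangePair = [LineRange, LineRange];

function rangeKey(range: LineRange): string {
  return `${range.from}-${range.to}`;
}

// forward holds the matches reported for (a, b), backward those for (b, a).
// The result is the union, expressed in (a, b) order, without duplicates.
function mergeMirrored(forward: RangePair[], backward: RangePair[]): RangePair[] {
  const merged = new Map<string, RangePair>();
  for (const [inA, inB] of forward) {
    merged.set(`${rangeKey(inA)}|${rangeKey(inB)}`, [inA, inB]);
  }
  for (const [inB, inA] of backward) {
    // Flip the mirrored tuple back into (a, b) order before comparing keys.
    const key = `${rangeKey(inA)}|${rangeKey(inB)}`;
    if (!merged.has(key)) {
      merged.set(key, [inA, inB]);
    }
  }
  return Array.from(merged.values());
}
```

If the two directions are perfectly symmetrical the output equals the forward list, so this also covers the simple filtering case.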
from dolos.
Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.
Would it make sense to only report on the order with the highest value (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the doubles and the results would become symmetric (which was an issue for the clustering?)
I don't really think that is the best option here, because as far as I can see they are perfectly symmetrical in which case we can just filter them out. If they aren't symmetrical it would make more sense to join both matches into one, so it contains everything from both so no information is lost. I'll test if they are symmetrical and reply back here.
The results seem to be symmetric, as far as I can tell from the test results.
I used node ./dist/app.js -l javascript -s 1 -o json samples/js/qlocktwo/*.js -g 0 -v 0 > temp.json to generate the results and the following code to test the symmetry.
code
import fs from "fs";
import path from "path";
import { JSONFormatter, JSONSummaryFormat } from "./lib/jsonFormatter";
import { RangesTuple } from "./lib/summary";
import { Range } from "./lib/range";

(async () => {
  const jsonResults: JSONSummaryFormat = JSON.parse(
    fs.readFileSync(path.resolve("temp.json"), "utf8"),
    JSONFormatter.JSONReviverFunction,
  );
  let reversed: number = 0;
  let normal: number = 0;
  const resultsMap: Map<string, RangesTuple[]> = new Map();
  for (const group of jsonResults.results) {
    for (let [file1, file2, matches] of group) {
      // Order the pair so that (a, b) and (b, a) share one map key.
      if (file1 < file2) {
        [file1, file2] = [file2, file1];
      }
      if (!resultsMap.has(file1 + file2)) {
        normal += 1;
        resultsMap.set(file1 + file2, matches);
      } else {
        // Second sighting of the pair: compare it with the stored direction.
        reversed += 1;
        const otherMatches: RangesTuple[] = resultsMap.get(file1 + file2) as RangesTuple[];
        if (matches.length !== otherMatches.length || !areMatchesEqual(matches, otherMatches)) {
          console.log(`${file1}'s and ${file2}'s results aren't symmetrical`);
        }
      }
    }
  }
  console.log(normal, reversed);
})();

// True when every match has a mirrored counterpart in mirroredMatches.
function areMatchesEqual(matches: RangesTuple[], mirroredMatches: RangesTuple[]): boolean {
  for (const match of matches.values()) {
    if (mirroredMatches.findIndex((potentialMatch) => areRangesTuplesMirroredEqual(match, potentialMatch)) === -1) {
      console.log(match);
      return false;
    }
  }
  return true;
}

function areRangesTuplesMirroredEqual([r11, r12]: RangesTuple, [r21, r22]: RangesTuple): boolean {
  return areRangesEqual(r11, r22) && areRangesEqual(r12, r21);
}

function areRangesEqual(range1: Range, range2: Range): boolean {
  return range1.from === range2.from && range1.to === range2.to;
}
from dolos.
It seems that our algorithm can't handle the following code segment very well:
You should try to find out why. Start by looking at the raw results instead of the summary and see if we have matches for each of the lines. If not, try to play with the kmer length and filter strength.
After looking at the raw results it seems that each entry in the array is matched against every other entry, causing them to exceed the maximum hash count or maximum hash percentage.
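To illustrate why a table like this produces so many cross-matches, here is a simplified sketch. It assumes the tokenizer keeps token kinds and drops literal values, and it counts k-grams over those kinds. The regex tokenizer, the function names and the maxOccurrences threshold are all stand-ins for this example, not the dolos implementation.

```typescript
// Crude stand-in for a syntax-aware tokenizer: keeps token kinds, drops literal values.
function tokenKinds(source: string): string[] {
  const kinds: string[] = [];
  const pattern = /"[^"]*"|[A-Za-z_]\w*|[{}:,;=]/g;
  let found: RegExpExecArray | null;
  while ((found = pattern.exec(source)) !== null) {
    const text = found[0];
    if (text.charAt(0) === '"') {
      kinds.push("string");
    } else if (/^[A-Za-z_]/.test(text)) {
      kinds.push("identifier");
    } else {
      kinds.push(text);
    }
  }
  return kinds;
}

// Count how often each window of k consecutive token kinds occurs.
function kgramCounts(tokens: string[], k: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + k <= tokens.length; i++) {
    const gram = tokens.slice(i, i + k).join(" ");
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Hypothetical filter: drop every k-gram seen more than maxOccurrences times.
function dropFrequent(counts: Map<string, number>, maxOccurrences: number): string[] {
  const kept: string[] = [];
  counts.forEach((count, gram) => {
    if (count <= maxOccurrences) {
      kept.push(gram);
    }
  });
  return kept;
}

const sample = 'let uren = {\n"01": "EEN",\n"02": "TWEE",\n"03": "DRIE",\n"04": "VIER"\n};';
const sampleCounts = kgramCounts(tokenKinds(sample), 4);
console.log(sampleCounts.get("string : string ,")); // 3
```

Every entry except the last yields the same k-gram, so under these assumptions the count grows with the size of the table, and a filter on occurrence count removes exactly the k-grams that cover these lines.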
These are the results. I used -l javascript samples/js/qlocktwo/5348.js -m 10000 samples/js/qlocktwo/72.js -g 3 as options. I counted the occurrences of each line, so the first number is the line and the second is the number of occurrences. We are interested in lines 64 to 103.
results
[0, 5] [2, 33] [3, 8] [4, 63] [6, 20] [8, 15] [9, 51] [10, 5] [12, 39] [13, 40] [14, 8] [15, 43] [16, 26] [21, 16] [22, 2] [23, 8] [24, 124] [26, 55] [29, 38] [30, 79] [31, 2] [33, 9] [34, 6] [35, 22] [36, 22] [37, 4] [38, 162] [39, 31] [40, 57] [41, 25] [42, 67] [46, 22] [47, 4] [48, 162] [49, 37] [50, 5] [51, 59] [52, 70] [53, 36] [57, 3] [59, 3] [62, 15] [63, 8] [65, 30] [66, 30] [67, 30] [68, 30] [69, 30] [70, 30] [71, 30] [72, 30] [73, 30] [74, 2] [75, 3] [78, 8] [80, 30] [81, 30] [82, 30] [83, 30] [84, 30] [85, 30] [86, 30] [87, 30] [88, 30] [89, 30] [90, 30] [91, 30] [92, 30] [93, 30] [94, 30] [95, 30] [96, 30] [97, 30] [98, 30] [99, 30] [100, 30] [101, 2] [102, 2] [104, 2] [105, 6] [106, 126] [107, 4] [108, 550] [109, 4] [110, 159] [111, 4] [112, 256] [113, 4] [114, 254] [115, 1] [117, 133] [118, 144] [119, 12] [120, 11] [123, 121] [125, 101] [126, 19] [128, 54] [130, 126] [131, 101] [133, 105] [135, 14] [136, 28] [137, 2] [138, 26] [142, 17] [143, 10] [144, 12] [145, 73] [146, 75] [147, 77] [148, 60] [150, 6] [151, 6] [152, 4] [153, 3] [154, 16] [155, 45] [156, 78] [157, 30] [158, 55] [159, 53] [160, 1] [161, 58] [162, 39] [163, 78] [164, 30] [165, 55] [166, 52] [167, 1] [172, 8] [173, 8] [174, 4] [175, 5] [176, 37] [177, 2] [179, 35] [180, 24] [181, 53] [182, 66] [183, 5] [184, 6] [185, 54] [186, 68] [187, 7] [188, 81] [189, 14] [190, 14] [191, 34] [192, 29] [193, 13] [194, 11] [195, 9] [199, 30] [200, 67] [201, 2] [203, 29] [204, 13] [205, 11] [206, 12] [208, 7] [209, 81] [210, 14] [211, 14] [212, 34] [213, 29] [214, 13] [215, 11] [216, 10] [220, 53]
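To read the range of interest more easily, the pairs can be filtered and grouped by occurrence count. The sketch below copies only the pairs for lines 62 to 104 from the list above; countsInRange and histogram are hypothetical helper names.

```typescript
// [line, occurrences] pairs copied from the results above (lines 62 to 104 only).
const lineCounts: Array<[number, number]> = [
  [62, 15], [63, 8], [65, 30], [66, 30], [67, 30], [68, 30], [69, 30], [70, 30],
  [71, 30], [72, 30], [73, 30], [74, 2], [75, 3], [78, 8], [80, 30], [81, 30],
  [82, 30], [83, 30], [84, 30], [85, 30], [86, 30], [87, 30], [88, 30], [89, 30],
  [90, 30], [91, 30], [92, 30], [93, 30], [94, 30], [95, 30], [96, 30], [97, 30],
  [98, 30], [99, 30], [100, 30], [101, 2], [102, 2], [104, 2],
];

// Keep only the pairs whose line number lies in [from, to].
function countsInRange(
  pairs: Array<[number, number]>,
  from: number,
  to: number,
): Array<[number, number]> {
  return pairs.filter(([line]) => line >= from && line <= to);
}

// Map each occurrence count to the number of lines that have it.
function histogram(pairs: Array<[number, number]>): Map<number, number> {
  const result = new Map<number, number>();
  for (const [, occurrences] of pairs) {
    result.set(occurrences, (result.get(occurrences) || 0) + 1);
  }
  return result;
}

const interesting = histogram(countsInRange(lineCounts, 64, 103));
console.log(interesting.get(30)); // 30
```

In this excerpt, 30 of the lines between 64 and 103 each have exactly 30 occurrences.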
from dolos.
Our results seem to contain a lot of doubles, with the file order reversed. These matches also include the same ranges as far as I can see.
Would it make sense to only report on the order with the highest value (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the doubles and the results would become symmetric (which was an issue for the clustering?)
I don't really think that is the best option here, because as far as I can see they are perfectly symmetrical in which case we can just filter them out. If they aren't symmetrical it would make more sense to join both matches into one, so it contains everything from both so no information is lost. I'll test if they are symmetrical and reply back here.
The results seem to be symmetric, as far as I can tell from the test results.
I used node ./dist/app.js -l javascript -s 1 -o json samples/js/qlocktwo/*.js -g 0 -v 0 > temp.json to generate the results and the following code to test the symmetry.
code
After some more testing I found that the results are not at all symmetrical. My earlier mistake was to work with filtered output, which caused the asymmetrical results to be removed. That said, most of the asymmetrical results seem to be bad matches.
new code
import fs from "fs";
import path from "path";
import { CodeTokenizer } from "./lib/codeTokenizer";
import { Comparison, Matches } from "./lib/comparison";
import { JSONFormatter, JSONSummaryFormat } from "./lib/jsonFormatter";
import { Range } from "./lib/range";
import { FilterOptions, RangesTuple, Summary } from "./lib/summary";

(async () => {
  const mapLocation: string = path.resolve("./samples/js/qlocktwo/");
  const locations: string[] = fs
    .readdirSync(mapLocation, "utf8")
    .map(location => `${mapLocation}/${location}`);
  const tokenizer = new CodeTokenizer("javascript");
  const comparison = new Comparison(tokenizer, {
    filterHashByPercentage: undefined,
    maxHash: 200,
  });
  comparison.addFiles(locations);
  const matchesPerFile: Map<string, Matches<number>> = await comparison.compareFiles(locations);
  const filterOptions: FilterOptions = {
    minimumFragmentLength: 1,
  };
  const summary = new Summary(matchesPerFile, 0, filterOptions, 0);
  () => summary;
  const jsonResults: JSONSummaryFormat = JSON.parse(
    fs.readFileSync(path.resolve("temp.json"), "utf8"),
    JSONFormatter.JSONReviverFunction,
  );
  testRawResults(matchesPerFile);
  () => testSummaryResults(jsonResults.results); //TODO
})();

function testRawResults(results: Map<string, Matches<number>>): void {
  let reversed: number = 0;
  let normal: number = 0;
  const resultsMap: Map<string, Array<[number, number]>> = new Map();
  for (const [file1, matches] of results.entries()) {
    for (const [file2, matchingLines] of matches.entries()) {
      // Build the key from an ordered copy of the pair. Swapping file1 in place
      // would change the outer loop variable for the remaining inner iterations.
      const [first, second] = file1 < file2 ? [file2, file1] : [file1, file2];
      const key = first + second;
      if (!resultsMap.has(key)) {
        normal += 1;
        resultsMap.set(key, matchingLines);
      } else {
        reversed += 1;
        const otherMatches = resultsMap.get(key) as Array<[number, number]>;
        if (!areMatchingLinesEqual(matchingLines, otherMatches)) {
          console.log(
            `${first}'s (${matchingLines.length}) and ${second}'s (${otherMatches.length}) results aren't symmetrical`,
          );
        }
      }
    }
  }
  // Print the totals once, after every pair has been visited.
  console.log(normal, reversed);
}

// True when every line pair in lines1 has a mirrored counterpart in lines2.
function areMatchingLinesEqual(
  lines1: Array<[number, number]>,
  lines2: Array<[number, number]>,
): boolean {
  for (const [l11, l12] of lines1.values()) {
    if (lines2.findIndex(([l21, l22]) => l11 === l22 && l12 === l21) === -1) {
      console.log(lines1);
      return false;
    }
  }
  return true;
}

function testSummaryResults(results: Array<Array<[string, string, RangesTuple[]]>>): void {
  let reversed: number = 0;
  let normal: number = 0;
  const resultsMap: Map<string, RangesTuple[]> = new Map();
  for (const group of results) {
    for (let [file1, file2, matches] of group) {
      if (file1 < file2) {
        [file1, file2] = [file2, file1];
      }
      if (!resultsMap.has(file1 + file2)) {
        normal += 1;
        resultsMap.set(file1 + file2, matches);
      } else {
        reversed += 1;
        const otherMatches: RangesTuple[] = resultsMap.get(file1 + file2) as RangesTuple[];
        if (!areMatchesEqual(matches, otherMatches)) {
          console.log(
            `${file1}'s (${matches.length}) and ${file2}'s (${otherMatches.length}) results aren't symmetrical`,
          );
        }
      }
    }
  }
  console.log(normal, reversed);
}

function areMatchesEqual(matches: RangesTuple[], mirroredMatches: RangesTuple[]): boolean {
  for (const match of matches.values()) {
    if (
      mirroredMatches.findIndex(potentialMatch =>
        areRangesTuplesMirroredEqual(match, potentialMatch),
      ) === -1
    ) {
      console.log(match);
      return false;
    }
  }
  return true;
}

function areRangesTuplesMirroredEqual([r11, r12]: RangesTuple, [r21, r22]: RangesTuple): boolean {
  return areRangesEqual(r11, r22) && areRangesEqual(r12, r21);
}

function areRangesEqual(range1: Range, range2: Range): boolean {
  return range1.from === range2.from && range1.to === range2.to;
}
from dolos.
This has been done in the upcoming publication.
from dolos.