mantono / duplicatesearcher Goto Github PK

Identification of Duplicate Tickets in Issue Tracking Systems for Software Development

Java 99.10% Shell 0.73% CSS 0.16%

duplicatesearcher's Introduction

☕ I am a software developer that loves coffee and bicycling, with an affection for state machines, event sourcing and functional programming

duplicatesearcher's People

Contributors

Watchers

duplicatesearcher's Issues

Filter out pull requests from issues

From https://developer.github.com/v3/issues/#list-issues-for-a-repository

Note: In the past, pull requests and issues were more closely aligned than they are now. As far as the API is concerned, every pull request is an issue, but not every issue is a pull request.

This endpoint may also return pull requests in the response. If an issue is a pull request, the object will include a pull_request key.

We will have to check the pull_request key and remove pull requests from our issue collection, as they will contribute anything to our artefact.

Tokenization of Apostrophes

Currently any use of an apostrophe (') in input text will be replaced with a white space. This behaviour is not ideal since it will change the semantic meaning of text even before it's split up into to tokens. And worse is that it may result in words that do not exist and its influence on similarity among issues may be influenced since it might neither be spell corrected properly (maybe, maybe not) nor filtered by any stop list, even tough this mostly applies to common words which should be removed by stop lists.

Example

we're will be turned into we re
it's will be turned into it s

Option 1: The most naive approach is to remove apostrophes entirely any ignore and further complications by that, but it would sometimes have unintended side-effects as we're (we are) turning into were which is a completely different word. It is possible that these differences would be erased by use of stop lists if all the resulting words will be caught by those.

Option 2: Not remove apostrophes at all. This might be more realistic if stemming can be properly done on words that contains apostrophes. Most stop words lists seems to take apostrophes for granted in the tokens/words.

Option 3: Remove apostrophes on possessive form like Anton's --> Anton and whenever it is used together with she, he, it or similar together with is as with she's --> she. Finally, apostrophes in conjunction with are or will will be separated so you're ---> you are, we're --> we are, we'll --> we will. This option may require most work (but still not much more than three or four regular expressions, but may give the most accurate result.

Use Map to cache result of processing through IssueProcessor

Most Tokens being processed through the various TokenProcessor classes is probably being processed multiple time (unless a Token only occurs once in the entire repository). Caching the result of this processing and saving the result in a Map for later lookup could have a tremendous performance improvement since that operation is very quick.

This would mean that the current implementation of how IssueProcessor and TokenProcessor works would have to change slightly, since right a TermFrequencyCounter is passed around and processed rather than single tokens, but this should not be too hard to change.

Create HTML reports for experiment results of ExperimentEvaluator

Generate tables with reports of true positives, false positives and false negatives. It should also display

ProcessingFlags used for the experiment
Weighting parameters for Analyzer
Summary statistics as precision, recall and F1-score

ExperimentSetGenerator sometimes refer to pull requests as master issue

Consider a scenario like this one. A regular issue (11538) has a comment which refers to another issue (10348) as a master of this duplicate, only this issue isn't a issue, it's a pull request. But we remove pull requests, so when we try to retrieve the master of duplicate 11538, we get a NullPointerException.

Stemming

What should be done?
Stemming on a set of words (use Snowball?)

Issues it depends on: #6

Crawl GitHub for word frequency data

Spell correction

What should be done?
Correct misspelled words.

When is it finished?
When all words in the corpus have correct spelling (as far as possible).

Issues it depends on: #4

Parse code when tab is used instead of "`"

Synonyms lookup

What should be done?
Use synonyms for finding relevant issues.

Issues it depends on: #6

Add dynamic distance threshold to Levenshtein

Having threshold 2 or 3 as discussed is generally a good idea, but far from optimal on shorter words. If a shortening exists in an issue that does not exists in the dictionary it will always be "corrected" to a word of same or similar length if threshold value is the same or larger then the length of the word (I think).

Many times the shortened form of a word is intentional and correct, but if it does not exist is the dictionary if will always removed in favour for a word that exists in it. The token xx would for example be corrected to another value, but it would most likely replace both of the x's, resulting in a value with no resemblance of the original token. Such a conversion is not only a waste of resources, but will most likely have a negative impact on the final analysis when comparing content of issues.

Suggestion
The threshold value should depend on the length of the token being checked so really short tokens (size of 1 or 2) should not be spell corrected at all, and remaining sizes should have a lower threshold for short tokens. Preferably somewhere along the line of Math.ceil(Math.log(tokenLength)) or tokenLength/3.

Analysis

What should be done?
Analysis of all issues comparing them to each other (not neccessarily n²)

Input: A set of issues
Output: All issues which are duplicates of another issue in the input set.

Issues it depends on: #4

Tokenization

What should be done?
Transform all unstructured text data to a set of strings.

When is it finished?

All characters separated by non-letter character are broken down to words
- Whitespace
- Punctunations
- ":" and ";"
All input is transformed to lower case

Issues it depends on: #3

Stop list (general/common)

What should be done?
Implementing a stop list for common words in the English language.

Issues it depends on: #6

Implement experiment evaluator

Implement functionality to evaluate the result from an experiment. When a data set has been searched for duplicates, compare the found duplicates against the known duplicates.

Download master issues of downloaded duplicates

What should be done?
For each download issues that is marked as a duplicate and has a reference to another master issue, this issue must also be downloaded.

When is it finished?
When all downloaded duplicate issues have their master issue downloaded, or they are referencing a master issue through a chain of other duplicate issues. A master issue is a issue referenced from a duplicate issue with a link to the master issue and either the term duplicate or dupe.

Issues it depends on: #1, #3

Only issues with comments are saved

Since we use a map to save our issues with <Issue, List, this has the effect that only issues with a value (issues that has at least one comment) are saved. All issues that do not have a comment is ignored.

Filter out mentions/usernames from issues or comments

Some issues and comments contains mentions of other users like this: @mantono , which will cause a false similarity between issues were the same person is mentioned but nothing else is similar. A possible different solution would be to create another type of stop list consisting of all user names that are relevant for this repository.

Making an API request to repos/$OWNER/$REPO/contributors would yield a list of all users ever contributed to the repository, which should be a good start.

Remove label [Dd]duplicate on issue processing

Create new issue class specialized for analysis

Input: An instance of class Issue (egit version) and a Collection (egit version of comment)
Output: An instance of our Issue class, which can keep data in other forms more suitable for NLP.
Issues it depends on: #4

Create program for generation of human ceiling issues

Stop list (issue template)

What should be done?
Implementing a stop list from words present in the current ISSUE_TEMPLATE file for a repository.

Issues it depends on: #6

Enable passing weights as an argument thorugh main argument vector

IssueComponent field data in Report is missing when an Issue is reported multiple times

Improve performance of spell correction

Spell correction may work right now, but the performance is nowhere close to good enough for any real world application.

This is the result for roughly 30 issues;
Time required without spell correction for analysis: 0.677 seconds
Time required with spell correction for analysis: 18.8 seconds

The greatest problem right now is that the time complexity is O(n), which is detrimental on dictionaries of sizes 100,000 - 400,000 words.

Enable processing flags with command line flags

Enabling/disabling processing flags to the IssueProcessor is now done through editing the code directly. I would be preferable if this could be done from the command line instead when invoking the program.

Spell correction

What should be done?
Correct spelling on words that are not spelled correctly, if a correct version can be recognized.

When is it finished?
When a given input of a set of words containing

correctly spelled words
incorrectly spelled words
words that cannot be identified

returns

correctly spelled words
words that cannot be identified (flag these?)
with no duplicate occurrences of words (hence a set).

Issues it depends on: #4

Only one instance of duplicate can be saved for eash issue

Currently there is a many-to-many relation between a duplicate and its master, such as;

4 --> 5
4 --> 6
3 --> 6
3 --> 7

But allowing each issue to only be recorded once as a duplicate may reduce the amount of false duplicates significantly. This would instead give;

4 --> 5
3 --> 6

where the duplicate pair with highest cosine similarity would be kept, and possibly replacing earlier pair of duplicate --> master issues if a newer combination is found with higher similarity.

Stop list (GitHub terminology)

Create a static stop list with words commonly used in the domain of GitHub that can be expected to appear in issues regardless of the issue's characteristics. Examples are repository, issue, duplicate, dupe, pull, commit, GitHub and etc.
Do no forget to include GitHub emojis (basically everything with colons like 🎱 💯 🔢)

Issues it depends on: #6

Remove static stop list for issue template in favor for dynamic stop list

It really makes no sense in keeping the static one as the performance cannot possibly be better with it. It was only easier to implement than the dynamic issue template stop list, but since both now exists it we should probably remove the static version.

URLs are not properly parsed/removed

Parsing https://commons.apache.org/sandbox/commons-text/jacoco/org.apache.commons.text.similarity/LevenshteinDistance.java.html returns words [jacoco, text]

Filter out numbers from issues and comments

Numbers are not currently filtered out with the tokenizer, this is less than optimal and should be fixed.

Read and parse comments

Read data in comments as a complement to the data in the issue description. Apply the same techniques as used on the issue description, but do not add the words to the same set as they should probably be weighted different.

Issues it depends on: #3

Filter out URLs from issues and comments

URLs does not offer any additional value, especially since they are broken up into smaller tokens which does not keep the context or intent of posting the URL. Certain parts of it will rather be detrimental to the identification of algorithms, since almost every issue containing a URL will contain either http or https as a token, while have possibly nothing else in common. All URLs should therefore be filtered, however, it is important that any mention of http or https is kept when it is not part of a URL.

Check if an issue is viable for analysis or should be filtered

    /**
     * Check if this issue contains enough textual data to actually be analyzed
     * and compared to other issues (after stop lists and are applied)
     * 
     * @return true if it is considered viable for analysis, else false.
     */
    public boolean isViable()
    {
        // logic here
    }

Filter out code from comments and issues

Everything that's marked as code either with ` or ``` should be removed or saved separately.

Incosistent number of identified duplicates

See feef75e. Should be tested on smaller sets to see if the problem can be reproduced there.

IssueProcessor turns "mhz" into null

This is not what we had in mind.

Replace String on String operations

String is immutable and it is possible, considering the amount of data manipulation operations done on Strings, that a performance improvement can be done if a mutable class can be used instead.

Stop list (issue template) - Dynamic

Create a custom stop list from the ISSUE_TEMPLATE that is based on the creation date of an issue, since the issue template may change over time and not one single stop list may be relevant for all issues.

Issues it depends on: #7

Two issues of same author cannot be duplicates

(possibly unless they are made just right after each other, n and n+1)

I think these false positive in rust-lang/rust gives a pretty good idea of why this is needed...

0.8911434210043763 (15330 --> 17073)
0.9184277404963642 (15328 --> 17073)
0.8391425631371754 (17065 --> 17078)
0.8746758091425836 (17072 --> 17078)
0.7746694529930958 (17076 --> 17078)
0.8744661988225623 (17073 --> 17078)

Critical bug in ExperimentSetGenerator

Requested corpus size (500)
exGen.generateRandomIntervalSet(500, 0.3f, 0.6f);

0.7235879121264779 (392 --> 393)
0.8473679692833325 (795 --> 796)
0.7033872212896328 (22 --> 46)
0.9234488663176349 (8742 --> 8743)
Execution time:PT22.127S
Found duplicates: 4
Duplicates in corpus: 778
Precision: 1.0
Recall: 0.005141388174807198
F1-score: 0.010230179028132991

The amount of duplicates are larger than the requested set size.... Something is wrong...?

I think we calculate the F1-score on the entire data set rather than the one generated by the ExperimentSetGenerator. This gives us an entirely inccorect F1-Score (luckily, our recall will improve when we fix this).

Read issue data from GitHub and create Issue Instances

What should be done?
Parse downloaded data and create Issue instances (use framework).

When is it finished?
When each downloaded json issue is represented as Java object of class Instance.

Issues it depends on: #1

Change Report to generate filename based on timestamp

Refactor StrippedIssue

Change StrippedIssue to be more independent, if possible. Try to chose arrays in favor of FrequencyCounter inside the class. Make sure it is designed well and with simplicity in mind so it can conform to the need of fixing #22 when a map is not used. See also #18.

Download issues from GitHub

What should be done?
Implementing functionality for downloading issues from GitHub via its API.

When is it finished?

When we can download at least 2000 issues from a specific repository on a single request.
We must be able to chose if those issues are open or closed.

Compare issues on label

Use labels when comparing issues to each other.

Make sure we only use closed issues in data set generated by ExperimentSetGenerator

This may already be done, but I put it here as a reminder to check on it as an extra caution. This must not be forgotten.

Master issues of duplicate issues can still be open.

Make sure CharSequence is not used incorrecrly

public interface CharSequence

A CharSequence is a readable sequence of char values. This interface provides uniform, read-only access to many different kinds of char sequences. A char value represents a character in the Basic Multilingual Plane (BMP) or a surrogate. Refer to Unicode Character Representation for details.

This interface does not refine the general contracts of the equals and hashCode methods. The result of comparing two objects that implement CharSequence is therefore, in general, undefined. Each object may be implemented by a different class, and there is no guarantee that each class will be capable of testing its instances for equality with those of the other. It is therefore inappropriate to use arbitrary CharSequence instances as elements in a set or as keys in a map.

Graph based data structure

Create a graph based data structure where related issues will have edges. This is in order to reduce the time complexity for lookup/similarity comparison as an array based model would have O(n²), which does not scale well for larger repositories. Due to limited time, this is a low priority issue.

Issues it depends on: #11

mantono / duplicatesearcher Goto Github PK

duplicatesearcher's Introduction

duplicatesearcher's People

Contributors

Watchers

duplicatesearcher's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs