dlidstrom / duplo

Duplicates finder for various source code formats.

License: GNU General Public License v2.0

C++ 74.71% XSLT 5.53% CMake 0.47% Shell 11.29% C 5.65% Dockerfile 0.68% Ada 1.68%
c-plus-plus code-quality duplicate-detection

duplo's Issues

Unique Duplicate LOC count

I was doing some testing with Duplo. In some situations the reported Total_lines_of_code is smaller than Duplicate_lines_of_code.
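
One possible explanation, assuming (this is not confirmed from Duplo's source here) that the duplicate total is summed per reported block: a line that belongs to several matching blocks is then counted more than once. A minimal sketch of the difference follows. `Block`, `countDuplicateLinesNaive` and `countUniqueDuplicateLines` are hypothetical names, not Duplo's API.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of one reported duplicate block.
struct Block {
    std::string file;    // file containing the block
    unsigned startLine;  // first line of the block
    unsigned lineCount;  // number of lines in the block
};

// Naive total: sums block lengths, so overlapping blocks are counted repeatedly.
unsigned countDuplicateLinesNaive(const std::vector<Block>& blocks) {
    unsigned total = 0;
    for (const auto& b : blocks) total += b.lineCount;
    return total;
}

// Unique total: each (file, line) pair is counted at most once.
std::size_t countUniqueDuplicateLines(const std::vector<Block>& blocks) {
    std::set<std::pair<std::string, unsigned>> seen;
    for (const auto& b : blocks)
        for (unsigned i = 0; i < b.lineCount; ++i)
            seen.emplace(b.file, b.startLine + i);
    return seen.size();
}
```

With a unique count, the duplicate total can never exceed the number of lines that were actually loaded.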

XML Output is broken

The XML declaration and the opening root tag are not written to output.xml.
I suggest the following change to Duplo.cpp to fix the issue:

@@ -350,6 +350,14 @@
     }
 
     std::cout << "Loading and hashing files ... " << std::flush;
+
+    if (options.GetOutputXml()) {
+        outfile
+            << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
+            << std::endl
+            << "<duplo>"
+            << std::endl;
+    }
 
     auto lines = LoadFileList(options.GetListFilename());
     auto [sourceFiles, matrix, files, locsTotal] =

P.S. Thank you for the great software :)

Duplo parameters configuration

Hi! Is there a way to change the configuration parameters (e.g. the minimal block size or the minimal number of blocks)? If so, how can it be done?

Docker image

Like this?

FROM alpine:3.11 AS build

RUN apk --no-cache add \
    alpine-sdk cmake

RUN mkdir -p /usr/src/ && \
    git clone https://github.com/dlidstrom/Duplo /usr/src/Duplo

WORKDIR /usr/src/Duplo

RUN mkdir build && cd build && cmake .. && make

FROM scratch

WORKDIR /app
COPY --from=build /usr/src/Duplo/build/duplo .

ENTRYPOINT ["./duplo"]

See this.

Version from tag

A few pieces are required:

cmake .. -DDUPLO_VERSION='"0.10.0"'
add_compile_definitions(DUPLO_VERSION=${DUPLO_VERSION})
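
On the C++ side, the definition could be consumed roughly as follows. This is a sketch only: the `"unknown"` fallback and the `duploVersion` helper are assumptions for illustration. The extra quoting on the cmake command line is what makes the macro expand to a string literal.

```cpp
#include <string>

// DUPLO_VERSION is expected to arrive from the build system as a quoted
// string literal; the fallback below is an assumption for local builds.
#ifndef DUPLO_VERSION
#define DUPLO_VERSION "unknown"
#endif

// Hypothetical helper returning the version baked in at compile time.
std::string duploVersion() {
    return DUPLO_VERSION;
}
```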

Single file duplication not detected

This file duplication is not detected:

AAAAA
BBBBB
CCCCC
DDDDD
EEEEE
/* some comment to offset the line numbers */
AAAAA
BBBBB
CCCCC
DDDDD
EEEEE

I've traced the problem into Duplo.cpp.

Output:

Loading and hashing files ... 1 done.

tests/Simple/LineNumbers.cMinBlockSize: 1
Found match at 0
Found match at 11
Found match at 22
Found match at 33
Found match at 44
Found match at 55
Found match at 66
Found match at 77
Found match at 88
Found match at 99
Should be here
line1: 0 line2: 0
Found match at 50
Found match at 61
Found match at 72
Found match at 83
Found match at 94
Should be here
line1: 5 line2: 5
 nothing found.

bug.diff.txt
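
For reference, a minimal sketch of how repeated blocks inside a single file can be found. This is not Duplo's actual implementation: `Match` and `findSelfDuplicates` are hypothetical names, and the input is assumed to be the already filtered lines (comments removed). Each line is compared only with later lines, so the trivial match of a line with itself is never visited.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: two block starts (0-based) and the run length.
struct Match {
    std::size_t first;
    std::size_t second;
    std::size_t length;
};

// Finds repeated runs of at least minBlockSize identical lines inside one file.
// Only pairs with i < j are visited, so a line is never matched with itself.
std::vector<Match> findSelfDuplicates(const std::vector<std::string>& lines,
                                      std::size_t minBlockSize) {
    std::vector<Match> matches;
    const std::size_t n = lines.size();
    // Walk each diagonal above the main one; offset is the distance j - i.
    for (std::size_t offset = 1; offset < n; ++offset) {
        std::size_t run = 0;
        for (std::size_t i = 0; i + offset < n; ++i) {
            if (lines[i] == lines[i + offset]) {
                ++run;
            } else {
                if (run > 0 && run >= minBlockSize)
                    matches.push_back({i - run, i - run + offset, run});
                run = 0;
            }
        }
        // Flush a run that reaches the end of the diagonal.
        if (run > 0 && run >= minBlockSize) {
            const std::size_t start = n - offset - run;
            matches.push_back({start, start + offset, run});
        }
    }
    return matches;
}
```

For the ten code lines of the example file above (the comment line filtered out), this reports one match: the block at index 0 repeated at index 5 with length 5. Runs that overlap themselves are not handled in this sketch.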

Output real block length (not only filtered)

The following output reports LineCount="10" on the set element, where 10 is the "effective" length of the duplicated block (i.e. excluding empty lines, comments, defines, etc.).

It would be nice to output real block lengths as well (i.e. with no excludes), for example like this:

<set LineCount="10">
<block SourceFile="/home/osboxes/Work/qvge/src/3rdParty/ogdf-2020/src/ogdf/uml/PlanRepUML.cpp" StartLineNumber="1201" LineCount="13"/>
<block SourceFile="/home/osboxes/Work/qvge/src/3rdParty/ogdf-2020/src/ogdf/uml/PlanRepUML.cpp" StartLineNumber="903" LineCount="15"/>
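
A minimal sketch of how the real length could be derived, assuming every filtered line remembers its original line number. `FilteredLine` and `realBlockLength` are hypothetical names and do not come from Duplo's source.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record for a line that survived filtering, remembering
// where it sat in the original file (1-based).
struct FilteredLine {
    unsigned originalLineNumber;
};

// Returns the unfiltered span covered by `count` filtered lines starting at
// index `start`. Requires count > 0 and start + count <= lines.size().
unsigned realBlockLength(const std::vector<FilteredLine>& lines,
                         std::size_t start, std::size_t count) {
    const unsigned first = lines[start].originalLineNumber;
    const unsigned last = lines[start + count - 1].originalLineNumber;
    return last - first + 1;
}
```

The real length is simply the distance between the first and last original line of the block, so skipped lines in between are included.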

Ensure at least minimum number of lines in analysis

Change the algorithm to this:

// support reporting filtering by both:
// - "lines of code duplicated", &
// - "percentage of file duplicated"
const unsigned int lMinBlockSize = std::max(
    m_minBlockSize, std::min(
        m_minBlockSize, 
        (std::max(n,m)*100)/m_blockPercentThreshold
    )
);

Try to use the Clamp function.
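
Note that with the same value used as both bounds, as in the snippet above, std::max(a, std::min(a, x)) always evaluates to a, so the percentage term has no effect. std::clamp (C++17, in <algorithm>) expresses the bounded form directly. The sketch below only illustrates the equivalence; the function names are hypothetical, and which lower and upper bounds are right for Duplo is left open.

```cpp
#include <algorithm>

// The nested form from the snippet, with both bounds set to the same value.
unsigned nestedSameBound(unsigned bound, unsigned value) {
    return std::max(bound, std::min(bound, value));  // always equals bound
}

// std::clamp(value, lo, hi) is equivalent to std::max(lo, std::min(hi, value))
// when lo <= hi; passing lo > hi is undefined behavior.
unsigned clampBlockSize(unsigned value, unsigned lo, unsigned hi) {
    return std::clamp(value, lo, hi);
}
```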

Support for gcc assembly

Only in Duplo-S: ArgumentParser.o
Only in Duplo-S: ArgumentParser.s
Only in Duplo-S: duplo
diff -urp Duplo-master/Duplo.cpp Duplo-S/Duplo.cpp
--- Duplo-master/Duplo.cpp  2010-09-23 15:13:43.000000000 +0800
+++ Duplo-S/Duplo.cpp   2013-10-29 09:08:00.056444810 +0800
@@ -18,6 +18,7 @@

 #include <fstream>
 #include <time.h>
+#include <string.h>

 #include "SourceFile.h"
 #include "SourceLine.h"
Only in Duplo-S: Duplo.o
Only in Duplo-S: Duplo.s
Only in Duplo-S: files.txt
diff -urp Duplo-master/FileType.cpp Duplo-S/FileType.cpp
--- Duplo-master/FileType.cpp   2010-09-23 15:13:43.000000000 +0800
+++ Duplo-S/FileType.cpp    2013-10-29 09:29:16.924524640 +0800
@@ -7,6 +7,7 @@ static const std::string FileTypeExtn_CX
 static const std::string FileTypeExtn_H = "h";
 static const std::string FileTypeExtn_HPP = "hpp";
 static const std::string FileTypeExtn_Java = "java";
+static const std::string FileTypeExtn_S = "s";
 static const std::string FileTypeExtn_CS = "cs";
 static const std::string FileTypeExtn_VB = "vb";

@@ -51,6 +52,10 @@ FileType::FILETYPE FileType::GetFileType
     {
         return FILETYPE_JAVA;
     }
+    if (!FileExtn.compare(FileTypeExtn_S))
+    {
+        return FILETYPE_S;
+    }
     else if (!FileExtn.compare(FileTypeExtn_CS))
     {
         return FILETYPE_CS;
diff -urp Duplo-master/FileType.h Duplo-S/FileType.h
--- Duplo-master/FileType.h 2010-09-23 15:13:43.000000000 +0800
+++ Duplo-S/FileType.h  2013-10-29 09:14:44.484443676 +0800
@@ -16,6 +16,7 @@ public:
         FILETYPE_H,
         FILETYPE_HPP,
         FILETYPE_JAVA,
+        FILETYPE_S,
         FILETYPE_CS,
         FILETYPE_VB
     };
Only in Duplo-S: FileType.o
Only in Duplo-S: FileType.s
Only in Duplo-S: HashUtil.o
Only in Duplo-S: HashUtil.s
diff -urp Duplo-master/Makefile Duplo-S/Makefile
--- Duplo-master/Makefile   2010-09-23 15:13:43.000000000 +0800
+++ Duplo-S/Makefile    2013-10-29 09:05:57.292526981 +0800
@@ -2,7 +2,7 @@
 CC = g++

 # Flags
-CXXFLAGS = -O3
+CXXFLAGS = -Os -fno-rtti -fno-exceptions
 LDFLAGS =  ${CXXFLAGS}

 # Define what extensions we use
Only in Duplo-master: output.txt
Only in Duplo-S: out.txt
diff -urp Duplo-master/SourceFile.cpp Duplo-S/SourceFile.cpp
--- Duplo-master/SourceFile.cpp 2010-09-23 15:13:43.000000000 +0800
+++ Duplo-S/SourceFile.cpp  2013-10-29 15:21:30.708524135 +0800
@@ -67,6 +67,10 @@ SourceFile::SourceFile(const std::string
             tmp = line;
         }

+        if (FileType::FILETYPE_S    == m_FileType)
+            tmp.assign(line,0,line.find(";"));
+
+
        std::string cleaned;
        getCleanLine(tmp, cleaned);

@@ -100,6 +104,11 @@ void SourceFile::getCleanLine(const std:
                     return;
                 }
                 break;
+            case FileType::FILETYPE_S   :
+                if(i < lineSize-1 && line[i] == ';'){
+                    return;
+                }
+                break;
         }
         cleanedLine.push_back(line[i]);
     }
@@ -160,6 +169,14 @@ bool SourceFile::isSourceLine(const std:
              return std::string::npos == tmp.find(PreProc_VB.c_str(), 0, PreProc_VB.length());
           }
           break;
+
+       case FileType::FILETYPE_S   :
+          {
+              const std::string PreProc_S = "ret"; //we can't deduplicate ret AFAIK
+              return std::string::npos == tmp.find(PreProc_S.c_str(), 0, PreProc_S.length());
+         }
+          break;
+
        }
     }

Only in Duplo-S: SourceFile.o
Only in Duplo-S: SourceFile.s
Only in Duplo-S: SourceLine.o
Only in Duplo-S: SourceLine.s
Only in Duplo-S: StringUtil.o
Only in Duplo-S: StringUtil.s
Only in Duplo-S: TextFile.o
Only in Duplo-S: TextFile.s

Checking arbitrary files doesn't work

It currently skips FILETYPE_UNKNOWN files. I'd like to use it on any arbitrary text file.

This patch seems to work:

diff --git a/SourceFile.cpp b/SourceFile.cpp
index 18b5a80..fb6082e 100755
--- a/SourceFile.cpp
+++ b/SourceFile.cpp
@@ -63,7 +63,8 @@ SourceFile::SourceFile(const std::string& fileName, const unsigned int minChars,
                 }
             }
         }
-        if (FileType::FILETYPE_VB == m_FileType) {
+        if (FileType::FILETYPE_VB == m_FileType ||
+            FileType::FILETYPE_UNKNOWN == m_FileType) {
             tmp = line;
         }

Why aren't multiple duplicates merged?

For example:

Suppose file_a.c line 100 is duplicated at file_a.c line 200, file_a.c line 500, and file_a.c line 600. I expect the result to be

file_a.c line 100, file_a.c line 200, and file_a.c line 500 are duplicated, rather than:

line 100 duplicates line 200, line 500 duplicates line 600.
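
One way such merging could be done is to treat every pairwise report as an edge and group the blocks with a union-find structure. The sketch below is an illustration under that assumption, not Duplo's behavior. `mergeDuplicatePairs` and the integer block ids are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <numeric>
#include <utility>
#include <vector>

// Groups pairwise duplicate reports into sets of mutually duplicated blocks.
// Each block is identified by a hypothetical integer id in [0, blockCount).
std::vector<std::vector<std::size_t>> mergeDuplicatePairs(
    std::size_t blockCount,
    const std::vector<std::pair<std::size_t, std::size_t>>& pairs) {
    std::vector<std::size_t> parent(blockCount);
    std::iota(parent.begin(), parent.end(), std::size_t{0});

    // Find the representative of x, halving paths along the way.
    auto find = [&parent](std::size_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    };

    // Union the two sides of every reported pair.
    for (const auto& p : pairs) {
        const std::size_t a = find(p.first);
        const std::size_t b = find(p.second);
        parent[a] = b;
    }

    // Collect members by representative; std::map keeps the output ordered.
    std::map<std::size_t, std::vector<std::size_t>> byRoot;
    for (std::size_t i = 0; i < blockCount; ++i)
        byRoot[find(i)].push_back(i);

    std::vector<std::vector<std::size_t>> groups;
    for (auto& entry : byRoot)
        if (entry.second.size() > 1) groups.push_back(std::move(entry.second));
    return groups;
}
```

Blocks end up in one group only if some reported pair links them, so the pairs (100, 200) and (500, 600) alone still produce two groups. A further pair such as (100, 500) is needed to join them.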

[feature-request] Ability to ignore license/author/version header in source files

It is common, especially in C and C++ source code, to prepend a fixed header with licensing, author, copyright or version information. This header is often present in all the source/header files.

Example:

/*******************************************************************************
* One line to give the program's name and an idea of what it does.
* Copyright (C) yyyy  name of author
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
*******************************************************************************/

Duplo should be able to detect headers like these and strip them before checking for duplicates.
