
dlidstrom / duplo


Duplicates finder for various source code formats.

License: GNU General Public License v2.0

C++ 74.71% XSLT 5.53% CMake 0.47% Shell 11.29% C 5.65% Dockerfile 0.68% Ada 1.68%
c-plus-plus code-quality duplicate-detection

duplo's Introduction

Duplo - Duplicate Source Code Block Finder


Updates:

🔥 v0.8 adds improved Java support

🙌 Help needed! See 8.3 on how to support more languages.


1. General Information

Duplicated source code blocks can harm the maintainability of software systems. Duplo is a tool for finding duplicated code blocks in large code bases. Duplo has special support for some programming languages (for example C, C++, Java, C#, and VB.NET), meaning it can filter out (multi-line) comments and compiler directives. Any other text format is also supported.

2. Maintainer

Duplo was originally developed by Christian M. Ammann and is now maintained and developed by Daniel Lidström.

3. File Format Support

Duplo has built-in support for the following file formats:

  • C/C++ (.c, .cpp, .cxx, .h, .hpp)
  • Java
  • C#
  • VB
  • GCC assembly
  • Ada

This means that Duplo will remove preprocessor directives, block comments, using statements, etc, to only consider duplicates in actual code. In addition, Duplo can be used as a general (without special support) duplicates detector in arbitrary text files and will even detect duplicates found in the same file.

Sample output snippet:

...
src\engine\geometry\simple\TorusGeometry.cpp(56)
src\engine\geometry\simple\SphereGeometry.cpp(54)
    pBuffer[currentIndex*size+3]=(i+1)/(float)subdsU;
    pBuffer[currentIndex*size+4]=j/(float)subdsV;
    currentIndex++;
    pPrimitiveBuffer->unlock();

src\engine\geometry\subds\SubDsGeometry.cpp(37)
src\engine\geometry\SkinnedMeshGeometry.cpp(45)
    pBuffer[i*size+0]=m_ct[0]->m_pColors[i*3];
    pBuffer[i*size+1]=m_ct[0]->m_pColors[i*3+1];
    pBuffer[i*size+2]=m_ct[0]->m_pColors[i*3+2];
...

4. Installation

4.1. Docker

If you have Docker, you can run Duplo with this command:

# Docker on unix
> docker run --rm -i -w /src -v $(pwd):/src dlidstrom/duplo

This pulls the latest image and runs duplo. Note that you'll have to pipe the filenames into this command. A complete commandline sample will be shown below.

4.2. Pre-built binaries

Duplo is also available as a pre-built binary for (Alpine) Linux and macOS. Grab the executable from the releases page.

You can of course build from source as well, and you'll have to do so to get a binary for Windows.

5. Usage

Duplo works with a list of files. You can either specify a file that contains the list of files, or you can pass them using stdin.

Run duplo --help on the command line to see the detailed options.

5.1. Passing files using stdin

In each of the following commands, duplo will write the duplicated blocks into out.txt in addition to the information written to stdout.

5.1.1. Bash

# unix
> find . -type f \( -iname "*.cpp" -o -iname "*.h" \) | duplo - out.txt

Let's break this down. find . -type f \( -iname "*.cpp" -o -iname "*.h" \) searches the current directory recursively (the . part) for regular files (the -type f part) whose names match *.cpp or *.h, ignoring case (the -iname part). The output from find is piped into duplo, which reads the filenames from stdin (the - tells duplo to read the filenames from stdin, a common Unix convention among command-line applications). The result of the analysis is then written to out.txt.

5.1.2. Windows

# windows
> Get-ChildItem -Include "*.cpp", "*.h" -Recurse | % { $_.FullName } | Duplo.exe - out.txt

This works similarly to the Bash command, but uses PowerShell commands to achieve the same effect.

5.1.3. Docker

# Docker on unix
> find . -type f \( -iname "*.cpp" -o -iname "*.h" \) | docker run --rm -i -w /src -v $(pwd):/src dlidstrom/duplo - out.txt

This command also works in a similar fashion to the Bash command, but instead of piping into a local duplo executable, it will pipe into duplo running inside Docker. This is very convenient as you do not have to install duplo separately. You will have to install Docker though, if you haven't already. That is a good thing to do anyway, since it opens up a lot of possibilities apart from running duplo.

Again, similarly to the Bash command, this uses find to find files in the current directory, then passes the file list to Docker which will pass it further into an instance of the latest version of duplo. The working directory in the duplo container should be /src (that's where the duplo executable is located) and the current path of your host machine will be mapped to /src when the container is running. The -i allows stdin of your host machine to be passed into Docker to allow duplo to read the filenames. Any parameters to duplo can be placed at the end of the command as you can see - out.txt has been.

5.2. Passing files using a file

duplo can analyze files specified in a separate file:

# unix
> find . -type f \( -iname "*.cpp" -o -iname "*.h" \) > files.lst
> duplo files.lst out.txt

# windows
> Get-ChildItem -Include "*.cpp", "*.h" -Recurse |  % { $_.FullName } | Out-File -encoding ascii files.lst
> Duplo.exe files.lst out.txt

# Docker on unix
> find . -type f \( -iname "*.cpp" -o -iname "*.h" \) > files.lst
> docker run --rm -i -w /src -v $(pwd):/src dlidstrom/duplo files.lst out.txt

Again, the duplicated blocks are written to out.txt.

5.3. XML output

Duplo can also output XML, and there is a stylesheet that formats the result for viewing in a browser. This can be used as a report tab in your continuous integration tool (GitHub Actions, TeamCity, etc).

6. Feedback and Bug Reporting

Please open an issue to discuss feedback, feature requests and bug reports.

7. Algorithm Background

Duplo uses the same techniques as Duploc to detect duplicated code blocks. See Duca99bCodeDuplication for further information.
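As a rough illustration, Duploc-style detection is usually described as comparing normalized source lines pairwise and reporting diagonal runs of matches in the resulting comparison matrix. The sketch below follows that description for two files held in memory. The names Block, normalize and findDuplicates are hypothetical, and the code is a simplified illustration under those assumptions, not Duplo's implementation.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A run of `length` matching lines starting at startA in file A and startB in file B.
struct Block {
    std::size_t startA;
    std::size_t startB;
    std::size_t length;
};

// Remove all whitespace so formatting differences do not hide a match.
std::string normalize(const std::string& line) {
    std::string out;
    for (unsigned char c : line)
        if (!std::isspace(c)) out.push_back(static_cast<char>(c));
    return out;
}

// Report diagonal runs of equal, non-empty lines that are at least minBlockSize long.
std::vector<Block> findDuplicates(const std::vector<std::string>& a,
                                  const std::vector<std::string>& b,
                                  std::size_t minBlockSize) {
    std::vector<std::string> na, nb;
    for (const auto& l : a) na.push_back(normalize(l));
    for (const auto& l : b) nb.push_back(normalize(l));

    std::vector<Block> blocks;
    // Walk one diagonal of the (implicit) comparison matrix from cell (i, j).
    auto scan = [&](std::size_t i, std::size_t j) {
        std::size_t run = 0;
        for (; i < na.size() && j < nb.size(); ++i, ++j) {
            if (!na[i].empty() && na[i] == nb[j]) { ++run; continue; }
            if (run >= minBlockSize) blocks.push_back({i - run, j - run, run});
            run = 0;
        }
        if (run >= minBlockSize) blocks.push_back({i - run, j - run, run});
    };
    // Diagonals start in the first row and in the first column.
    for (std::size_t j = 0; j < nb.size(); ++j) scan(0, j);
    for (std::size_t i = 1; i < na.size(); ++i) scan(i, 0);
    return blocks;
}
```

This sketch compares strings directly and never materializes the matrix. A real tool would typically compare precomputed line hashes to keep the pairwise comparison cheap.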

7.1. Performance Measurements

System   Files   LOC      Time
Quake2   266     102740   18 sec

8. Developing

8.1. Unix

You need CMake and preferably fswatch for the best experience.

# build dependencies
/> brew install cmake
/> brew install fswatch

Compiling is best done using the continuous file watcher:

# CMake builds in the build folder
/> mkdir build
/> pushd build
build/> cmake ..
# now issue make
build/> make
build/> popd
# continuous build can now be used in root folder
# (needs fswatch)
> ./watch.sh

8.2. Windows

Use Visual Studio 2019 to open the included solution file (or try CMake).

8.3. Additional Language Support

Duplo can analyze all text files regardless of format, but it has special support for some programming languages (C++, C#, Java, for example). This allows Duplo to improve the duplication detection as it can ignore preprocessor directives and/or comments.

To implement support for a new language, there are a couple of options:

  1. Implement FileTypeBase, which has support for handling comments and preprocessor directives. You just need to decide what counts as a comment. With this option you need to implement a couple of methods, one of which is CreateLineFilter. Its purpose is to remove multi-line comments. Look at CstyleCommentsFilter for an example.
  2. Implement the IFileType interface directly. This gives you the most freedom but is also the hardest option.

You can see an example of how Java support was added effortlessly. It involves copying an existing file type implementation and adjusting the lines that should be filtered and how comments should be removed. Finally, add a few lines in FileTypeFactory.cpp to choose the correct implementation based on the file extension. Refer to this commit for all the details.
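To give a feel for what a multi-line comment filter has to do, here is a small standalone sketch of a stateful C-style comment stripper. The class name BlockCommentStripper and its process method are hypothetical; they do not reproduce the FileTypeBase, CreateLineFilter or CstyleCommentsFilter interfaces, whose exact signatures are not shown here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a line filter; not Duplo's actual interface.
// Simplification: comment markers inside string literals are not recognized.
class BlockCommentStripper {
public:
    // Returns `line` with /* ... */ and // comments removed.
    // State is kept between calls so a block comment may span several lines.
    std::string process(const std::string& line) {
        std::string out;
        std::size_t i = 0;
        while (i < line.size()) {
            if (inComment_) {
                const std::size_t end = line.find("*/", i);
                if (end == std::string::npos) return out;  // comment continues
                inComment_ = false;
                i = end + 2;
            } else if (line.compare(i, 2, "/*") == 0) {
                inComment_ = true;
                i += 2;
            } else if (line.compare(i, 2, "//") == 0) {
                break;  // the rest of the line is a comment
            } else {
                out.push_back(line[i++]);
            }
        }
        return out;
    }

private:
    bool inComment_ = false;
};
```

For a language with different comment syntax, only the marker strings and the single-line rule would need to change, which matches the "copy an existing implementation and adjust it" approach described above.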

8.4. Language Suggestions

  • JavaScript (easy, just look at the existing C-based ones)
  • Ruby
  • Python
  • Perl
  • PHP
  • Rust
  • F#
  • Scala
  • Haskell
  • Erlang
  • What else?

Send me a pull request!

9. Changes

  • 0.8
    • Add support for Java which was lost or never there in the first place
  • 0.7
  • 0.6
    • Improved integration with duploq
  • 0.5
  • 0.4
    • Significant performance improvements
    • Using modern C++ techniques
    • Modularized to simplify adding support of new file formats
    • Can pass files using stdin
  • 0.3
    • Updated links in html output to GitHub
    • Support for gcc assembly (.s)
    • Fixed minimum number of lines in analysis
    • Fixed limitation of total number of lines of code
    • Checking of arbitrary files

10. Accompanying Software

For a pretty UI you should check out duploq by ArsMasiuk:

Duploq

From duploq's Readme file:

duploq's approach is pretty straightforward. First, duploq allows you to choose where to look for the duplicates (files or folders). Then it builds a list of input files and passes it to the Duplo engine together with the necessary parameters. After the files have been processed, duploq parses Duplo's output and visualises the results in an easy and intuitive way. It also provides additional statistical information which is not part of Duplo's output.

11. License

Duplo is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

Duplo is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Duplo; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA


duplo's People

Contributors

arsmasiuk, codemonkey-uk, dlidstrom, knaldgas, ozbolt-abrantix, sangmo-kang, sonofusion82, tyler97


duplo's Issues

Support for gcc assembly

Only in Duplo-S: ArgumentParser.o
Only in Duplo-S: ArgumentParser.s
Only in Duplo-S: duplo
diff -urp Duplo-master/Duplo.cpp Duplo-S/Duplo.cpp
--- Duplo-master/Duplo.cpp  2010-09-23 15:13:43.000000000 +0800
+++ Duplo-S/Duplo.cpp   2013-10-29 09:08:00.056444810 +0800
@@ -18,6 +18,7 @@

 #include <fstream>
 #include <time.h>
+#include <string.h>

 #include "SourceFile.h"
 #include "SourceLine.h"
Only in Duplo-S: Duplo.o
Only in Duplo-S: Duplo.s
Only in Duplo-S: files.txt
diff -urp Duplo-master/FileType.cpp Duplo-S/FileType.cpp
--- Duplo-master/FileType.cpp   2010-09-23 15:13:43.000000000 +0800
+++ Duplo-S/FileType.cpp    2013-10-29 09:29:16.924524640 +0800
@@ -7,6 +7,7 @@ static const std::string FileTypeExtn_CX
 static const std::string FileTypeExtn_H = "h";
 static const std::string FileTypeExtn_HPP = "hpp";
 static const std::string FileTypeExtn_Java = "java";
+static const std::string FileTypeExtn_S = "s";
 static const std::string FileTypeExtn_CS = "cs";
 static const std::string FileTypeExtn_VB = "vb";

@@ -51,6 +52,10 @@ FileType::FILETYPE FileType::GetFileType
     {
         return FILETYPE_JAVA;
     }
+    if (!FileExtn.compare(FileTypeExtn_S))
+    {
+        return FILETYPE_S;
+    }
     else if (!FileExtn.compare(FileTypeExtn_CS))
     {
         return FILETYPE_CS;
diff -urp Duplo-master/FileType.h Duplo-S/FileType.h
--- Duplo-master/FileType.h 2010-09-23 15:13:43.000000000 +0800
+++ Duplo-S/FileType.h  2013-10-29 09:14:44.484443676 +0800
@@ -16,6 +16,7 @@ public:
         FILETYPE_H,
         FILETYPE_HPP,
         FILETYPE_JAVA,
+        FILETYPE_S,
         FILETYPE_CS,
         FILETYPE_VB
     };
Only in Duplo-S: FileType.o
Only in Duplo-S: FileType.s
Only in Duplo-S: HashUtil.o
Only in Duplo-S: HashUtil.s
diff -urp Duplo-master/Makefile Duplo-S/Makefile
--- Duplo-master/Makefile   2010-09-23 15:13:43.000000000 +0800
+++ Duplo-S/Makefile    2013-10-29 09:05:57.292526981 +0800
@@ -2,7 +2,7 @@
 CC = g++

 # Flags
-CXXFLAGS = -O3
+CXXFLAGS = -Os -fno-rtti -fno-exceptions
 LDFLAGS =  ${CXXFLAGS}

 # Define what extensions we use
Only in Duplo-master: output.txt
Only in Duplo-S: out.txt
diff -urp Duplo-master/SourceFile.cpp Duplo-S/SourceFile.cpp
--- Duplo-master/SourceFile.cpp 2010-09-23 15:13:43.000000000 +0800
+++ Duplo-S/SourceFile.cpp  2013-10-29 15:21:30.708524135 +0800
@@ -67,6 +67,10 @@ SourceFile::SourceFile(const std::string
             tmp = line;
         }

+        if (FileType::FILETYPE_S    == m_FileType)
+            tmp.assign(line,0,line.find(";"));
+
+
        std::string cleaned;
        getCleanLine(tmp, cleaned);

@@ -100,6 +104,11 @@ void SourceFile::getCleanLine(const std:
                     return;
                 }
                 break;
+            case FileType::FILETYPE_S   :
+                if(i < lineSize-1 && line[i] == ';'){
+                    return;
+                }
+                break;
         }
         cleanedLine.push_back(line[i]);
     }
@@ -160,6 +169,14 @@ bool SourceFile::isSourceLine(const std:
              return std::string::npos == tmp.find(PreProc_VB.c_str(), 0, PreProc_VB.length());
           }
           break;
+
+       case FileType::FILETYPE_S   :
+          {
+              const std::string PreProc_S = "ret"; //we can't deduplicate ret AFAIK
+              return std::string::npos == tmp.find(PreProc_S.c_str(), 0, PreProc_S.length());
+         }
+          break;
+
        }
     }

Only in Duplo-S: SourceFile.o
Only in Duplo-S: SourceFile.s
Only in Duplo-S: SourceLine.o
Only in Duplo-S: SourceLine.s
Only in Duplo-S: StringUtil.o
Only in Duplo-S: StringUtil.s
Only in Duplo-S: TextFile.o
Only in Duplo-S: TextFile.s

Version from tag

A few pieces are required:

cmake .. -DDUPLO_VERSION='"0.10.0"'
add_compile_definitions(DUPLO_VERSION=${DUPLO_VERSION})
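On the C++ side, the definition passed in by the build can be consumed as an ordinary string literal. The sketch below assumes DUPLO_VERSION arrives already quoted, as in the cmake invocation above, and falls back to a placeholder when the macro is not defined. The helper name versionString is hypothetical.

```cpp
#include <string>

// DUPLO_VERSION is expected to be a quoted string literal supplied by the
// build, for example -DDUPLO_VERSION='"0.10.0"'. Fall back if it is missing.
#ifndef DUPLO_VERSION
#define DUPLO_VERSION "unknown"
#endif

// Hypothetical helper that builds the text a --version style option could print.
std::string versionString() {
    return std::string("Duplo ") + DUPLO_VERSION;
}
```

The fallback keeps local builds working when no tag-derived version is passed on the command line.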

Output real block length (not only filtered)

The following output says "set LineCount="10"" where '10' is "effective" length of the duplicated block (i.e. excluding empty lines, comments, defines etc.).

It would be nice to output real block lengths as well (i.e. with no excludes), for example like this:

<set LineCount="10">
<block SourceFile="/home/osboxes/Work/qvge/src/3rdParty/ogdf-2020/src/ogdf/uml/PlanRepUML.cpp" StartLineNumber="1201" LineCount="13"/>
<block SourceFile="/home/osboxes/Work/qvge/src/3rdParty/ogdf-2020/src/ogdf/uml/PlanRepUML.cpp" StartLineNumber="903" LineCount="15"/>

Single file duplication not detected

This file duplication is not detected:

AAAAA
BBBBB
CCCCC
DDDDD
EEEEE
/* some comment to offset the line numbers */
AAAAA
BBBBB
CCCCC
DDDDD
EEEEE

I've traced the problem into Duplo.cpp.

Output:

Loading and hashing files ... 1 done.

tests/Simple/LineNumbers.cMinBlockSize: 1
Found match at 0
Found match at 11
Found match at 22
Found match at 33
Found match at 44
Found match at 55
Found match at 66
Found match at 77
Found match at 88
Found match at 99
Should be here
line1: 0 line2: 0
Found match at 50
Found match at 61
Found match at 72
Found match at 83
Found match at 94
Should be here
line1: 5 line2: 5
 nothing found.

bug.diff.txt

Ensure at least minimum number of lines in analysis

Change the algorithm to this:

// support reporting filtering by both:
// - "lines of code duplicated", &
// - "percentage of file duplicated"
const unsigned int lMinBlockSize = std::max(
    m_minBlockSize, std::min(
        m_minBlockSize, 
        (std::max(n,m)*100)/m_blockPercentThreshold
    )
);

Try to use the Clamp function.
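One observation on the expression as posted: std::max(a, std::min(a, x)) always evaluates to a, so it reduces to m_minBlockSize regardless of the percentage term. The sketch below spells that out as a free function and shows what a std::clamp (C++17) version might look like. The function names and the clamp bounds (the configured minimum and the length of the longer file) are assumptions made for illustration, not the project's chosen fix. Both functions require a non-zero percentThreshold.

```cpp
#include <algorithm>

// The expression from the issue, written as a free function.
// Because max(a, min(a, x)) == a, this always returns minBlockSize.
unsigned int minBlockSizeAsPosted(unsigned int minBlockSize, unsigned int n,
                                  unsigned int m, unsigned int percentThreshold) {
    return std::max(minBlockSize,
                    std::min(minBlockSize,
                             (std::max(n, m) * 100) / percentThreshold));
}

// Hypothetical clamp-based variant: keep the percentage-derived size within
// [minBlockSize, length of the longer file]. These bounds are an assumption.
unsigned int minBlockSizeClamped(unsigned int minBlockSize, unsigned int n,
                                 unsigned int m, unsigned int percentThreshold) {
    const unsigned int longest = std::max(n, m);
    const unsigned int fromPercent = (longest * 100) / percentThreshold;
    // std::clamp requires lo <= hi, so the upper bound never drops below the minimum.
    const unsigned int hi = std::max(minBlockSize, longest);
    return std::clamp(fromPercent, minBlockSize, hi);
}
```

Either way, the result is never smaller than the configured minimum, which is the property the issue title asks for.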

Docker image

Like this?

FROM alpine:3.11 AS build

RUN apk --no-cache add \
    alpine-sdk cmake

RUN mkdir -p /usr/src/ && \
    git clone https://github.com/dlidstrom/Duplo /usr/src/Duplo

WORKDIR /usr/src/Duplo

RUN mkdir build && cd build && cmake .. && cd .. && make

FROM scratch

WORKDIR /app
COPY --from=build /usr/src/Duplo/build/duplo .

ENTRYPOINT ["./duplo"]

See this.

XML Output is broken

There are no xml heading tags written into output.xml.
I'm suggesting the following change in Duplo.cpp in order to fix the issue:

@@ -350,6 +350,14 @@
     }
 
     std::cout << "Loading and hashing files ... " << std::flush;
+
+    if (options.GetOutputXml()) {
+        outfile
+            << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
+            << std::endl
+            << "<duplo>"
+            << std::endl;
+    }
 
     auto lines = LoadFileList(options.GetListFilename());
     auto [sourceFiles, matrix, files, locsTotal] =

P.S. Thank you for the great software :)

Why haven't multiple duplicates been merged?

For example:

file_a.c line 100 duplicates file_a.c line 200, file_a.c line 500 and file_a.c line 600. I expected the result to be:

file_a.c line 100, file_a.c line 200 and file_a.c line 500 are duplicated

rather than:

line 100 duplicates line 200, line 500 duplicates line 600.
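One way to express the requested merging is to treat each reported pair as an edge between two locations and group connected locations with a union-find structure. The sketch below is a hypothetical post-processing step, not something Duplo is known to do. Locations are plain indices (for example 0 for line 100, 1 for line 200, and so on), and the function name groupDuplicates is made up for this example.

```cpp
#include <cstddef>
#include <map>
#include <numeric>
#include <utility>
#include <vector>

// Group locations that are connected by pairwise duplicate reports.
// Every index in `pairs` must be smaller than locationCount.
std::vector<std::vector<std::size_t>> groupDuplicates(
    std::size_t locationCount,
    const std::vector<std::pair<std::size_t, std::size_t>>& pairs) {
    std::vector<std::size_t> parent(locationCount);
    std::iota(parent.begin(), parent.end(), std::size_t{0});

    // Find the representative of x, shortening the path along the way.
    auto find = [&](std::size_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    };
    for (const auto& p : pairs) {
        const std::size_t rootA = find(p.first);
        const std::size_t rootB = find(p.second);
        parent[rootA] = rootB;
    }

    // Collect members per representative; indices are visited in ascending order.
    std::map<std::size_t, std::vector<std::size_t>> byRoot;
    for (std::size_t i = 0; i < locationCount; ++i) byRoot[find(i)].push_back(i);

    std::vector<std::vector<std::size_t>> groups;
    for (auto& entry : byRoot)
        if (entry.second.size() > 1) groups.push_back(std::move(entry.second));
    return groups;
}
```

Note that grouping only merges locations that are linked by some chain of reported pairs. If the tool reports 100/200 and 500/600 but never a pair connecting the two, they remain separate groups.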

Unique Duplicate LOC count

I was doing some testing with Duplo. There is a situation where my Total_lines_of_code is smaller than Duplicate_lines_of_code.
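A plausible explanation, stated here as an assumption rather than a confirmed diagnosis, is that a line belonging to several reported blocks is counted once per block. A unique count would record each (file, line) pair only once, as in the sketch below. The DuplicateBlock struct and the function uniqueDuplicateLines are hypothetical and do not mirror Duplo's internal types.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// One side of a reported duplicate: lineCount lines starting at startLine.
struct DuplicateBlock {
    std::string file;
    std::size_t startLine;
    std::size_t lineCount;
};

// Count every duplicated (file, line) pair once, even if it occurs in many blocks.
std::size_t uniqueDuplicateLines(const std::vector<DuplicateBlock>& blocks) {
    std::set<std::pair<std::string, std::size_t>> seen;
    for (const auto& b : blocks)
        for (std::size_t k = 0; k < b.lineCount; ++k)
            seen.emplace(b.file, b.startLine + k);
    return seen.size();
}
```

Counted this way, the number of duplicated lines can never exceed the total number of lines in the analyzed files.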

Duplo parameters configuration

Hi! Is there a way to change the configuration parameters (e.g. the minimal block size or the minimal number of blocks)? If so, how can it be done?

Checking arbitrary files doesn't work

It currently skips FILETYPE_UNKNOWN files. I'd like to use it on any arbitrary text file.

This patch seems to work:

diff --git a/SourceFile.cpp b/SourceFile.cpp
index 18b5a80..fb6082e 100755
--- a/SourceFile.cpp
+++ b/SourceFile.cpp
@@ -63,7 +63,8 @@ SourceFile::SourceFile(const std::string& fileName, const unsigned int minChars,
                 }
             }
         }
-        if (FileType::FILETYPE_VB == m_FileType) {
+        if (FileType::FILETYPE_VB == m_FileType ||
+            FileType::FILETYPE_UNKNOWN == m_FileType) {
             tmp = line;
         }
