
jhkst / equalff


Equal (duplicate) file finder - fast algorithm

License: GNU General Public License v3.0

Languages: C 93.29%, Roff 5.08%, Makefile 1.64%

equalff's Introduction

Equalff

Equalff is a high-speed duplicate file finder. It uses a distinctive comparison algorithm that makes it faster than other duplicate file finders for a single, one-off run.

Compilation

To compile the project, simply execute make in the project root.

Usage

Usage: ./equalff [OPTIONS] <DIRECTORY> [DIRECTORY]...
Find duplicate files in the given DIRECTORYs according to their content.

Mandatory arguments to long options are mandatory for short options too.
  -f, --same-fs             process files only on one filesystem
  -s, --follow-symlinks     follow symlinks when processing files
  -b, --max-buffer=SIZE     maximum memory buffer (in bytes) used for file comparison
  -o, --max-of=COUNT        force the maximum number of open files (default 20, 0 for unlimited)
  -m, --min-file-size=SIZE  check only files with size greater than or equal to SIZE (default 1)

Example:

$ equalff ~/

Result:

Looking for files ... 123 files found
Sorting ... done.
Starting fast comparison.
...
file1-location1
file1-location2
file1-location3

file2-location1
file2-location2
...
Total files being processed: 123
Total files being read: 10
Total equality clusters: 2

Algorithm

  • An equality cluster is a set of files that are identical at a certain comparison stage.
  1. Files are sorted by size in ascending order.
  2. Files of the same size are added to an equality cluster. (Clusters with only one file are skipped.)
  3. Files in an equality cluster are compared in byte-blocks, starting from the beginning of the file.
  4. Based on the comparison results, the equality cluster is divided into smaller clusters using a union-find (disjoint-set) structure with path compression (see the sketch after this list).
  5. The comparison process is repeated until the end of the files.
  6. Finally, the clusters are printed to stdout.
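
To illustrate steps 3 and 4, here is a minimal, self-contained C sketch; it is not the project's actual source. It assumes one fixed-size block has already been read from each file of a same-size cluster into blocks[], and it uses union-find with path compression to keep byte-identical files in the same sub-cluster. All names (BLOCK_SIZE, MAX_FILES, split_by_block, and so on) are made up for this example.

/* Illustrative sketch only, not equalff's source code. */
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define MAX_FILES  64

/* Union-find (disjoint-set) parents, one entry per file in the cluster. */
static int parent[MAX_FILES];

/* Find the representative of x, compressing the path as we go. */
static int find(int x)
{
    if (parent[x] != x)
        parent[x] = find(parent[x]);
    return parent[x];
}

/* Merge the sets containing a and b. */
static void join(int a, int b)
{
    parent[find(a)] = find(b);
}

/*
 * Split one equality cluster according to the block most recently read
 * from each file: afterwards, files whose current blocks are identical
 * share the same representative, i.e. stay in the same sub-cluster.
 */
static void split_by_block(unsigned char blocks[][BLOCK_SIZE], int n)
{
    for (int i = 0; i < n; i++)
        parent[i] = i;                      /* every file starts alone */

    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (find(i) != find(j) &&       /* skip already-merged pairs */
                memcmp(blocks[i], blocks[j], BLOCK_SIZE) == 0)
                join(i, j);
}

int main(void)
{
    /* Three fake "files": 0 and 2 carry the same block content. */
    static unsigned char blocks[3][BLOCK_SIZE];
    memset(blocks[0], 'A', BLOCK_SIZE);
    memset(blocks[1], 'B', BLOCK_SIZE);
    memset(blocks[2], 'A', BLOCK_SIZE);

    split_by_block(blocks, 3);

    for (int i = 0; i < 3; i++)
        printf("file %d -> sub-cluster %d\n", i, find(i));
    return 0;
}

In the real tool this splitting is repeated block by block until end of file (step 5), and files that never land in a multi-file cluster do not need to be read at all, which is why "Total files being read" (10) is much smaller than "Total files being processed" (123) in the example output above.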

Pros and Cons

Pros

  • Generally faster than other tools, as it reads only the necessary bytes (blocks).
  • Each file is read only once.
  • Does not use a hash algorithm, reducing processor computation.

Cons

  • Does not compute file content hash, so results cannot be reused.
  • Has some memory limitations, making it unsuitable for systems with limited memory.
  • The first comparison stage does not read the last bytes of the files, even though inequality is most likely there, which slightly slows down the process.
  • On some filesystems (such as FAT), it is not possible to open more than 16 files at once. This can be adjusted with the --max-of option (a sketch of working within such a limit follows this list).
  • Not tested with hardlinks and sparse files.
  • Not parallelized.
  • May be slower than other utilities on non-SSD disks due to file fragmentation; see issue #1 below.
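
The --max-of limit (default 20, and effectively 16 on FAT as noted above) means a single size cluster can contain more files than may be open at once, so block-wise comparison has to work within a descriptor budget. Below is a minimal, hypothetical C sketch of one way to do that: close a file when the budget is exhausted, remember its offset, and reopen and seek when it is needed again. It is not equalff's actual implementation; MAX_OPEN, struct slot, and slot_acquire are names invented for this illustration.

/* Hypothetical sketch only, not equalff's source code. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_OPEN 16   /* e.g. what --max-of=16 would allow on a FAT volume */

/* One entry per file in a cluster: its path, a saved read offset,
 * and a descriptor that is -1 while the file is temporarily closed. */
struct slot {
    const char *path;
    off_t       offset;
    int         fd;
};

/* Make sure files[idx] is open and positioned at its saved offset,
 * evicting another open descriptor first if the budget is exhausted. */
static int slot_acquire(struct slot *files, int count, int idx, int *open_now)
{
    if (files[idx].fd >= 0)
        return files[idx].fd;

    if (*open_now >= MAX_OPEN) {
        for (int i = 0; i < count; i++) {
            if (i != idx && files[i].fd >= 0) {
                /* Remember where we were, then give the descriptor back.
                 * A real tool would pick a smarter victim, e.g. LRU. */
                files[i].offset = lseek(files[i].fd, 0, SEEK_CUR);
                close(files[i].fd);
                files[i].fd = -1;
                (*open_now)--;
                break;
            }
        }
    }

    files[idx].fd = open(files[idx].path, O_RDONLY);
    if (files[idx].fd < 0) {
        perror(files[idx].path);
        return -1;
    }
    lseek(files[idx].fd, files[idx].offset, SEEK_SET);
    (*open_now)++;
    return files[idx].fd;
}

int main(int argc, char **argv)
{
    /* Toy driver: read one 4 KiB block from every file named on the
     * command line while never holding more than MAX_OPEN descriptors. */
    enum { BLOCK = 4096, MAX_FILES = 64 };
    char buf[BLOCK];
    struct slot files[MAX_FILES];
    int count = argc - 1, open_now = 0;

    if (count < 1 || count > MAX_FILES) {
        fprintf(stderr, "usage: %s FILE... (1-%d files)\n", argv[0], MAX_FILES);
        return 1;
    }
    for (int i = 0; i < count; i++)
        files[i] = (struct slot){ .path = argv[i + 1], .offset = 0, .fd = -1 };

    for (int i = 0; i < count; i++) {
        int fd = slot_acquire(files, count, i, &open_now);
        if (fd < 0)
            continue;
        ssize_t n = read(fd, buf, BLOCK);
        if (n < 0)
            n = 0;
        printf("%s: read %zd bytes\n", files[i].path, n);
    }
    return 0;
}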

Comparison

TBD

utility name   language   exact command   average time (*)
-------------  ---------  --------------  ----------------
fslint         Python
dupedit        Python
fdupes         C
finddup        Perl
rmdupe         shell
dupmerge       C
dupseek        Perl
fdf            C
freedups       Perl
freedup        C
liten          Python
liten2         Python
rdfind         C++
ua             C++
findrepe       Java
fdupe          Perl
ssdeep (**)    C/C++
duff           C
equalff        C

(*) Average time of multiple runs on one path
(**) Output may not be exact (to be checked)

Licensing

See the LICENSE file for licensing information.

Plans

Plans are in place to add selection and action features for found clusters, such as "remove last modified", "remove first modified", "keep files starting on path", etc.

equalff's People

Contributors

jhkst

Stargazers

Artifex Maximus

Watchers

Michal Bernhard

Forkers

nasuku

equalff's Issues

Algorithm is slower on more fragmented files on ext(4)

There are two files, FILE-A.dat and FILE-B.dat. They have identical content and are present in two directories:

test/FILE-A.dat
test/FILE-B.dat
test2/FILE-A.dat
test2/FILE-B.dat

They have different fragmentation and are located on a classical (non-SSD) disk:

$ filefrag test/* test2/*
test/FILE-A.dat: 3 extents found
test/FILE-B.dat: 9 extents found
test2/FILE-A.dat: 146 extents found
test2/FILE-B.dat: 116 extents found

For comparison, I first run cat, sha1sum, cmp, and equalff and measure the processing time (each command is run repeatedly until the time stabilizes):

$ time cat test/* > /dev/null
real    0m27.785s
user    0m0.060s
sys     0m3.484s
$ time sha1sum test/* > /dev/null
real    0m27.729s
user    0m14.076s
sys     0m1.632s
$ time cmp test/*
real    0m32.421s
user    0m5.200s
sys     0m3.248s
$ time equalff test/
...
real    0m32.000s
user    0m0.560s
sys     0m3.596s

$ time cat test2/* > /dev/null
real    0m28.099s
user    0m0.076s
sys     0m3.552s
$ time sha1sum test2/* > /dev/null
real    0m27.541s
user    0m13.864s
sys     0m1.644s
$ time cmp test2/*
real    1m51.357s
user    0m5.524s
sys     0m3.684s
$ time equalff test2/*
...
real    1m52.242s
user    0m0.636s
sys     0m4.372s

To be sure, I also check the number of read operations:

command           read calls (strace)   read block size   real time   read sys sec   read diff sec
cat test/*        20473                 131072            0m27.785s   0.759073       27.172596
cat test2/*       20473                 131072            0m28.099s   0.772224       27.197842
sha1sum test/*    81879                 32768             0m27.729s   0.022936       18.424288
sha1sum test2/*   81879                 32768             0m27.541s   0.025639       13.248825
cmp test/*        655005                4096              0m32.421s   0.067137       21.149980
cmp test2/*       655005                4096              1m51.357s   0.389145       100.477507
equalff test/     81881                 32768             0m32.000s   0.211431       29.616661
equalff test2/    81881                 32768             1m52.242s   0.718076       107.895899

read sys sec = strace -c
read diff sec = strace -c -w

So it seems there is a performance difference caused by the different read patterns these tools use on the same files.

  1. This needs to be explained.
  2. If possible, a solution should be found.
