a simple file deduplicator for a particular use case: you have files across two locally accessible locations with mostly the same hierarchical structure, say a `reference` directory and a `target` directory, and you want to remove all files within `target` that have identically-pathed and identically-hashed files already existing in `reference`.
Programs like `fdupes` do not care about hierarchical location, so they will flag e.g. `__init__.py` and other often-zero-byte files as dupes.
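
A minimal sketch of that path-aware rule in Go — not this tool's actual source; the SHA-256 choice, the directory names, and the helper names are assumptions for illustration:

```go
// Sketch only: a file in target counts as a duplicate only when the file
// at the SAME relative path in reference exists and has the same hash.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// hashFile returns the hex SHA-256 of a file's contents
// (the real tool's hash algorithm may differ).
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// scan walks root and maps relative path -> content hash.
func scan(root string) (map[string]string, error) {
	hashes := map[string]string{}
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		rel, err := filepath.Rel(root, p)
		if err != nil {
			return err
		}
		sum, err := hashFile(p)
		if err != nil {
			return err
		}
		hashes[rel] = sum
		return nil
	})
	return hashes, err
}

func main() {
	ref, err := scan("reference")
	if err != nil {
		panic(err)
	}
	tgt, err := scan("target")
	if err != nil {
		panic(err)
	}
	for rel, sum := range tgt {
		// Unlike fdupes, two identical zero-byte __init__.py files at
		// DIFFERENT relative paths are not flagged against each other.
		if refSum, ok := ref[rel]; ok && refSum == sum {
			fmt.Println("duplicate:", filepath.Join("target", rel))
		}
	}
}
```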
This program is mostly useful for verifying and clearing backups, when the target directory is expected to have a high degree of similarity with the reference directory but you want to be extra sure. In that case you can:
```console
# scan for files in the local directory and output their paths and hashes to a reference file on the remote
[local]$ deduplicator -refDir my/files | ssh user@remote 'cat > /tmp/ref.yml'
[local]$ ssh user@remote
[remote]$ deduplicator -refYaml /tmp/ref.yml -targetDir /backup/my/files -deleteFiles
```
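
The schema of `ref.yml` isn't shown above; a plausible shape, assuming a flat relpath-to-hash mapping (paths and digests purely illustrative):

```yaml
# hypothetical /tmp/ref.yml — the real layout may differ
docs/notes.txt: 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
src/__init__.py: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
```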
On the remote, the program will scan the target files' relpaths and hashes, compare them to the YAML file, ask for confirmation, and then delete the duplicate files. If you abort the deletion, it will instead print a "deletion plan": all the `rm` statements you can use to delete the dupes manually.
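
For instance, an aborted run against the hypothetical files above might print something like this (exact output format assumed):

```console
rm '/backup/my/files/docs/notes.txt'
rm '/backup/my/files/src/__init__.py'
```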