
License: GNU General Public License v3.0


kindfs

Index a filesystem into a database, then easily make queries, e.g. to find duplicate files/dirs, or mount the index with FUSE.

Files are scanned with their metadata (e.g. file permissions, size), filetype (from libmagic), and hashed contents (from xxhash). A pseudo-hash is then generated for each dir based on its contents (hashes of subfiles and pseudo-hashes of subdirs). Indexing should properly handle (and not fail on) parts of the filesystem that use a different encoding (e.g. an old backup).

It is assumed that two files or dirs are "likely identical" when they have the same size and the same hash (see the "DISCLAIMER" section below!).

When two dirs are identical, their parent dirs are often "almost identical", so it is useful to check whether every file on one side has a corresponding duplicate on the other. More generally, you may wish to progressively sort your data into a clean/ folder and check some other dir (let's call it mess/) to know, regardless of the subfile/subdir structure, which subfiles/subdirs are already present in clean/ and which are not. The isincluded command helps for that purpose.
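Conceptually, this inclusion check is a set operation over (size, hash) pairs in the index. A minimal sqlite3 sketch (the table layout and column names here are hypothetical, not kindfs's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT, dir TEXT, size INTEGER, hash TEXT)")
con.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", [
    ("clean/a.txt",  "clean", 5, "h1"),
    ("clean/b.txt",  "clean", 9, "h2"),
    ("mess/x/a.txt", "mess",  5, "h1"),   # duplicate of clean/a.txt
    ("mess/new.txt", "mess",  7, "h3"),   # has no counterpart in clean/
])

# Files under mess/ with no (size, hash) match anywhere under clean/,
# i.e. what an `isincluded mess clean -n` style query would report.
missing = con.execute("""
    SELECT path FROM files AS m
    WHERE m.dir = 'mess'
      AND NOT EXISTS (SELECT 1 FROM files AS c
                      WHERE c.dir = 'clean'
                        AND c.size = m.size AND c.hash = m.hash)
""").fetchall()
print(missing)  # [('mess/new.txt',)]
```

Because the match is on (size, hash) rather than on paths, the check is indeed independent of how files are arranged into subdirs on either side.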

Since a full index of the filesystem is stored in the DB, it may be useful for other purposes as well (e.g. detecting file corruption).

At the moment, the only DB supported is SQLite3 and the only OS supported is Linux (especially with regard to FUSE, although this might easily be ported to BSD and macOS).

Usage

A prerequisite is to have the following packages installed: pip install xxhash numpy fusepy python-magic

Then the CLI has the following form: kindfs <dbfile> <command> [args]

  • kindfs <dbfile> scan <path> [-R] [-z] [-C num] : creates the sqlite file dbfile and scans the filesystem under path/. For performance, it is strongly advised to have the DB on an SSD or in RAM. Of course, the user launching kindfs must have the proper access rights to scan the filesystem (an incomplete scan may occur otherwise). Adding -R performs a reset (if the DB file already exists); otherwise the program attempts to resume a scan that was previously started and aborted for any reason (useful for very large filesystems where scanning takes hours). If -C num is specified, hashes are computed on chunks of the num first/middle/last kilobytes of files (the default is 1 MB; shorter values make the scan faster but further analysis less reliable). If num is 0 or negative, the whole file is hashed: this makes the scan a lot slower (especially if there are many large files on the scanned filesystem), but results on duplicate detection will be more reliable (see "DISCLAIMER" below).
  • kindfs <dbfile> showdups [-l num] [-n] [-m <mountpoint>] [-k] : shows all duplicate entries (files or dirs), by decreasing occupied size (i.e. decreasing size*(number_duplicates-1)). If -n is specified, results are ordered by decreasing number of children (for dirs) instead of size. If -l num is specified, only the first num entries are returned. Adding -m <mountpoint> enables an additional check on the filesystem in case some files were deleted in the meantime.
  • kindfs <dbfile> isincluded <dir1> <dir2> [-m <mountpoint>] [-i] [-n] [-g <glob_ignore_file>] : checks which files from dir1 are present in dir2 (regardless of the file/dir structure). -i shows files in dir1 that are included in dir2, while -n shows files in dir1 that are not included in dir2 (at least one of -n or -i must be specified). Adding -m <mountpoint> enables an additional check on the filesystem in case some files were deleted in the meantime. If -g is specified, the glob_ignore_file argument specifies entries to be ignored in the output (similar to piping the output through grep -v -f <regexfile>, but easier to write and supposedly faster, since it is processed within sqlite and uses simple shell wildcards instead of regexes).
  • kindfs <dbfile> diff <dir1> <dir2> : equivalent to kindfs <dbfile> isincluded <dir1> <dir2> -n followed by kindfs <dbfile> isincluded <dir2> <dir1> -n, i.e. displays all files present on one side but not on the other (regardless of dir structure).
  • kindfs <dbfile1> comparedb <dbfile2> : displays all entries from dbfile1 that are not in dbfile2 (i.e. compares two versions of a DB, regardless of the file/dir structure). Useful to double-check that re-organizing files and deleting duplicate entries did not end up mistakenly deleting files that had no duplicates (e.g. in order to restore them from an older backup while it is still possible).
  • kindfs <dbfile> dump <dir> : dumps the contents of an indexed dir, with hashes and sizes in front of paths
  • kindfs <dbfile> inodes [-l num] [-m <mountpoint>] : same as showdups, but only returns entries with identical inodes (i.e. hard links, which do not "eat" space on the filesystem; this information can still be useful when reorganizing data, and hard links can be error-prone when you think you have independent files whereas they actually point to the same data).
  • kindfs <filename> computehash : computes the hash with the same algorithm used when scanning the filesystem. Useful for some double-checks.
  • kindfs <filename> check_zerofile : checks whether a file consists entirely of zeros (useful for some double-checks).
  • kindfs_fusemount <dbfile> <mountpoint> : mounts the DB under mountpoint/ using FUSE (read-only at the moment). Any tool, including diff -r and find, can then be used.
    • Notice that st_size maps to the pseudo-file contents generated by the FUSE FS (where each file is a text file containing the hash and length of the real indexed file in the DB), and differs from the size exposed through st_blocks (which maps to the real size of the file in the DB). This enables mc and du to report the real file size, while text editors can properly read and show the summarized contents.
    • As a reminder, on usual filesystems such as ext4, files can have identical contents/size/filename and yet a different size reported by du -s (but an identical size with du -sb; another way to see it is to compare with find -type f -printf "%8b %10s\t%p\n" and observe that the block counts differ even though the real sizes are identical). This is because du relies on st_blocks by default, which may differ between identical files if the underlying filesystem is fragmented, while du -b uses st_size, i.e. the real file size.
  • (more to come once it is better tested...)
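The st_size/st_blocks distinction described above can be observed directly with os.stat. A small sketch (the printed allocation depends on the filesystem's block size, so only st_size has a guaranteed value):

```python
import os
import tempfile

# st_size is the logical byte length of the file; st_blocks counts
# 512-byte blocks actually allocated, which is what plain `du` reports.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 1000)
os.close(fd)

st = os.stat(path)
print(st.st_size)          # 1000, what `du -b` would report
print(st.st_blocks * 512)  # allocated bytes, typically rounded up to
                           # the filesystem block size (e.g. 4096)
```

This is why kindfs's FUSE view deliberately puts the summarized text in st_size and the real indexed size in st_blocks: each field feeds the tools that read it.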

DISCLAIMER

This tool has improved a lot recently, yet it is still a work in progress. It already works for many of my own use cases, but I haven't assigned a version number so far, and will tag v1.0 when it is mature enough in my opinion (at the very least it should include some unit/non-regression tests).

You should read and understand the GPLv3 license (including its sections "Disclaimer of Warranty" and "Limitation of Liability"). In particular: although this tool should theoretically not write or modify anything on the filesystem (except the DB itself), it may contain bugs, return wrong information (e.g. on duplicate files or dirs), or behave in unintended ways. Use this software at your own risk; it is advised to double-check the results and never delete files/dirs unless you are sure of what you are doing!

In particular: like all hash algorithms, xxhash may lead to collisions, i.e. the same hash for different data, even though the probability is assumed to be low. In addition, since scanning speed was also considered an important feature, the default behavior is to compute the hash only from the first/middle/last 1 MB of a file, i.e. two different files of identical size may exhibit the same hash when they are identical on those 3×1 MB segments (this should rarely happen for most file types, but one counter-example is VM images, where it can happen frequently). This is why (among other things) you should always double-check when this program identifies and returns candidate duplicate files/dirs.

Since xxhash is used, the DB may also be useful to detect subsequent random file corruption; however, keep in mind that xxhash is not a cryptographic hash and is therefore not suitable for detecting file tampering (a proper algorithm such as SHA-3 ought to be used for that kind of purpose, although it is not supported yet in this tool; BLAKE3 is being considered for a future version of kindfs as a supposedly both secure and efficient alternative).

kindfs's People

Contributors

karteum

