iemejia / catho
A file catalog utility inspired by the awesome Robert Vasicek's Cathy project. Or my excuse to hack something that I really need.
License: GNU Lesser General Public License v3.0
Like 'git gc': it executes ANALYZE; and then VACUUM; in SQLite.
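A minimal sketch of what such a maintenance command could run, assuming the catalog is a plain SQLite file. The function name `optimize_catalog` and its `db_path` argument are illustrative and not taken from the project.

```python
import sqlite3


def optimize_catalog(db_path):
    """Refresh the query planner statistics, then rebuild the database file."""
    # isolation_level=None puts the connection in autocommit mode;
    # SQLite refuses to run VACUUM inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("ANALYZE")
        conn.execute("VACUUM")
    finally:
        conn.close()
```

VACUUM rewrites the whole file, so on a large catalog it can take a while and needs free disk space roughly equal to the database size.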
The import:
from utils import get_file_info
throws an error with Python 3.
--help -h
similar to os.path, but for catho paths.
including debug and verbose options.
a runtime service that automatically indexes modified files and upgrades the catalog.
for security reasons in case of use in cloud servers
-e --encrypt
we have to decide a schema
add -c to enable the completion of an unfinished indexing, or the update of a catalog in case of changes.
To add, or change values such as basepath
something like -set-meta key, value
When indexing really long volumes
If a file name includes the character ', an error is thrown when the query is built.
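One common way to avoid this class of error is to pass values as bound parameters instead of formatting them into the SQL string. The sketch below assumes a hypothetical `files (name, path)` table; the real schema and helper names may differ.

```python
import sqlite3


def insert_file(conn, name, path):
    # The ? placeholders let sqlite3 bind the values itself, so a name
    # such as "it's a file.txt" cannot break the statement.
    conn.execute("INSERT INTO files (name, path) VALUES (?, ?)", (name, path))


def find_by_name(conn, name):
    cur = conn.execute("SELECT name, path FROM files WHERE name = ?", (name,))
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, path TEXT)")  # hypothetical schema
insert_file(conn, "it's a file.txt", "docs")
print(find_by_name(conn, "it's a file.txt"))
```

Bound parameters also remove the SQL injection risk that string-built queries carry.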
Implement a Cathy (.cat) importer
$ catho/catho.py ls
Traceback (most recent call last):
File "catho/catho.py", line 334, in <module>
logger.info(catalogs_str())
File "catho/catho.py", line 226, in catalogs_str
date = str(datetime.fromtimestamp(timestamp))
TypeError: a float is required
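The traceback suggests that `timestamp` is not a number when it reaches `datetime.fromtimestamp`, for example because it is stored as text or is missing. That cause is an assumption. Under it, a defensive conversion could look like this (`format_timestamp` is a hypothetical helper):

```python
from datetime import datetime


def format_timestamp(timestamp):
    """Return a printable date, tolerating values stored as text or missing."""
    if timestamp is None:
        return "unknown"
    try:
        # float() accepts both numbers and numeric strings such as "1350000000.5".
        return str(datetime.fromtimestamp(float(timestamp)))
    except (TypeError, ValueError):
        return "unknown"
```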
We have to discuss what to support:
When saving catalog, save the full path in the metadata.
And only relative paths in the catalog table.
Currently it works for ., but it fails with the system auto completed paths (e.g. ~)
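A possible sketch of the path handling described above, using only `os.path`. The function names are illustrative: the base path is expanded and made absolute once, for the metadata, and every file is stored relative to it.

```python
import os


def normalize_base(user_path):
    """Expand ~ and make the path absolute, for storing in the metadata."""
    return os.path.abspath(os.path.expanduser(user_path))


def relative_to_base(base, fullpath):
    """Path to store in the catalog table, relative to the catalog base."""
    return os.path.relpath(fullpath, base)


base = normalize_base("~")
print(relative_to_base(base, os.path.join(base, "docs", "a.txt")))
```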
Based on the idea of:
https://github.com/seabre/finddupes/blob/master/finddupes.sh
SELECT hash, location FROM files WHERE hash NOT IN (SELECT hash FROM files GROUP BY hash HAVING ( COUNT(hash) = 1 ))
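To try the idea out, the query can be run against an in-memory table. The sketch assumes a `files` table with only the `hash` and `location` columns that appear in the query; the ORDER BY clause is an addition so that duplicates come out grouped together.

```python
import sqlite3

# Same query as above, with an ORDER BY added for readable output.
DUPLICATES_SQL = """
SELECT hash, location FROM files
WHERE hash NOT IN (SELECT hash FROM files GROUP BY hash HAVING COUNT(hash) = 1)
ORDER BY hash, location
"""


def find_duplicates(conn):
    return conn.execute(DUPLICATES_SQL).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (hash TEXT, location TEXT)")  # assumed schema
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("h1", "a/x.txt"), ("h1", "b/x.txt"), ("h2", "c/y.txt")],
)
for row in find_duplicates(conn):
    print(row)
```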
something similar to .cathoconfig
that allows the minimum possible configurations, e.g. string output format
Writing with the logger to stderr by default prevents the use of Unix tools like grep for filtering. For example, $ catho/catho.py find \* media | grep Animal
just prints all 133652 records from the test catalog.
There is a workaround, redirecting stderr to stdout, but it is not really straightforward.
$ catho/catho.py find \* media 2>&1 >/dev/null | grep Animal
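One conventional split, sketched here as a suggestion and not as the project's design, is to write query results to stdout and keep the logger on stderr for diagnostics only. The names `setup_logging` and `print_results` are hypothetical.

```python
import logging
import sys


def setup_logging(verbose=False):
    """Send diagnostics to stderr so that stdout carries results only."""
    logger = logging.getLogger("catho")
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    return logger


def print_results(rows, out=None):
    """Write one result per line to stdout (or to the given stream)."""
    if out is None:
        out = sys.stdout
    for row in rows:
        out.write("%s\n" % (row,))
```

With this split, a plain pipe such as `catho.py find \* media | grep Animal` would filter the results without any redirection.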
try to index e.g. /file/file and boom, the problem is in the path_block_iterator, in the way it builds the path names.
A function to detect repeated folders (if n hashes are equal per folder there is a possibility).
How to optimize ?
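As one possible starting point, folders can be grouped by the set of file hashes they contain, which takes a single pass over the records and avoids comparing every pair of folders. The sketch only detects folders whose hash sets are identical; partial overlaps ("n hashes are equal") would need a different comparison. The `(hash, path)` record shape is an assumption.

```python
import os
from collections import defaultdict


def find_duplicate_folders(records):
    """records: iterable of (hash, path) pairs. Returns groups of folders
    whose files have exactly the same set of hashes."""
    hashes_by_folder = defaultdict(set)
    for file_hash, path in records:
        hashes_by_folder[os.path.dirname(path)].add(file_hash)

    # A frozenset is hashable, so it can be used directly as a dictionary key.
    folders_by_content = defaultdict(list)
    for folder, hashes in hashes_by_folder.items():
        folders_by_content[frozenset(hashes)].append(folder)

    return sorted(sorted(group) for group in folders_by_content.values()
                  if len(group) > 1)
```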
Probably a file similar to .gitignore, because folders like .git or .svn can return many false positives.
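A rough sketch of such an ignore file, using shell-style patterns from the standard `fnmatch` module. The file name `.cathoignore`, the comment syntax, and the matching rule (a pattern matches any single path component) are all assumptions; real `.gitignore` semantics are richer.

```python
import fnmatch
import os


def load_ignore_patterns(lines):
    """Parse the lines of a hypothetical .cathoignore file."""
    patterns = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(path, patterns):
    """True if any component of the path matches any pattern."""
    parts = path.replace(os.sep, "/").split("/")
    return any(fnmatch.fnmatch(part, pattern)
               for part in parts for pattern in patterns)
```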
I found it trying to index a VM that's bigger than the memory in my local machine:
Python(15742) malloc: *** mmap(size=140645843243008) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
File "catho/catho.py", line 303, in <module>
for files in filesubsets:
File "catho/catho.py", line 77, in file_get_filelist
hash = file_hash(fullpath, hash_type)
File "catho/catho.py", line 50, in file_hash
sha1.update(f.read())
MemoryError
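The traceback points at `f.read()` loading the whole file into memory at once. A common alternative is to feed the hash in fixed-size blocks. The sketch below keeps the `file_hash(fullpath, hash_type)` signature seen in the traceback; the `block_size` parameter and the hex digest return value are assumptions.

```python
import hashlib


def file_hash(fullpath, hash_type="sha1", block_size=1024 * 1024):
    """Hash a file in fixed-size blocks so memory use stays bounded."""
    hasher = hashlib.new(hash_type)
    with open(fullpath, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for block in iter(lambda: f.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()
```

The digest is identical to hashing the whole content in one call, so existing catalogs would not need to be rebuilt.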
No such file or directory while parsing a symlink
$ ls -al ~/Dropbox/iPad/
lrwxr-xr-x 1 rgamez staff 38 18 feb 2012 test.pdf -> /Users/rgamez/Documents/test.pdf
$ ./catho.py add iPad ~/Dropbox/iPad
Creating catalog: iPad
An error occurred: [Errno 2] No such file or directory: '/Users/rgamez/Dropbox/iPad/test.pdf'
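If the link target does not exist (an assumption; the listing does not show whether it does), opening the file fails with Errno 2. One way to keep the indexing going is to skip dangling symlinks while walking the tree. `iter_indexable_files` is a hypothetical helper, not a function from the project.

```python
import os


def iter_indexable_files(top):
    """Yield file paths under top, skipping symlinks whose target is missing."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in sorted(filenames):
            fullpath = os.path.join(dirpath, name)
            # os.path.exists follows the link, so it is False for a dangling one.
            if os.path.islink(fullpath) and not os.path.exists(fullpath):
                continue
            yield fullpath
```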
Following discussion #7 (comment)
Remove the global catalogs variable and put it in an object
Create class for the DAO operations (maybe the same)
Optional: create option -f to force creation.
output of catho ls should be more consistent in columns, something like ls -l
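A small sketch of column alignment, assuming each catalog is available as a tuple of printable fields. The row layout in the example is invented for illustration.

```python
def format_rows(rows):
    """Pad every column to the width of its longest value."""
    if not rows:
        return []
    widths = [max(len(str(row[i])) for row in rows)
              for i in range(len(rows[0]))]
    return ["  ".join(str(cell).ljust(width)
                      for cell, width in zip(row, widths)).rstrip()
            for row in rows]


for line in format_rows([("Home", "133652"), ("iPad", "7")]):
    print(line)
```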
something to remove useless files (.DS_Store, *.url, Torrent downloaded from, etc)
with the ls command plus the name of the catalog, it should display metadata and eventually the contents.
It could be a textual or graphical gui
Tutorials:
If ~/.catho doesn't exist because catho init hasn't been executed, errors are displayed but the parsing of the directory continues.
$ ~/catho/catho/catho.py add Home ~
Creating catalog: Home
An error occurred: unable to open database file
An error occurred: unable to open database file
An error occurred: unable to open database file
$ ls ~/.catho
ls: cannot access ~/.catho: No such file or directory
Two options/suggestions @iemejia
catho init
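Whatever option is chosen, the check can happen once, before any file is read. The sketch below shows two plausible behaviours, which are guesses and not the suggestions from the original discussion: either report that the catalog directory is missing, or create it on demand. `ensure_catalog_dir` is a hypothetical helper.

```python
import os


def ensure_catalog_dir(catalog_dir, create=False):
    """Return True if catalog_dir is usable.

    With create=False the caller can stop early and tell the user to run
    'catho init'; with create=True the directory is made on demand.
    """
    if os.path.isdir(catalog_dir):
        return True
    if create:
        os.makedirs(catalog_dir)
        return True
    return False
```

A caller could then stop before indexing starts, for example by exiting with a message when `ensure_catalog_dir(os.path.expanduser("~/.catho"))` returns False.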