ravn-tech / hypertag

HyperTag - Intuitive Knowledge Management WebApp & CLI for Humans using Deep Learning & Tags

License: Other

Python 83.10% CSS 3.31% HTML 3.33% JavaScript 10.26%

tags tagging filesystem file organization semantic-similarity search-text pdf search search-engine

hypertag's Issues

Support relative file paths

This will make it possible to sync hypertag.db across different machines / devices, because relative file paths stay valid on each of them.

Save text tokens per document / page

Add new tables:

text_tokens: file_id, page_id, token_id
tokens: token_id, name
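
A minimal sketch of the two proposed tables in SQLite. Only the table and column names come from the issue; the column types, keys, the in-memory database and the helper `save_page_tokens` are assumptions made for illustration.

```python
import sqlite3


def create_token_tables(conn):
    """Create the proposed tables; column types and keys are assumptions."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS tokens (
            token_id INTEGER PRIMARY KEY,
            name     TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS text_tokens (
            file_id  INTEGER NOT NULL,
            page_id  INTEGER NOT NULL,
            token_id INTEGER NOT NULL REFERENCES tokens(token_id),
            PRIMARY KEY (file_id, page_id, token_id)
        );
        """
    )


def save_page_tokens(conn, file_id, page_id, words):
    """Store the distinct tokens of one page (hypothetical helper)."""
    for word in set(words):
        # Reuse the existing token row if the word is already known
        conn.execute("INSERT OR IGNORE INTO tokens (name) VALUES (?)", (word,))
        (token_id,) = conn.execute(
            "SELECT token_id FROM tokens WHERE name = ?", (word,)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO text_tokens VALUES (?, ?, ?)",
            (file_id, page_id, token_id),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_token_tables(conn)
save_page_tokens(conn, file_id=1, page_id=0, words=["deep", "learning", "deep"])
```

Keeping token names in their own table means each distinct word is stored once, and text_tokens only holds integer references.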

Add metatagging support to normal tagging using parent/children/baby syntax

Speed up semantic search using spatial index DS

Use a spatial index data structure (tree or hash based) -> https://github.com/nmslib/hnswlib/

Evaluate CLIP performance on text to text similarity

If CLIP performs as well as DistilBERT, there is no need for DistilBERT anymore.

Add option to merge two tags

$ hypertag merge A into B

Moves all file associations from A to B
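
The merge could be sketched as a single SQLite transaction. The tables `tags(tag_id, name)` and `tags_files(tag_id, file_id)` are a hypothetical schema, not necessarily the one in hypertag.db, and deleting tag A after the move is an assumption about what "merge" should mean.

```python
import sqlite3


def merge_tags(conn, source, target):
    """Move all file associations from tag `source` to tag `target`.

    Assumes hypothetical tables tags(tag_id, name) and
    tags_files(tag_id, file_id) with a unique (tag_id, file_id) pair.
    """
    src = conn.execute("SELECT tag_id FROM tags WHERE name = ?", (source,)).fetchone()
    dst = conn.execute("SELECT tag_id FROM tags WHERE name = ?", (target,)).fetchone()
    if src is None or dst is None:
        raise ValueError("both tags must exist")
    with conn:  # one transaction: commit on success, roll back on error
        # OR IGNORE skips files that already carry the target tag
        conn.execute(
            "UPDATE OR IGNORE tags_files SET tag_id = ? WHERE tag_id = ?",
            (dst[0], src[0]),
        )
        # Remove leftover rows (files that had both tags), then the source tag
        conn.execute("DELETE FROM tags_files WHERE tag_id = ?", (src[0],))
        conn.execute("DELETE FROM tags WHERE tag_id = ?", (src[0],))


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE tags_files (
        tag_id INTEGER, file_id INTEGER, PRIMARY KEY (tag_id, file_id)
    );
    INSERT INTO tags VALUES (1, 'A'), (2, 'B');
    INSERT INTO tags_files VALUES (1, 10), (1, 11), (2, 11);
    """
)
merge_tags(conn, "A", "B")
```

File 11 carried both tags in this example, so it ends up with a single association to B rather than a duplicate row.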

Add automatic file tagging by file type

Auto tag file with extension (type), e.g. JPG, PNG, TXT, PDF, PY, JS
Auto tag file with group, e.g. Image (JPG, PNG), Document (TXT, PDF), Source (PY, JS)
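
A small sketch of both rules, using the extensions and groups from the examples above. The function name `auto_tags` and the `GROUPS` mapping are illustrative, not part of the existing code.

```python
from pathlib import Path

# Group names and members follow the examples in the issue; extend as needed.
GROUPS = {
    "Image": {"JPG", "PNG"},
    "Document": {"TXT", "PDF"},
    "Source": {"PY", "JS"},
}


def auto_tags(path):
    """Return the extension tag, plus the group tag if the extension has one."""
    ext = Path(path).suffix.lstrip(".").upper()
    if not ext:
        return []  # no extension, nothing to infer
    tags = [ext]
    for group, members in GROUPS.items():
        if ext in members:
            tags.append(group)
    return tags
```

For example, `auto_tags("photo.jpg")` yields the tags JPG and Image, while a file with an unknown extension only gets its extension tag.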

Add CPU / GPU toggle option

Currently things stop working if no CUDA GPU is available. This is bad. Make CUDA optional (allow CPU only usage). Looks like CLIP does not work without CUDA...
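
One possible shape for the toggle, assuming the models are loaded through PyTorch: pick the device once and pass it on wherever a model is loaded. The function name `pick_device` and the `use_gpu` flag are hypothetical; `torch.cuda.is_available()` is the standard PyTorch check. Whether CLIP then actually runs on CPU remains the open question from above.

```python
def pick_device(use_gpu=True):
    """Return "cuda" only when requested and available, otherwise "cpu"."""
    if not use_gpu:
        return "cpu"
    try:
        import torch  # imported lazily so this sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device(use_gpu=False)
```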

Semantic search for text documents

Vectorize all text documents and let the user search them.

Related to #24 and #9

Just eyeballing: the GloVe model (average_word_embeddings_glove.6B.300d) seems to perform better than DistilBERT (stsb-distilbert-base). Add some small benchmark tests with common and diverse papers and queries.

Models:
https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0

Add test cases

Test basic functions that are unlikely to change behavior:

add file
import directory
add tag
add metatag
query
index (check text cleaning works for challenging file examples for pdf, html, etc.)

Add CLI option children

Print all children of a tag

Semantic search for images

Allow searching for image files using both text and images as queries

Extend daemon process to load the semantic search model and serve as single oracle

Needs fast and reliable IPC to work out

Add automatic file tag suggestions by file content using Machine Learning

When a new file is added, automatically infer tags from the tags of semantically similar existing files.

Depends on #24

Evaluate image to text search

Improve query UX using fuzzy word matching

Fuzzy String Matching: https://github.com/seatgeek/fuzzywuzzy

Related to #9
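
To illustrate the idea without extra dependencies, the sketch below uses `difflib` from the Python standard library instead of fuzzywuzzy. The function name `fuzzy_tag` and the cutoff of 0.6 are assumptions.

```python
from difflib import get_close_matches


def fuzzy_tag(query, known_tags, cutoff=0.6):
    """Return the closest known tag for a possibly misspelled query, or None."""
    # Compare case-insensitively but return the tag in its stored spelling
    lowered = {tag.lower(): tag for tag in known_tags}
    hits = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None
```

A query word that matches no known tag closely enough returns None, so the caller can fall back to the exact query.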

Identify file duplicates

Add hash and size columns to files table.
On add: compute hash and size -> Ignore duplicates.
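
A minimal sketch of the on-add check. SHA-256 is an assumed choice of hash, and the `seen` dict stands in for a lookup on the proposed hash and size columns of the files table.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (sha256 hex digest, size in bytes), reading the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


def add_file(seen, path):
    """Add path unless a file with the same hash and size is already known."""
    fingerprint = file_fingerprint(path)
    if fingerprint in seen:
        return False  # duplicate: ignore
    seen[fingerprint] = path
    return True


# Demo: two files with identical content, the second one is ignored
with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "a.txt")
    copy = os.path.join(folder, "b.txt")
    for name in (first, copy):
        with open(name, "wb") as handle:
            handle.write(b"same content")
    seen = {}
    results = [add_file(seen, first), add_file(seen, copy)]
```

Reading in chunks keeps memory use flat for large files such as videos.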

HyperTagFS: Let user create directories with names as queries

Use Case: User creates a directory named: animal minus human -> directory should contain all files associated with animals minus human files.

Depends on #10 & #18
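
The use case above can be sketched as set arithmetic over tag-to-files data. Only the "minus" operator from the example is handled; the function name `resolve_query` and the `files_by_tag` mapping are hypothetical stand-ins for a database lookup.

```python
def resolve_query(query, files_by_tag):
    """Evaluate a query like "animal minus human" against tag -> file-set data."""
    terms = [part.strip() for part in query.split(" minus ")]
    # Start with the files of the first tag, then subtract every later tag
    result = set(files_by_tag.get(terms[0], set()))
    for tag in terms[1:]:
        result -= files_by_tag.get(tag, set())
    return result


files_by_tag = {
    "animal": {"cat.jpg", "zoo.jpg"},
    "human": {"zoo.jpg", "me.jpg"},
}
matches = resolve_query("animal minus human", files_by_tag)
```

The directory named after the query would then be populated with the files in `matches`.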

Add migration (import) option for TMSU users

Add FS hooks to HyperTagFS dir to detect file moving / deleting (map into DB)

Watchdog looks like what we need: https://pythonhosted.org/watchdog/quickstart.html

Add option --auto to import

This will tell the daemon to automatically watch the imported directory for new files and renames.

Visualize the HyperTag graph

Candidates:

https://graph-tool.skewed.de/static/doc/quickstart.html
- Pro: Performance (fast -> C++ wrapper)
- Con: Size (big), no pip install (cuz C++)
https://github.com/networkx/networkx
- Pro: well tested
https://github.com/igraph/python-igraph
- Pro: Performance (fast -> C wrapper)
https://github.com/root-11/graph-theory
- Pro: Size (tiny)
- Con: Performance (slow?)

Add files automatically when tagging

Update HyperTagFS dir lazily

Right now the whole HyperTagFS directory gets rebuilt on every tag-changing operation. Instead, only make partial updates.

Add auto index option

Add new columns auto_index_images, auto_index_texts to auto imports table

Add indices to improve SQLite query performance

Speed up vectorization with batch processing

Add image search to HyperTagFS

Create a dedicated directory called "Search Images". The names of all directories created in "Search Images" are interpreted as search queries for image files, and each directory is populated with the matching results.

Semantic search for individual text document pages

Right now text documents are represented as a single average embedding of all their sentences. Increase granularity / signal by vectorizing individual pages.

Related to #25

Improve text search by matching tokens

Right now text search happens only in vector space and thus ignores exact query token matches, even though these are a high-value signal.

Depends on #32
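
One way to combine the two signals is a weighted blend of the vector-space score with the share of query tokens found verbatim in the document. The function name `combined_score`, the whitespace tokenization and the 0.5 weight are all assumptions, not a tested ranking formula.

```python
def combined_score(query, doc_tokens, vector_score, token_weight=0.5):
    """Blend a vector similarity with the share of query tokens found verbatim."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return vector_score
    # Fraction of query tokens that occur exactly in the document
    overlap = len(query_tokens & set(doc_tokens)) / len(query_tokens)
    return (1 - token_weight) * vector_score + token_weight * overlap
```

With `token_weight=0` this reduces to the current vector-only ranking, which makes the effect of exact matches easy to compare.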

Add text search to HyperTagFS

Create a dedicated directory called "Search Texts". The names of all directories created in "Search Texts" are interpreted as search queries for text documents, and each directory is populated with the matching results.

Evaluate textract for audio files

https://textract.readthedocs.io/en/stable/#currently-supporting

Add remove file/s function

Add semantic video search

First basic version: Partition video into e.g. 16 uniformly spaced (by time) sections and take a screenshot. Embed each screenshot and use average as video embedding.

Advanced: Partition the video with higher granularity and extract frames, e.g. every 5 seconds or a fixed high number (100+). Compute an embedding for every extracted frame. Compute the distances between consecutive frames in embedding space to infer semantically coherent video sections (runs of similar frames whose distance stays below a threshold). Embed each section as the average of its frames. The resulting list of section embeddings should be a pretty good representation of the video and comes with section start & end metadata.
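
Both versions can be sketched in plain Python once frame embeddings exist. Frame extraction and the embedding model are left out. Cosine distance, the 0.3 threshold, sampling at section midpoints and all function names are assumptions, and embeddings are assumed to be non-zero vectors.

```python
import math


def sample_times(duration, count=16):
    """Return `count` uniformly spaced timestamps (section midpoints) in seconds."""
    step = duration / count
    return [step * (i + 0.5) for i in range(count)]


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def split_sections(frames, threshold=0.3):
    """Group consecutive frames while neighbouring embeddings stay close.

    `frames` is a non-empty list of (timestamp, embedding) pairs ordered by time.
    Returns a list of (start, end, average_embedding) tuples.
    """
    sections, current = [], [frames[0]]
    for previous, frame in zip(frames, frames[1:]):
        if cosine_distance(previous[1], frame[1]) > threshold:
            sections.append(current)  # large jump: close the running section
            current = []
        current.append(frame)
    sections.append(current)
    return [
        (group[0][0], group[-1][0], mean_vector([emb for _, emb in group]))
        for group in sections
    ]


# Toy example: two similar frames, then a scene change, then two similar frames
frames = [(0, [1.0, 0.0]), (5, [1.0, 0.1]), (10, [0.0, 1.0]), (15, [0.1, 1.0])]
sections = split_sections(frames)
basic_embedding = mean_vector([emb for _, emb in frames])  # basic version
```

The basic version corresponds to `mean_vector` over frames taken at `sample_times(duration)`; the advanced version returns one averaged embedding per section together with its start and end time.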