gufi's People

Contributors

btslade, bws, calccrypto, dmanno, garygrider, jklim1015, johnbent, jti-lanl, nzrutman, piyushthange, prajwalchalla, sfpwhite


gufi's Issues

Man pages

Need man pages for all utilities.

gufi_find uses DEFAULT_CONFIG_PATH which isn't defined

With the current build, gufi_find looks for a config attribute called DEFAULT_CONFIG_PATH, which isn't defined. It looks like this should be DEFAULT_PATH, but I'm not sure. Either way, it breaks gufi_find, while gufi_ls works just fine.

Special characters in file/path names

Hello,

I didn't see any references to special characters in file names, so I thought it would be best to raise another issue (don't shoot the messenger, ha!).

I'm seeing messages related to opening a DB when there is a '%' in the file name (users gonna user; they probably don't know that's a reserved character in SQL):
[screenshot of the error messages]

I can always weed out that particular user's path if need be. At any rate, I figured the developers would want to know if they didn't already.

Thank you!
John DeSantis

RPM naming

Fix RPM naming to use the correct release, arch, etc.

totsubdirs in summary tables

I'm wondering whether it might be useful to include totsubdirs in the summary tables to complement totfiles and totlinks.

How to update the metadata?

Hi,

When I use GUFI, I always want to have the newest metadata. Do I need to execute gufi_dir2index whenever I want to check the newest metadata, or is there a way to update the metadata periodically without running the command manually?

gufi_stat?

Today we depend on --select in gufi_find to probe for more info in the DBs. This is fine for now, but some have suggested we may need a gufi_stat command that will get you the normal stat output, formatted accordingly.
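As a rough illustration of the formatting side only, here is a hedged sketch; a real gufi_stat would fill these fields from the metadata stored in the index rather than from a live stat() call, and the exact output layout is not decided by this issue.

#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Sketch: print a stat(1)-like block. The struct stat here merely stands
 * in for whatever fields the index stores for an entry. */
static void print_stat_like(const char *path, const struct stat *st) {
    char mtime[64];
    strftime(mtime, sizeof(mtime), "%Y-%m-%d %H:%M:%S %z",
             localtime(&st->st_mtime));
    printf("  File: %s\n", path);
    printf("  Size: %lld\tBlocks: %lld\n",
           (long long) st->st_size, (long long) st->st_blocks);
    printf("Access: (%04o)  Uid: %u  Gid: %u\n",
           (unsigned) (st->st_mode & 07777),
           (unsigned) st->st_uid, (unsigned) st->st_gid);
    printf("Modify: %s\n", mtime);
}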

Cleanup OSX regression test gufi_stat and verifytrace

So far we have found one issue related to gufi_stat: calling closedir on NULL crashes on OSX but not on Linux; moving closedir into the successful opendir block solves that issue. More issues exist (potentially related, e.g. passing in a non-existent directory).
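For reference, a minimal sketch of the fix pattern described above (illustrative code, not GUFI's actual source):

#include <dirent.h>
#include <stdio.h>

/* Sketch: only close the stream if opendir succeeded. closedir(NULL)
 * crashes on OSX, while the Linux behaviour happened to tolerate it. */
static int process_path(const char *path) {
    DIR *dir = opendir(path);
    if (!dir) {
        perror(path);
        return -1;          /* do not call closedir(NULL) here */
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* ... process entry ... */
    }

    closedir(dir);          /* closedir lives inside the successful-opendir block */
    return 0;
}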

Recursive gufi_ls does not work well with root databases

gufi_ls -R on the root directory breaks if the database file in the root directory was made by gufi_dir2index -z 0 /search /search since the recursive query joins on pinodes, and the /search directory will have an inode that is not the parent of any of the subdirectories.

Tree database files are not always opened in READONLY mode

When gufi_query is called with -O (per thread database output files) or -e 0 (aggregate), the database files are opened through ATTACH, not sqlite3_open_v2 and can be modified with queries, even though they should have been opened in READONLY mode.

$ src/gufi_query -O out -E "create table tree.ABC (name TEXT)" /search/
$ sqlite3 /search/db.db ".tables"
ABC
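For comparison, a hedged sketch of the alternative: opening such output/aggregate databases with sqlite3_open_v2 and SQLITE_OPEN_READONLY (function and variable names here are illustrative, not GUFI's code):

#include <sqlite3.h>
#include <stdio.h>

/* Sketch: open a database file strictly read-only. With
 * SQLITE_OPEN_READONLY, statements such as CREATE TABLE fail with
 * SQLITE_READONLY instead of modifying the index. */
static sqlite3 *open_db_readonly(const char *path) {
    sqlite3 *db = NULL;
    if (sqlite3_open_v2(path, &db, SQLITE_OPEN_READONLY, NULL) != SQLITE_OK) {
        fprintf(stderr, "Cannot open %s read-only: %s\n", path, sqlite3_errmsg(db));
        sqlite3_close(db);
        return NULL;
    }
    return db;
}

If ATTACH must be used, attaching through a file: URI with mode=ro (as the tree databases themselves are attached) would presumably also keep the attached database read-only.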

gufi_find -type d not handled correctly

As of now we aren't handling -type d at all. We can fix this by using the summary table for each directory, making sure to use that same table for --select and the other flags that follow -type d.
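A hedged sketch of what a summary-table-based -type d query could look like; the column name is an assumption rather than the exact GUFI schema:

#include <sqlite3.h>
#include <stdio.h>

/* Sketch: a directory is described by the row in its own summary table,
 * so "-type d" output could come from there instead of the entries table
 * (which holds files and links). */
static int print_name(void *unused, int count, char **values, char **names) {
    (void) unused; (void) count; (void) names;
    printf("%s\n", values[0] ? values[0] : "");
    return 0;
}

static int list_directory(sqlite3 *db) {
    /* "name" is an assumed column; adjust to the real summary schema */
    return sqlite3_exec(db, "SELECT name FROM summary;", print_name, NULL, NULL);
}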

Memory usage/leak

Hello,

First and foremost, thank you for this software! It works incredibly well, and we're seeing significant differences querying the index file system (multiple PBs) vs. using the traditional find command.

I've run into an issue, unfortunately, in terms of memory usage during a re-index process. Basically, the gufi_dir2index process consumes too much memory leading to system instability. The test system in question has 256 GB of RAM available and the re-index has caused many OOM events.

Are there any controls and/or configuration options I can set on the GUFI side in order to not consume all available memory, leading to OOM events? For what it's worth, there doesn't seem to be a correlation between the number of files and/or space consumed on the file system that we're indexing. I built GUFI from the latest commit at the time, which was a8d9328.

Thank you!
John DeSantis

Exclude .snapshot[s] directories eg GPFS

Feature request: when indexing systems that have publicly viewable .snapshot or similar paths, it complicates queries or causes confusion with files being counted multiple times, etc.

An option would be to just delete the .snapshot data after the scan, but that still means that GUFI indexed that entire path.

Thanks!
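A minimal sketch of one possible approach, assuming an exclusion check inside the directory walk; the hard-coded .snapshot/.snapshots names and helper functions are illustrative, since no such option exists in GUFI today per this request:

#include <dirent.h>
#include <string.h>

/* Sketch: skip snapshot directories while walking the tree so they are
 * never indexed. A real option would presumably accept user-supplied
 * names or patterns instead of this fixed list. */
static int is_excluded(const char *name) {
    return (strcmp(name, ".snapshot") == 0) ||
           (strcmp(name, ".snapshots") == 0);
}

static void walk(DIR *dir) {
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (is_excluded(entry->d_name)) {
            continue;   /* do not descend into or index this subtree */
        }
        /* ... index the entry and recurse into subdirectories ... */
    }
}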

FEATURE REQUEST: gufi_stats -c dirsize-log2-bins

The new commit which adds gufi_stats -c filesize-log2-bins is awesome. Thanks so much for adding that!

Would it be possible to add a similar gufi_stats -c dirsize-log2-bins please? Or let me know if you think that's something I might be able to do; it might be a good learning exercise for me. It would also be nice to get my name into the git commit history for GUFI!
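For what it's worth, the binning itself is just an integer log2. A hedged sketch of that piece follows; this is not how gufi_stats actually implements its bins, and what a directory's "size" means (sum of entry sizes, entry count, ...) is left to the feature's design.

#include <stdint.h>

/* Sketch: map a size to a log2 bin, so bin 0 holds size 0, bin 1 holds
 * size 1, bin 2 holds sizes in [2,4), bin 3 holds [4,8), and so on. */
static unsigned int log2_bin(uint64_t size) {
    unsigned int bin = 0;
    while (size > 0) {
        size >>= 1;
        bin++;
    }
    return bin;
}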

gufi_find -o rename

GNU find uses the -fprintf flag for its output-to-file functionality and uses -o as the OR flag.

Issue with gufi_ls

Special characters are not handled in the gufi_ls command, e.g. if my directory name contains a special character like (dir#1@$), then gufi_ls fails and does not give the expected output.
I also verified this by creating directories with special characters; gufi_ls fails only for directories that have '#' in their name.

Installation instructions are unclear

The repository says: "Documentation can be found at GUFI.docx and in the Supercomputing 2022 paper."

In looking to try out GUFI, I've reviewed the repository thoroughly alongside various papers and presentations.
I got sidetracked for quite a while trying to follow the installation instructions in the GUFI.docx file, which I realise now must be quite outdated.

After searching the repo on GitHub for references to make, I realised there were other references to cmake and in the end found the Quick Start instructions in README.md. I should have paid closer attention at the start; I guess I overlooked it.

It would help to include a reference to that Quick Start section here, and a note on this page that GUFI.docx is out of date.

gufi does not escape paths with special chars

If a directory tree has special characters, GUFI will throw warnings and skip those files,

e.g. # or ?

Cannot attach database as "tree": unable to open database: file:/tmp/GUFI/wgsa/cgillies/wonderland/home/cgillies/workspace-ggts-3.4.0.RELEASE/genevetter_backup/node_modules/bower/node_modules/inquirer/node_modules/cli-color/node_modules/es5-ext/test/reg-exp/#/replace/db.db?mode=ro

Error: no such table: entries: /tmp/GUFI/KStorageLite/2016_DropSeq_Run1445/SingleCellData/KPMP_DiseaseBiopsies/Repeats_NotUsed/S-1908-000945-B_?/Sample_1125-EO-2/SC_RNA_COUNTER_CS/SC_RNA_COUNTER/SC_RNA_ANALYZER/CHOOSE_DIMENSION_REDUCTION_OUTPUT/fork0/chnk0-udc60c05d5e/files/pca_csv/10_components/db.db: "       INSERT INTO sument select uid, name, size, atime,       CASE WHEN size > 100 * 1024 * 1024 THEN size ELSE 0 END AS oversize     FROM entries    WHERE type='f';"

The queries still run, but the logs are flooded with these messages and the data in those folders is skipped in the query.
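Since the databases are attached through file: URIs (visible in the first message above), one hedged mitigation would be to percent-encode URI-significant characters in the path before building the ATTACH string. The helper below is illustrative, not GUFI's actual code.

#include <stddef.h>

/* Sketch: percent-encode characters that are significant in a "file:"
 * URI ('%', '?', '#') so paths containing them can still be attached
 * with ?mode=ro. SQLite decodes %XX sequences in URI filenames. */
static void uri_escape_path(const char *in, char *out, size_t outlen) {
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;
    for (size_t i = 0; in[i] != '\0' && (o + 4) < outlen; i++) {
        const unsigned char c = (unsigned char) in[i];
        if (c == '%' || c == '?' || c == '#') {
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0F];
        } else {
            out[o++] = (char) c;
        }
    }
    out[o] = '\0';
}

The escaped path would then be substituted into the usual ATTACH 'file:<path>?mode=ro' statement.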

Longitudinal Study: Phase 2

Once we have #149 completed, we'll need to develop a process by which we capture a longitudinal snapshot from the GUFI tree. There is both an initial snapshot process and there is potentially an incremental snapshot process. For example, here is one proposal:

  1. Create an initial longitudinal snapshot by doing a roll-up of all the vrsummary tables into a single vrsummary table which includes a depth column and a pinode column.
    1.1. However, maybe this is not part of the longitudinal snap but is part of the GUFI tree build
  2. Create a copy of this single augmented rolled-up summary table (SARUST).
  3. At some future time, create a new copy of the SARUST.
  4. Potentially reduce storage costs by reducing one of the copies into an increment off the other (see the sketch after this list).
    4.1. This should yield a lot of savings, since there may be 1B entries in the table (1B directories in a GUFI tree) but only a very small number of them will have changed between longitudinal snapshots

So, in this issue, we can define how we capture the initial snapshot as well as how we create the increment.
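As a hedged illustration of step 4, the increment between two snapshot copies could be computed as a set difference; the table names (snap_t0, snap_t1, increment) and the assumption of identical schemas are placeholders for the sketch, not an agreed-upon design.

#include <sqlite3.h>

/* Sketch: given two longitudinal snapshot tables with identical schemas
 * (snap_t0, snap_t1), store only the rows that differ between them.
 * With ~1B directories but few changes, the increment should be small. */
static int build_increment(sqlite3 *db, char **errmsg) {
    const char *sql =
        "CREATE TABLE increment AS "
        "  SELECT * FROM snap_t1 "
        "  EXCEPT "
        "  SELECT * FROM snap_t0;";
    return sqlite3_exec(db, sql, NULL, NULL, errmsg);
}

Note that this captures new and changed rows but not deletions, which the incremental design would also need to account for.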

FEATURE REQUEST: Add totzero to the vrsummary table

The most recent commit adding filesize distro to gufi_stats is very cool! Thanks for adding that.

I also noticed that there are a few useful predefined bins in vrsummary, such as totltk, totmtk, totltm, totmtm, totmtg, totmtt. Would it be possible to add totzero as well, so we can easily see the count of zero-byte files? Also, maybe a bit more granularity in the bins? Or is the new gufi_stats -c filesize-log2-bins actually all that is needed here? Is it just as quick to use gufi_stats to get this as it is to query those fields in vrsummary?

Also, would it be possible to add zero-size files into the gufi_stats -c filesize-log2-bins output please?

googletest causes builds to fail

When using the current repo, building fails with:

/tmp/GUFI/contrib/deps/googletest.sh: line 21: cd: /tmp/GUFI/build/builds/googletest-main: No such file or directory
make[2]: *** [CMakeFiles/install_dependencies.dir/build.make:57: CMakeFiles/install_dependencies] Error 1
make[1]: *** [CMakeFiles/Makefile2:375: CMakeFiles/install_dependencies.dir/all] Error 2
make: *** [Makefile:141: all] Error 2

A quick fix for this is to delete contrib/deps/googletest.tar.gz; the googletest script then downloads it from the web, and a follow-up make works just fine.

Longitudinal Study: Phase 1

We want to enable a longitudinal study. This requires multiple phases. This issue is about the first phase. Subsequent issues will be created to describe subsequent phases so we can keep the conversations nicely organized.

We need to decide on what to capture in a longitudinal snapshot. One option would be merely to create a snapshot of the full GUFI tree but that's very large and includes a lot of information that wouldn't be necessary for a longitudinal study. Another option is to create a tool which copies a subset of the GUFI tree. It could either copy a subset of the GUFI tree files or it could do GUFI queries and save the output of those queries. That particular question is perhaps a subsequent phase.

Regardless of how we capture a longitudinal snapshot, we have also realized that we first want to add additional information to the GUFI tree. In particular, we want to capture, in the summary tables, histograms describing attributes of the entries within a directory. For example, how many files are of size [0,1), [1,2), [2,4), etc. To collaboratively decide on the histograms required, we are defining them in this spreadsheet:

https://docs.google.com/spreadsheets/d/17PSZxHLVj731bI9PKY3E1V-TzSAHJiIahioc2nujo1U/edit#gid=0

Is there any way to update existing index directory?

I'm using gufi_dir2index to create an index directory. I've done some file operations in a sub-directory of my input directory (i.e. the file system).
Now I want to update my existing index directory.

Is there any way to update an existing index directory? Please suggest.

Thank you!

Issue with gufi_find command

With the -size option, if we pass a negative number along with a size suffix (k, M, or G), it gives an error.

Example command which I tried:
gufi_find -size -10M

Handling multiple inputs to gufi_ls

Currently, multiple inputs to gufi_ls are processed one at a time. When the -S flag is used, the output for each input is sorted separately, but the expected output is all of the outputs sorted together as one set of data.

gufi_dir2trace race condition

gufi_dir2trace can race on stdout when no per-thread-output prefix is provided. worktofile needs to be changed to buffer in OutputBuffers before writing to the actual file, and processdir needs to be changed to write stanzas atomically.
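A hedged sketch of the buffering idea; OutputBuffers' real API may differ, and this only shows per-thread stanza buffering with a single locked write per stanza.

#include <pthread.h>
#include <stdio.h>

/* Sketch: each worker formats a complete stanza into a private buffer,
 * then writes it to the shared stream under a mutex, so stanzas from
 * different threads never interleave on stdout. */
static pthread_mutex_t output_mutex = PTHREAD_MUTEX_INITIALIZER;

static void write_stanza(FILE *out, const char *stanza) {
    pthread_mutex_lock(&output_mutex);
    fputs(stanza, out);
    fflush(out);
    pthread_mutex_unlock(&output_mutex);
}

static void emit_directory(FILE *out, const char *dir_line, const char *entry_lines) {
    char stanza[8192];
    /* build the whole stanza locally before touching the shared stream */
    snprintf(stanza, sizeof(stanza), "%s\n%s", dir_line, entry_lines);
    write_stanza(out, stanza);
}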

Issues with folders with spaces

We're having issues with folders with spaces in names:

eg

Project_RNA-Seq 5-AZA_2

Will create error messages like:

input-dir '/tmp/GUFI/umms-cspeers//Project_RNA-Seq' is not a directory
input-dir '/tmp/GUFI/umms-cspeers//5-AZA_2' is not a directory

The query being run is located: https://github.com/umich-arc/gufi-archive/blob/master/reports/dirsum.sh

I can confirm that the index does have the path stored correctly. The error appears to be in the query, but simpler queries like
https://github.com/umich-arc/gufi-archive/blob/master/reports/totals.sh
work just fine:

Using GUFI Index located in: /tmp/GUFI/umms-cspeers/
Reporting on data older than 180 days last accessed

count                     sizeGB  oldsize  percent
137661                    20173   5992     29%

gufi_ls updates

We've modified gufi_find quite a bit and gufi_ls has fallen behind some. Formatting definitely needs to be reworked to feel more like ls.
