gufi's People

Contributors

btslade, bws, calccrypto, dmanno, garygrider, jklim1015, johnbent, jti-lanl, nzrutman, piyushthange, prajwalchalla, sfpwhite


gufi's Issues

Man pages

Need man pages for all utilities.

gufi_find uses DEFAULT_CONFIG_PATH which isn't defined

With the current build, gufi_find looks for a config attribute called DEFAULT_CONFIG_PATH, which isn't defined. It looks like this should be DEFAULT_PATH, but I'm not sure. Either way, it breaks gufi_find, while gufi_ls works just fine.

Special characters in file/path names

Hello,

I didn't see any references to special characters in file names, so I thought it would be best to raise another issue (don't shoot the messenger, ha!).

I'm seeing messages related to opening a DB when there is a '%' in the file name (users gonna user; they probably don't know that's a reserved character in SQL):
[screenshot of the error messages]

I can always weed out that particular user's path if need be. At any rate, I figured the developers would want to know if they didn't already.

Thank you!
John DeSantis

RPM naming

Fix RPM naming to use the correct release, arch, etc.

totsubdirs in summary tables

I'm wondering whether it might be useful to include totsubdirs in the summary tables to complement totfiles and totlinks.

How to update the metadata?

Hi,

When I use GUFI, I always want to have the newest metadata. Do I need to execute gufi_dir2index whenever I want to check the newest metadata, or is there a way to update the metadata periodically without running the command manually?

gufi_stat?

Today we depend on --select in gufi_find to probe for more info in the DBs. This is fine for now, but some have suggested we may need a gufi_stat command that will get you the normal stat output, formatted accordingly.
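As a rough illustration of the formatting side only, here is a hedged sketch; a real gufi_stat would fill these fields from the metadata stored in the index rather than from a live stat() call, and the exact output layout is not decided by this issue.

#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Sketch: print a stat(1)-like block. The struct stat here merely stands
 * in for whatever fields the index stores for an entry. */
static void print_stat_like(const char *path, const struct stat *st) {
    char mtime[64];
    strftime(mtime, sizeof(mtime), "%Y-%m-%d %H:%M:%S %z",
             localtime(&st->st_mtime));
    printf("  File: %s\n", path);
    printf("  Size: %lld\tBlocks: %lld\n",
           (long long) st->st_size, (long long) st->st_blocks);
    printf("Access: (%04o)  Uid: %u  Gid: %u\n",
           (unsigned) (st->st_mode & 07777),
           (unsigned) st->st_uid, (unsigned) st->st_gid);
    printf("Modify: %s\n", mtime);
}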

Cleanup OSX regression test gufi_stat and verifytrace

So far we have found one issue related to gufi_stat: calling closedir on NULL crashes on OSX but not on Linux; moving closedir into the successful opendir block solves that issue. More issues exist (potentially related, e.g. passing in a non-existent directory).
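For reference, a minimal sketch of the fix pattern described above (illustrative code, not GUFI's actual source):

#include <dirent.h>
#include <stdio.h>

/* Sketch: only close the stream if opendir succeeded. closedir(NULL)
 * crashes on OSX, while the Linux behaviour happened to tolerate it. */
static int process_path(const char *path) {
    DIR *dir = opendir(path);
    if (!dir) {
        perror(path);
        return -1;          /* do not call closedir(NULL) here */
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* ... process entry ... */
    }

    closedir(dir);          /* closedir lives inside the successful-opendir block */
    return 0;
}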

Recursive gufi_ls does not work well with root databases

gufi_ls -R on the root directory breaks if the database file in the root directory was made by gufi_dir2index -z 0 /search /search since the recursive query joins on pinodes, and the /search directory will have an inode that is not the parent of any of the subdirectories.

Tree database files are not always opened in READONLY mode

When gufi_query is called with -O (per thread database output files) or -e 0 (aggregate), the database files are opened through ATTACH, not sqlite3_open_v2 and can be modified with queries, even though they should have been opened in READONLY mode.

$ src/gufi_query -O out -E "create table tree.ABC (name TEXT)" /search/
$ sqlite3 /search/db.db ".tables"
ABC
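For comparison, a hedged sketch of the alternative: opening such output/aggregate databases with sqlite3_open_v2 and SQLITE_OPEN_READONLY (function and variable names here are illustrative, not GUFI's code):

#include <sqlite3.h>
#include <stdio.h>

/* Sketch: open a database file strictly read-only. With
 * SQLITE_OPEN_READONLY, statements such as CREATE TABLE fail with
 * SQLITE_READONLY instead of modifying the index. */
static sqlite3 *open_db_readonly(const char *path) {
    sqlite3 *db = NULL;
    if (sqlite3_open_v2(path, &db, SQLITE_OPEN_READONLY, NULL) != SQLITE_OK) {
        fprintf(stderr, "Cannot open %s read-only: %s\n", path, sqlite3_errmsg(db));
        sqlite3_close(db);
        return NULL;
    }
    return db;
}

If ATTACH must be used, attaching through a file: URI with mode=ro (as the tree databases themselves are attached) would presumably also keep the attached database read-only.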

gufi_find -type d not handled correctly

As of now we aren't handling -type d at all. We can fix this by using the summary table for each directory, making sure to use that same table for --select and the other flags that follow -type d.
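A hedged sketch of what a summary-table-based -type d query could look like; the column name is an assumption rather than the exact GUFI schema:

#include <sqlite3.h>
#include <stdio.h>

/* Sketch: a directory is described by the row in its own summary table,
 * so "-type d" output could come from there instead of the entries table
 * (which holds files and links). */
static int print_name(void *unused, int count, char **values, char **names) {
    (void) unused; (void) count; (void) names;
    printf("%s\n", values[0] ? values[0] : "");
    return 0;
}

static int list_directory(sqlite3 *db) {
    /* "name" is an assumed column; adjust to the real summary schema */
    return sqlite3_exec(db, "SELECT name FROM summary;", print_name, NULL, NULL);
}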

Memory usage/leak

Hello,

First and foremost, thank you for this software! It works incredibly well, and we're seeing significant differences querying the index file system (multiple PBs) vs. using the traditional find command.

I've run into an issue, unfortunately, in terms of memory usage during a re-index process. Basically, the gufi_dir2index process consumes too much memory leading to system instability. The test system in question has 256 GB of RAM available and the re-index has caused many OOM events.

Are there any controls and/or configuration options I can set on the GUFI side in order to not consume all available memory, leading to OOM events? For what it's worth, there doesn't seem to be a correlation between the number of files and/or space consumed on the file system that we're indexing. I built GUFI from the latest commit at the time, which was a8d9328.

Thank you!
John DeSantis

Exclude .snapshot[s] directories eg GPFS

Feature request: when indexing systems that have publicly viewable .snapshot or similar paths, it complicates queries or causes confusion with files being counted multiple times, etc.

An option would be to just delete the .snapshot data after the scan, but that still means that GUFI indexed that entire path.

Thanks!
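A minimal sketch of one possible approach, assuming an exclusion check inside the directory walk; the hard-coded .snapshot/.snapshots names and helper functions are illustrative, since no such option exists in GUFI today per this request:

#include <dirent.h>
#include <string.h>

/* Sketch: skip snapshot directories while walking the tree so they are
 * never indexed. A real option would presumably accept user-supplied
 * names or patterns instead of this fixed list. */
static int is_excluded(const char *name) {
    return (strcmp(name, ".snapshot") == 0) ||
           (strcmp(name, ".snapshots") == 0);
}

static void walk(DIR *dir) {
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (is_excluded(entry->d_name)) {
            continue;   /* do not descend into or index this subtree */
        }
        /* ... index the entry and recurse into subdirectories ... */
    }
}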

FEATURE REQUEST: gufi_stats -c dirsize-log2-bins

The new commit which adds gufi_stats -c filesize-log2-bins is awesome. Thanks so much for adding that!

Would it be possible to add a similar gufi_stats -c dirsize-log2-bins please? Or let me know if you think that's something I might be able to do; it might be a good learning exercise for me. It would also be nice to get my name into the git commit history for GUFI!
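For what it's worth, the binning itself is just an integer log2. A hedged sketch of that piece follows; this is not how gufi_stats actually implements its bins, and what a directory's "size" means (sum of entry sizes, entry count, ...) is left to the feature's design.

#include <stdint.h>

/* Sketch: map a size to a log2 bin, so bin 0 holds size 0, bin 1 holds
 * size 1, bin 2 holds sizes in [2,4), bin 3 holds [4,8), and so on. */
static unsigned int log2_bin(uint64_t size) {
    unsigned int bin = 0;
    while (size > 0) {
        size >>= 1;
        bin++;
    }
    return bin;
}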

gufi_find -o rename

GNU find uses the -fprintf flag for its output-to-file functionality and uses -o as the OR flag.

Issue with gufi_ls

Special characters are not handled in the gufi_ls command, e.g. if my directory name contains a special character like (dir#1@$), then gufi_ls fails and does not give the expected output.
I also verified this by creating directories with special characters; gufi_ls fails only for directories that have '#' in their name.

Installation instructions are unclear

The repository says: "Documentation can be found at GUFI.docx and in the Supercomputing 2022 paper."

In looking to try out GUFI, I've reviewed the repository thoroughly alongside various papers and presentations.
I got sidetracked for quite a while trying to follow the installation instructions in the GUFI.docx file, which I realise now must be quite outdated.

After searching the repo on GitHub for references to make, I realised there were other references to cmake and in the end found the Quick Start instructions in README.md. I should have paid closer attention at the start; I guess I overlooked it.

It would help to include a reference to that Quick Start section here, and a note on this page that GUFI.docx is out of date.

gufi does not escape paths with special chars

If a directory tree has special characters, GUFI will throw warnings and skip those files,

e.g. # or ?

Cannot attach database as "tree": unable to open database: file:/tmp/GUFI/wgsa/cgillies/wonderland/home/cgillies/workspace-ggts-3.4.0.RELEASE/genevetter_backup/node_modules/bower/node_modules/inquirer/node_modules/cli-color/node_modules/es5-ext/test/reg-exp/#/replace/db.db?mode=ro

Error: no such table: entries: /tmp/GUFI/KStorageLite/2016_DropSeq_Run1445/SingleCellData/KPMP_DiseaseBiopsies/Repeats_NotUsed/S-1908-000945-B_?/Sample_1125-EO-2/SC_RNA_COUNTER_CS/SC_RNA_COUNTER/SC_RNA_ANALYZER/CHOOSE_DIMENSION_REDUCTION_OUTPUT/fork0/chnk0-udc60c05d5e/files/pca_csv/10_components/db.db: "       INSERT INTO sument select uid, name, size, atime,       CASE WHEN size > 100 * 1024 * 1024 THEN size ELSE 0 END AS oversize     FROM entries    WHERE type='f';"

The queries still run, but the logs are flooded with these messages and the data in those folders is skipped in the query.
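Since the databases are attached through file: URIs (visible in the first message above), one hedged mitigation would be to percent-encode URI-significant characters in the path before building the ATTACH string. The helper below is illustrative, not GUFI's actual code.

#include <stddef.h>

/* Sketch: percent-encode characters that are significant in a "file:"
 * URI ('%', '?', '#') so paths containing them can still be attached
 * with ?mode=ro. SQLite decodes %XX sequences in URI filenames. */
static void uri_escape_path(const char *in, char *out, size_t outlen) {
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;
    for (size_t i = 0; in[i] != '\0' && (o + 4) < outlen; i++) {
        const unsigned char c = (unsigned char) in[i];
        if (c == '%' || c == '?' || c == '#') {
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0F];
        } else {
            out[o++] = (char) c;
        }
    }
    out[o] = '\0';
}

The escaped path would then be substituted into the usual ATTACH 'file:<path>?mode=ro' statement.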

Longitudinal Study: Phase 2

Once we have #149 completed, we'll need to develop a process by which we capture a longitudinal snapshot from the GUFI tree. There is both an initial snapshot process and there is potentially an incremental snapshot process. For example, here is one proposal:

  1. Create an initial longitudinal snapshot by doing a roll-up of all the vrsummary tables into a single vrsummary table which includes a depth column and a pinode column.
    1.1. However, maybe this is not part of the longitudinal snap but is part of the GUFI tree build
  2. Create a copy of this single augmented rolled-up summary table (SARUST).
  3. At some future time, create a new copy of the SARUST.
  4. Potentially reduce storage costs by reducing one of the copies into an increment off the other (see the sketch after this list).
    4.1. This should yield a lot of savings, since there may be 1B entries in the table (1B directories in a GUFI tree) but only a very small number of them will have changed between longitudinal snapshots

So, in this issue, we can define how we capture the initial snapshot as well as how we create the increment.
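As a hedged illustration of step 4, the increment between two snapshot copies could be computed as a set difference; the table names (snap_t0, snap_t1, increment) and the assumption of identical schemas are placeholders for the sketch, not an agreed-upon design.

#include <sqlite3.h>

/* Sketch: given two longitudinal snapshot tables with identical schemas
 * (snap_t0, snap_t1), store only the rows that differ between them.
 * With ~1B directories but few changes, the increment should be small. */
static int build_increment(sqlite3 *db, char **errmsg) {
    const char *sql =
        "CREATE TABLE increment AS "
        "  SELECT * FROM snap_t1 "
        "  EXCEPT "
        "  SELECT * FROM snap_t0;";
    return sqlite3_exec(db, sql, NULL, NULL, errmsg);
}

Note that this captures new and changed rows but not deletions, which the incremental design would also need to account for.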

FEATURE REQUEST: Add totzero to the vrsummary table

The most recent commit adding filesize distro to gufi_stats is very cool! Thanks for adding that.

I also noticed that there are a few useful predefined bins in vrsummary, such as totltk, totmtk, totltm, totmtm, totmtg, totmtt. Would it be possible to add totzero as well, so we can easily see the count of zero-byte files? Also, maybe a bit more granularity in the bins? Or is the new gufi_stats -c filesize-log2-bins actually all that is needed here? Is it just as quick to use gufi_stats to get this as it is to query those fields in vrsummary?

Also, would it be possible to add zero-size files into the gufi_stats -c filesize-log2-bins output please?

googletest causes builds to fail

When using the current repo, building fails with:

/tmp/GUFI/contrib/deps/googletest.sh: line 21: cd: /tmp/GUFI/build/builds/googletest-main: No such file or directory
make[2]: *** [CMakeFiles/install_dependencies.dir/build.make:57: CMakeFiles/install_dependencies] Error 1
make[1]: *** [CMakeFiles/Makefile2:375: CMakeFiles/install_dependencies.dir/all] Error 2
make: *** [Makefile:141: all] Error 2

A quick fix for this is to delete contrib/deps/googletest.tar.gz; the googletest script then downloads it from the web, and a follow-up make works just fine.

Longitudinal Study: Phase 1

We want to enable a longitudinal study. This requires multiple phases. This issue is about the first phase. Subsequent issues will be created to describe subsequent phases so we can keep the conversations nicely organized.

We need to decide on what to capture in a longitudinal snapshot. One option would be merely to create a snapshot of the full GUFI tree but that's very large and includes a lot of information that wouldn't be necessary for a longitudinal study. Another option is to create a tool which copies a subset of the GUFI tree. It could either copy a subset of the GUFI tree files or it could do GUFI queries and save the output of those queries. That particular question is perhaps a subsequent phase.

Regardless of how we capture a longitudinal snapshot, we have also realized that we first want to add additional information to the GUFI tree. In particular, we want to capture, in the summary tables, histograms describing attributes of the entries within a directory. For example, how many files are of size [0,1), [1,2), [2,4), etc. To collaboratively decide on the histograms required, we are defining them in this spreadsheet:

https://docs.google.com/spreadsheets/d/17PSZxHLVj731bI9PKY3E1V-TzSAHJiIahioc2nujo1U/edit#gid=0

Is there any way to update existing index directory?

I'm using gufi_dir2index to create an index directory. I've done some file operations in a sub-directory of my input directory (i.e. the file system).
Now I want to update my existing index directory.

Is there any way to update an existing index directory? Please suggest.

Thank you!

Issue with gufi_find command

With the -size option, if we pass a negative number along with a size suffix (k, M, or G), it gives an error.

Example command which I tried:
gufi_find -size -10M

Handling multiple inputs to gufi_ls

Currently, multiple inputs to gufi_ls are processed one at a time. When the -S flag is used, the output for each input is sorted separately, but the expected output is all of the outputs sorted together as one set of data.

gufi_dir2trace race condition

gufi_dir2trace can race on stdout when no per-thread-output prefix is provided. worktofile needs to be changed to buffer in OutputBuffers before writing to the actual file, and processdir needs to be changed to write stanzas atomically.
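A hedged sketch of the buffering idea; OutputBuffers' real API may differ, and this only shows per-thread stanza buffering with a single locked write per stanza.

#include <pthread.h>
#include <stdio.h>

/* Sketch: each worker formats a complete stanza into a private buffer,
 * then writes it to the shared stream under a mutex, so stanzas from
 * different threads never interleave on stdout. */
static pthread_mutex_t output_mutex = PTHREAD_MUTEX_INITIALIZER;

static void write_stanza(FILE *out, const char *stanza) {
    pthread_mutex_lock(&output_mutex);
    fputs(stanza, out);
    fflush(out);
    pthread_mutex_unlock(&output_mutex);
}

static void emit_directory(FILE *out, const char *dir_line, const char *entry_lines) {
    char stanza[8192];
    /* build the whole stanza locally before touching the shared stream */
    snprintf(stanza, sizeof(stanza), "%s\n%s", dir_line, entry_lines);
    write_stanza(out, stanza);
}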

Issues with folders with spaces

We're having issues with folders with spaces in names:

eg

Project_RNA-Seq 5-AZA_2

Will create error messages like:

input-dir '/tmp/GUFI/umms-cspeers//Project_RNA-Seq' is not a directory
input-dir '/tmp/GUFI/umms-cspeers//5-AZA_2' is not a directory

The query being run is located: https://github.com/umich-arc/gufi-archive/blob/master/reports/dirsum.sh

I can confirm that the index does have the path stored correctly. The error appears to be in the query, but simpler queries like
https://github.com/umich-arc/gufi-archive/blob/master/reports/totals.sh
work just fine:

Using GUFI Index located in: /tmp/GUFI/umms-cspeers/
Reporting on data older than 180 days last accessed

count                     sizeGB  oldsize  percent
137661                    20173   5992     29%

gufi_ls updates

We've modified gufi_find quite a bit and gufi_ls has fallen behind some. Formatting definitely needs to be reworked to feel more like ls.
