tuw-geo / geopathfinder

Querying and searching data on the file system

License: MIT License

Python 100.00%

file-manager geodata python remote-sensing

geopathfinder's Issues

check logic in get_disk_usage() of SmartTree

check if files that belong to multiple SmartPath() in one SmartTree(), are counted just one time in the disk usage.
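A minimal sketch of the check described above, not the actual SmartTree code: `disk_usage` and the `paths_to_files` mapping are hypothetical names. It only illustrates collecting files in a set so that a file shared by several paths is counted once.

```python
import os


def disk_usage(paths_to_files):
    """Sum file sizes in bytes, counting every file exactly once.

    `paths_to_files` is a hypothetical mapping from a path name to an
    iterable of file paths; it is not geopathfinder's data structure.
    """
    unique_files = set()
    for files in paths_to_files.values():
        # Resolve to a canonical path so the same file reached through
        # different paths ends up as a single set entry.
        unique_files.update(os.path.realpath(f) for f in files)
    return sum(os.path.getsize(f) for f in unique_files)
```

A test for get_disk_usage() could register the same file under two paths and compare the result against the size of the unique files.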

Change versioning of the yeoda naming convention

The current yeoda naming convention is not aligned with our new versioning system. Please adapt yeoda_path so that it takes a single new input argument, version or data_version, instead of version and run_num. This argument should contain everything that is version-related, i.e. the software version plus the run number, and it needs to be set properly by the workflow or the user, not by geopathfinder. This also gives us more freedom to work with other data sets that have a different versioning scheme.

https://github.com/TUW-GEO/geopathfinder/blob/master/src/geopathfinder/naming_conventions/yeoda_naming.py#L260

Please also change the 'version_run_id' field to 'version' or 'data_version' everywhere.

Provide meta information about dimensions

Often when working with file collections from geopathfinder I want to have some meta information, such as the names of the dimensions and which ones are temporal or spatial dimensions, so I can load them into my datacube in a generic way.

Since it is during file loading we have all that information, this probably would be the right place to provide a filename class or something similar containing such meta information.

FileExistsError

Happens when simultaneous tasks attempt to create different files that should be in the same non-existent folder, e.g. divided tasks on an HPC. One task succeeds, while all others fail with FileExistsError when trying to create the directory.

  File "..../lib/python3.7/site-packages/geopathfinder/folder_naming.py", line 133, in make_dir
    os.makedirs(self.directory)
  File "..../lib/python3.7/os.py", line 223, in makedirs
    os.makedirs(self.directory)
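One possible way around this race, sketched as a standalone function rather than the actual make_dir method: `os.makedirs` accepts `exist_ok=True` since Python 3.2, so a task that loses the race no longer raises.

```python
import os


def make_dir(directory):
    """Create `directory` and its parents; do nothing if it already exists."""
    # exist_ok=True makes the call safe when another task creates the
    # same folder between our decision to create it and the actual mkdir.
    # It still raises FileExistsError if the path exists as a regular file.
    os.makedirs(directory, exist_ok=True)
```

This avoids a separate "check, then create" step, which is where parallel tasks collide.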

TestSgrtFilename() fails in Python 2.7

  File "C:\code\sgrt\geopathfinder\geopathfinder\sgrt_naming.py", line 60, in __init__
    super(SgrtFilename, self).__init__(fields, fields_def, ext='.tif')
TypeError: super() argument 1 must be type, not classobj
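In Python 2.7 this error usually means the class hierarchy is old-style, i.e. the base class does not derive from `object`. A minimal sketch of the pattern, where `BaseFilename` is a hypothetical stand-in for the real parent class:

```python
class BaseFilename(object):  # explicit `object` base: new-style class on Python 2.7
    def __init__(self, fields, fields_def, ext=None):
        self.fields = fields
        self.fields_def = fields_def
        self.ext = ext


class SgrtFilename(BaseFilename):
    def __init__(self, fields, fields_def):
        # super(Class, self) only works on new-style classes in Python 2.7
        super(SgrtFilename, self).__init__(fields, fields_def, ext='.tif')
```

The same code runs unchanged on Python 3, where every class is new-style.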

"---" as orbit number should be valid in sgrt file naming

Only integer values are currently valid for the relative orbit number. But for sgrt parameters, where no orbit number is defined in the filename ("---"), the function decode_rel_orbit() in sgrt_naming.py is failing because it tries to cast the string to int.
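A possible fix, sketched under the assumption that the decoder receives the raw field string; the real signature of decode_rel_orbit() in sgrt_naming.py may differ:

```python
def decode_rel_orbit(string):
    """Return the relative orbit number as int, or None for a placeholder.

    Assumption: an undefined orbit is encoded purely with '-' characters,
    e.g. "---".
    """
    if string.strip("-") == "":
        # Nothing but placeholder characters: no orbit number defined.
        return None
    return int(string)
```

Returning None keeps the "no orbit" case distinguishable from a real orbit number; a matching encoder would need to map None back to "---".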

remove .pytest_cache

Remove .pytest_cache and add it to .gitignore

windows installation issue - too long pathnames

Just ran into installation issues on windows due to very long file-names in the tests folder...

Removing the tests folder from the pypi sources should do the job
(and it would also significantly reduce the size of the library since tests are not required at runtime)

Collecting geopathfinder
  Downloading geopathfinder-0.1.4.tar.gz (1.1 MB)
     ---------------------------------------- 1.1/1.1 MB 9.6 MB/s eta 0:00:00

Pip subprocess error:
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'C:\\Users\\---\\AppData\\Local\\Temp\\7\\pip-install-inlh1qw6\\geopathfinder_91a16088922149668c97e906b5085386\\tests/test_data/Sentinel-1_CSAR/IWGRDH/preprocessed/datasets/resampled/A0202/EQUI7_EU500M/E006N006T6/plia/M20160831_163321--_PLIA-----_S1AIWGRDH1--A_175_A0201_EU500M_E006N006T6.tif'
HINT: This error might have occurred since this system does not have Windows Long Path support enabled. You can find information on how to enable this at https://pip.pypa.io/warnings/enable-long-paths

python-dateutil version clashes with pandas

python-dateutil version is fixed (python-dateutil==2.6.1) but pandas now requires python-dateutil>=2.7.3

Is it possible to relax the version to python-dateutil>=2.6.1 or upgrade it directly?

Enhance geopathfinder

Currently, I identify several points for improving the class logic and making the package more "pythonic".

Folder naming conventions/classes and file naming conventions/classes are completely decoupled. A framework uniting both classes would make sense.
Add magic methods like __str__ or __add__ (e.g. adding a path to a tree) and properties like n_paths, n_files, or disk_usage to better interact with an object. In particular, replace functions that print, e.g. print_file_register, with something like https://pypi.org/project/seedir/
I would prefer to have stacked function calls, which always return self, e.g. tree.filter(level, pattern='..').filter(level, pattern='..').prune(level) and not having all these "collect" functions.
Temporary creation of data frames should be prevented. It would be better to have a central data frame dealing with folders and files (see get_disk_usage or search_files_ts)
Building a tree is quite slow at the moment, because it uses os.walk and does not utilise parallelisation.
Refactor build_smarttree in general - a lot of list appends happen there, even after one knows the "dimensions" of paths and folders.
Regex patterns should be used as a general entry point for filtering folders or file names, not starting from a tuple of strings.
It's currently quite difficult to understand how to use geopathfinder in detail. More docs and Jupyter Notebooks should be added.
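To make the points about chained calls, regex filtering and properties more concrete, here is a rough sketch. `FileCollection` is a hypothetical class invented for illustration, not part of geopathfinder:

```python
import re


class FileCollection:
    """Hypothetical stand-in for a tree of file paths."""

    def __init__(self, files):
        self.files = list(files)

    def filter(self, pattern):
        """Keep files whose path matches the regex; return self for chaining."""
        regex = re.compile(pattern)
        self.files = [f for f in self.files if regex.search(f)]
        return self

    @property
    def n_files(self):
        """Number of files currently in the collection."""
        return len(self.files)

    def __str__(self):
        # One path per line instead of a dedicated print function.
        return "\n".join(self.files)
```

Usage would then look like `FileCollection(files).filter(r"EU500M").filter(r"sig0").n_files`. Note that this sketch filters in place; returning a new object from filter() would be the alternative design if the original tree must stay untouched.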

This should be the central issue for collecting and discussing improvements or new ideas, which can then be distributed to other issues later on. Please feel free to add your ideas and thoughts; consider this a brainstorming. If we come up with a specific set of tasks, we could also ask a student or a new employee to implement them.

And by the way: I did not find a package that already does similar things, so this might be a huge benefit for the community!

Fix license.txt

Update license.txt with correct names

better control length of SmartFilenamePart()

e.g. in src/geopathfinder/file_naming.py, line 267: length = end - start. This can get messy.

What is "compact"? Make this clearer!

Logfile directory level

In regard to the yeoda_path convention, the "logfiles" folder is at the same level as the "data_version" level at the moment. The advantage is that the level below "data_version" solely consists of sub-directories in a spatial context, e.g. different Equi7 continents.

However, in the context of job file logging under "logfiles", which are bound to a certain data version, we have the issue that they are hierarchically not connected with the different data versions anymore. This means if someone wants to move data produced with a specific version somewhere else, then it needs to be assured that the respective log files are also moved.

This issue could be solved by either:

Leaving as is - results in problem/inconsistency mentioned above.
Creating a new level below "data_version", e.g. consisting of "datasets" and "logfiles". This would thematically be the best separation, but introduces one more folder level.
Move the "logfiles" folder below "data_version", which would cause an inconsistency with the spatial context of this folder level.

paths to test files are too long for windows platforms

When trying to install, it fails because of this error:
unable to create file tests/test_data/Sentinel-1_CSAR/IWGRDH/preprocessed/datasets/resampled/A0202/EQUI7_EU500M/E006N006T6/sig0/qlooks/Q20160831_163321--_SIG0-----_S1AIWGRDH1VVA_175_A0201_EU500M_E006N006T6.tif

Please shorten this.

Subpath Crawler

In order to automate data updates, some of our data packages (gldas, ecmwf_models, ...) should contain functions to go through existing data structures and determine what data (start date, end date etc.) is already stored and what data is missing. E.g. in the gldas package there are functions for that https://github.com/TUW-GEO/gldas/blob/75ca48f620c1b64d7c6246f081aaa6924834b7ff/gldas/download.py#L43 and https://github.com/TUW-GEO/gldas/blob/75ca48f620c1b64d7c6246f081aaa6924834b7ff/gldas/download.py#L119

Without looking too much into this package now, is this something that could fit here? It would be nice if I don't have to add functions as the ones above to all our packages because that would mean a lot of duplicate code.
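To illustrate what such a shared crawler could return, here is a rough sketch. The function name, the YYYYMMDD filename stamp and the daily resolution are all assumptions made for the example; it is not taken from gldas or geopathfinder:

```python
import re
from datetime import datetime, timedelta

# Assumption: every filename carries a YYYYMMDD date stamp.
DATE_PATTERN = re.compile(r"(\d{8})")


def find_missing_days(filenames):
    """Return (start, end, missing_days) for daily files in `filenames`."""
    dates = set()
    for name in filenames:
        match = DATE_PATTERN.search(name)
        if match:
            dates.add(datetime.strptime(match.group(1), "%Y%m%d").date())
    if not dates:
        return None, None, []
    start, end = min(dates), max(dates)
    n_days = (end - start).days
    # Walk every day between the first and last date found on disk.
    expected = (start + timedelta(days=i) for i in range(n_days + 1))
    missing = [day for day in expected if day not in dates]
    return start, end, missing
```

The filenames could come from os.walk or from an existing file register; the date pattern and time step would have to be configurable per data package.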

file_num missing from eodr_naming

file_num is no longer a keyword in eodr_naming, which breaks functionality. Maybe it was deleted by mistake in the latest changes?
@claxn
