pytroll / pygac-fdr

License: GNU General Public License v3.0

Topics: avhrr gac noaa metop tiros satellite climate data record pygac
pygac-fdr's Introduction

pygac-fdr

Python package for creating a Fundamental Data Record (FDR) of AVHRR GAC data using pygac


Installation

To install the latest release:

pip install pygac-fdr

To install the latest development version:

pip install git+https://github.com/pytroll/pygac-fdr

Usage

To read and calibrate AVHRR GAC level 1b data, adapt the config template in etc/pygac-fdr.yaml, then run:

pygac-fdr-run --cfg=my_config.yaml /data/avhrr_gac/NSS.GHRR.M1.D20021.S0*

Results are written to the specified output directory in netCDF format. Afterwards, collect and complement the metadata of the generated netCDF files:

pygac-fdr-mda-collect --dbfile=test.sqlite3 /data/avhrr_gac/output/*

This might take some time, so the results are saved into a database. You can specify files from multiple platforms; the metadata are analyzed for each platform separately. With a large number of files you might hit the shell's limit on the length of the argument list ("Argument list too long"). In this case, use the following command to read the list of filenames from a file (one filename per line):

pygac-fdr-mda-collect --dbfile=test.sqlite3 @myfiles.txt
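
Such a list file can be generated in many ways; a minimal Python sketch, assuming the netCDF outputs live in the output directory used above:

# Minimal sketch: write one filename per line to sidestep the shell's
# argument-list limit. The path is the example output directory from above.
from glob import glob

with open("myfiles.txt", "w") as f:
    f.write("\n".join(sorted(glob("/data/avhrr_gac/output/*.nc"))))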

Finally, update the netCDF metadata in place:

pygac-fdr-mda-update --dbfile=test.sqlite3

Tips for AVHRR GAC FDR Users

Checking Global Quality Flag

The global quality flag can be checked from the command line as follows:

ncks -CH -v global_quality_flag -s "%d" myfile.nc

Cropping Overlap

Due to the data reception mechanism, consecutive AVHRR GAC files often partly contain the same information; this is what we call overlap. For example, some scanlines at the end of file A also occur at the beginning of file B. The overlap_free_start and overlap_free_end attributes in pygac-fdr output files indicate that overlap. There are two ways to remove it (see the sketch after the list):

  • Cut overlap with subsequent file: Select scanlines 0:overlap_free_end
  • Cut overlap with preceding file: Select scanlines overlap_free_start:-1
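
For example, a minimal xarray sketch of the first variant, assuming the attributes are stored as global netCDF attributes and the scanline dimension is named y (as in the metadata code further below):

# Minimal sketch: drop the overlap with the subsequent file.
import xarray as xr

with xr.open_dataset("myfile.nc") as ds:
    overlap_free_end = int(ds.attrs["overlap_free_end"])
    cropped = ds.isel(y=slice(0, overlap_free_end))  # scanlines 0:overlap_free_end
    cropped.to_netcdf("myfile_cropped.nc")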

If, in addition, users want to create daily composites, a file containing observations from two days has to be used twice: once with only the part before 00:00 UTC, and once with only the part after 00:00 UTC. Cropping overlap and day together is a little more complex, because the overlap might cover 00:00 UTC. That is why the pygac-fdr-crop utility is provided:

$ pygac-fdr-crop AVHRR-GAC_FDR_1C_N06_19810330T225108Z_19810331T003506Z_...nc --date 19810330
0 8260
$ pygac-fdr-crop AVHRR-GAC_FDR_1C_N06_19810330T225108Z_19810331T003506Z_...nc --date 19810331
8261 12472

The returned numbers are the start and end scanlines (0-based, both inclusive).
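
The reported range can then be applied with any netCDF tool; a hedged Python sketch, again assuming the scanline dimension is named y:

# Hedged sketch: keep only the scanlines reported for 1981-03-30.
import xarray as xr

start, end = 0, 8260  # output of the first pygac-fdr-crop call above
with xr.open_dataset("AVHRR-GAC_FDR_1C_N06_19810330T225108Z_19810331T003506Z_...nc") as ds:
    day_part = ds.isel(y=slice(start, end + 1))  # +1: isel stops are exclusive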

pygac-fdr's People

Contributors

dependabot[bot], djhoese, mraspaud, sfinkens


pygac-fdr's Issues

Improve overlap computation

Specify overlap_free_start/end scanlines so that the user can decide whether to cut the orbit at the beginning or at the end.

Add more metadata to nc files

  • sun_earth_distance_correction_factor to nc files
  • equator crossing longitude and UTC time
  • describe quality flags using flag values/meanings (see the sketch after this list)
  • include static correction factors (roll, pitch, yaw?)
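
For the flag-description item, a hedged sketch of the usual CF-style attributes, assuming ds is an xarray.Dataset holding a pygac-fdr output file; the flag values and meanings are hypothetical placeholders:

# Hedged sketch: CF-style flag description. Values and meanings below
# are hypothetical, not the actual pygac-fdr quality flags.
ds["qual_flags"].attrs.update({
    "flag_values": [0, 1, 2],
    "flag_meanings": "ok bad_scanline bad_navigation",
})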

argument list too long for pygac-fdr-mda-collect

Using pygac-fdr-mda-collect with a wildcard expression works fine if the total number of files is small, but in most cases (e.g. MetOp-A with 72887 netCDF files) I get an "Argument list too long" error from my shell.

A possible solution could be a --read argument to indicate that the filenames argument is a single text file containing all paths.

read_gac is broken with latest satpy version

The Scene.readers attribute became private (Scene._readers) in the latest version of satpy; therefore, the following line no longer works:

fname_info = scene.readers['avhrr_l1b_gaclac'].file_handlers['gac_lac_l1b'][0].filename_info

See: pytroll/satpy#1300

Maybe @mraspaud can help to define a clean way to access the filename info here.
If I understand the code correctly, the filename_info is generated here:
https://github.com/pytroll/satpy/blob/4aedf557e38dc155d640e67e35bdb1f5dbef55f3/satpy/readers/yaml_reader.py#L428-L429
Where it uses the trollsift.parser.parse function.
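
One possible clean approach, sketched here under assumptions, is to parse the filename with trollsift directly instead of reaching into Scene internals; the pattern below is a hypothetical illustration, not the actual pattern from the reader YAML:

# Hedged sketch: recover filename_info without touching Scene._readers.
import os
from trollsift import parse

# Hypothetical pattern for illustration only; the real one is defined in
# satpy's avhrr_l1b_gaclac reader YAML (linked above).
pattern = "NSS.GHRR.{platform_id:2s}.D{start_time:%y%j.S%H%M}.E{end_time:%H%M}.B{orbit_number:7d}.{station}"
fname_info = parse(pattern, os.path.basename(filename))  # filename: the GAC l1b path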

test

I like turtles

pygac-fdr-mda-collect should write checkpoints for restarts

Since the application runs for several hours, it could be useful to be able to restart from a checkpoint in case of an interruption.
Currently pygac-fdr-mda-collect collects all metadata in memory and at the end dumps this data into a database.
If it wrote into the database every N files, and checked whether a file is already listed in the database before processing it, the application could be restarted.
Another benefit would be a reduction in memory consumption, which currently grows to several GB.
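
A hedged sketch of that proposal; is_processed, collect_one and flush_to_db are hypothetical helper names, not existing pygac-fdr functions:

# Hedged sketch of checkpointed collection with hypothetical helpers.
CHECKPOINT_EVERY = 100  # hypothetical batch size

records = []
for i, filename in enumerate(filenames, start=1):
    if is_processed(dbfile, filename):  # skip files already in the database
        continue
    records.append(collect_one(filename))
    if i % CHECKPOINT_EVERY == 0:
        flush_to_db(dbfile, records)  # checkpoint: enables restarts ...
        records.clear()               # ... and bounds memory consumption
flush_to_db(dbfile, records)          # final flush for the remainder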

Unformatted log messages

When processing files with pygac-fdr-run, I see many log messages like

The following datasets were not created and may require resampling to be generated: DatasetID(name='3a', wavelength=(1.58, 1.61, 1.64), resolution=1050, polarization=None, calibration='reflectance', level=None, modifiers=()), DatasetID(name='3b', wavelength=(3.55, 3.74, 3.93), resolution=1050, polarization=None, calibration='brightness_temperature', level=None, modifiers=()), DatasetID(name='5', wavelength=(11.5, 12.0, 12.5), resolution=1050, polarization=None, calibration='brightness_temperature', level=None, modifiers=())

and

Could not load dataset 'DatasetID(name='5', wavelength=(11.5, 12.0, 12.5), resolution=1050, polarization=None, calibration='brightness_temperature', level=None, modifiers=())': Unknown dataset: 5

together with a lot of Traceback info like

Traceback (most recent call last):
  File "venv/lib/python3.7/site-packages/satpy/readers/yaml_reader.py", line 781, in _load_dataset_with_area
    ds = self._load_dataset_data(file_handlers, dsid, **kwargs)
  File "venv/lib/python3.7/site-packages/satpy/readers/yaml_reader.py", line 666, in _load_dataset_data
    proj = self._load_dataset(dsid, ds_info, file_handlers, **kwargs)
  File "venv/lib/python3.7/site-packages/satpy/readers/yaml_reader.py", line 642, in _load_dataset
    projectable = fh.get_dataset(dsid, ds_info)
  File "venv/lib/python3.7/site-packages/satpy/readers/avhrr_l1b_gaclac.py", line 164, in get_dataset
    raise ValueError('Unknown dataset: {}'.format(key.name))
ValueError: Unknown dataset: 3b

These log messages originate from satpy, but they are not formatted, meaning that they arrive without a severity, timestamp, or module name.

It would be great if these messages came with the same formatting as pygac_fdr messages, e.g.

[INFO: 2020-08-03 20:56:52 : pygac_fdr] Appending GAC header
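
One way this could be addressed (a sketch; the format string is inferred from the example line above, not taken from the pygac_fdr source): attach a handler with the same format to satpy's logger:

# Hedged sketch: route satpy's log records through the same format as
# pygac_fdr's own messages.
import logging

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "[%(levelname)s: %(asctime)s : %(name)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"))
logging.getLogger("satpy").addHandler(handler)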

Error during installation

I tried to install as described in the docs with the following steps:

git clone https://github.com/pytroll/pygac-fdr.git
cd pygac-fdr/
conda create --prefix ./venv python=3.7 anaconda
conda activate ./venv
pip install .

Unfortunately, I ended up with the following error:

Processing /DSNNAS/Repro_Temp/users/mattes/pygac_test/fdr/pygac-fdr
    ERROR: Command errored out with exit status 1:
     command: /DSNNAS/Repro_Temp/users/mattes/pygac_test/fdr/pygac-fdr/venv/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-gvzypuzb/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-gvzypuzb/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-req-build-gvzypuzb/pip-egg-info
         cwd: /tmp/pip-req-build-gvzypuzb/
    Complete output (19 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-gvzypuzb/setup.py", line 25, in <module>
        version = importlib.import_module('pygac_fdr.version').__version__
      File "/DSNNAS/Repro_Temp/users/mattes/pygac_test/fdr/pygac-fdr/venv/lib/python3.7/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
      File "<frozen importlib._bootstrap>", line 983, in _find_and_load
      File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
      File "<frozen importlib._bootstrap>", line 983, in _find_and_load
      File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 728, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/tmp/pip-req-build-gvzypuzb/pygac_fdr/__init__.py", line 19, in <module>
        from satpy.utils import get_logger
    ModuleNotFoundError: No module named 'satpy'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

It seems that setup.py cannot import satpy at the time it reads the package version.
I manually installed satpy first, re-tried, and it worked:

pip install satpy
pip install .
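
A common workaround for this class of problem (a sketch, not the project's actual fix) is to read __version__ textually in setup.py instead of importing pygac_fdr.version, which pulls in satpy:

# Hedged sketch for setup.py: extract __version__ without importing the
# package, so satpy is not needed at build time.
import re

with open("pygac_fdr/version.py") as f:
    version = re.search(
        r'__version__\s*=\s*["\']([^"\']+)["\']', f.read()).group(1)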

Add another level of verbosity to pygac-fdr-mda-collect

I'd like to see the debug messages from the logger to get an idea of the progress, but the lines

if args.verbose:
    import pandas as pd
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', 22)

cause too much output.

Maybe the 'count' action could be a good solution to control the verbosity level; see https://docs.python.org/3/library/argparse.html#action

parser.add_argument('--verbose', '-v', action='count', default=0)
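
The verbosity levels could then be separated, for example (a sketch; the exact mapping is an assumption):

# Hedged sketch: -v enables debug logging, -vv additionally enables the
# wide pandas output that currently floods the terminal.
import logging

logging.getLogger("pygac_fdr").setLevel(
    logging.DEBUG if args.verbose >= 1 else logging.INFO)
if args.verbose >= 2:
    import pandas as pd
    pd.set_option("display.max_rows", None)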

Attribute Finetuning

Possible improvements:

  1. Variable qual_flags – are these, or could these be, bitmapped flags? If so, one could try to specify them a bit more, but the creator clearly knows this from some later flag_meanings. May not be possible as they're 2D though, which is odd.
  2. Acquisition time has a fill value of NaN; this seems like a good idea with floats, but it very often leads to horrible problems because the NaN propagates through operations (especially array ones) and eats up a computation from the inside. I know this because I did it to myself and suffered for years from it. It's better to use the default float fill values (in Python, see netCDF4.default_fillvals, or netCDF4._default_fillvals in older versions). However, given that it's probably unlikely that you have a scanline without an acquisition time, you could instead create the variable without a fill value and avoid the whole mess.
  3. add_offset and scale_factor should both have the same type (the intended type of the variable after descaling); often one is a float and the other a long here. (A sketch addressing points 2 and 3 follows after these lists.)

Trivial attribute tweaks:

  1. Acq_time can optionally have attributes standard_name = "time" and axis = "T"
  2. Latitude and longitude can optionally have attributes axis = "Y" or "X", respectively
  3. The basics of the http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery_1-3 are there but a bit incomplete
    • There's a time_coverage_start global attribute, but it looks wrong. There's also no time_coverage_end. I would either have both (correct) or delete it.
    • For institution, I'd have just "EUMETSAT" and additionally use some of {creator,publisher}_{name,email,url} to point to the contact details
    • product_version = "1.0.0" might be right, but I thought this was v4?

To consider:
  4. (M.’s standard complaint that I hate packed variables and prefer nice clean floats even if it costs more disk... but you can ignore this)
  5. The chunk sizes are a bit weird; often these mean there’s maybe 4 chunks per variable, which hardly seems worth it. Maybe someone did testing to determine these, but I’d feel inclined to make them a bit smaller if not.
  6. It’s nice having the GAC header, but I wouldn’t know what to do with it. Maybe worth a group-global comment pointing to some reference?
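
For points 2 and 3 of the possible improvements above, a hedged xarray sketch; ds stands for the dataset being written, and the variable names and dtypes are assumptions:

# Hedged sketch: write acq_time without any _FillValue (point 2) and give
# scale_factor/add_offset the same dtype, i.e. the unpacked type (point 3).
import numpy as np

ds["acq_time"].encoding["_FillValue"] = None  # no fill value at all
ds["latitude"].encoding.update({
    "dtype": "int16",                  # packed on disk
    "scale_factor": np.float32(0.01),  # same type as add_offset
    "add_offset": np.float32(0.0),
})
ds.to_netcdf("out.nc")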

parallelize metadata collection loop

When collecting the metadata, the following loop consumes most of the processing time and would benefit from parallelisation.

def _collect_metadata(self, filenames):
    """Collect metadata from the given level 1c files."""
    records = []
    for filename in filenames:
        LOG.debug('Collecting metadata from {}'.format(filename))
        with xr.open_dataset(filename) as ds:
            midnight_line = np.float64(self._get_midnight_line(ds['acq_time']))
            eq_cross_lons, eq_cross_times = self._get_equator_crossings(ds)
            rec = {'platform': ds.attrs['platform'].split('>')[-1].strip(),
                   'start_time': ds['acq_time'].values[0],
                   'end_time': ds['acq_time'].values[-1],
                   'along_track': ds.dims['y'],
                   'filename': filename,
                   'orbit_number_start': ds.attrs['orbit_number_start'],
                   'orbit_number_end': ds.attrs['orbit_number_end'],
                   'equator_crossing_longitude_1': eq_cross_lons[0],
                   'equator_crossing_time_1': eq_cross_times[0],
                   'equator_crossing_longitude_2': eq_cross_lons[1],
                   'equator_crossing_time_2': eq_cross_times[1],
                   'midnight_line': midnight_line,
                   'overlap_free_start': np.nan,
                   'overlap_free_end': np.nan,
                   'global_quality_flag': QualityFlags.OK}
            records.append(rec)
    return records

In combination with #31, this could be implemented using:

import multiprocessing

[...]
with multiprocessing.Pool(n_worker) as pool:
    for metadata in pool.imap_unordered(extract_metadata, filenames):
        session.add(metadata)
        [...]
