spaceml-org / ml4floods

An ecosystem of data, models and code pipelines to tackle flooding with ML

Home Page: https://spaceml-org.github.io/ml4floods/

License: GNU Lesser General Public License v3.0

Languages: Makefile 0.01%, Python 0.26%, Jupyter Notebook 99.73%, CSS 0.01%, HTML 0.01%

ml4floods's People

Contributors

crpurcell, gonzmg88, jejjohnson, kgupta359, kipoju, lkruitwagen, margaretmz, nadia-eecs, r-strange, rothn, sambuddinc, satyarth934, tommylees112


ml4floods's Issues

Ground truth data not downloaded correctly from GCP

def process_filename_train_test(train_test_split_file:Optional[str]="gs://ml4cc_data_lake/2_PROD/2_Mart/worldfloods_v1_0/train_test_split.json",

There's a bug in the function process_filename_train_test()

# check correspondence input output files (assert files exists)
for idx, filename in enumerate(filenames_train_test[isplit][input_folder]):
    fs = get_filesystem(filename)
    assert fs.exists(filename), f"File input: {filename} does not exists"
    filename_target = filenames_train_test[isplit][target_folder][idx]
    assert fs.exists(filename_target), f"File target: {filename_target} does not exists"
    # Download if needed and replace filenames_train_test with the downloaded version
    if filename.startswith("gs://") and download[isplit]:
        assert (path_to_splits is not None) and os.path.exists(path_to_splits), \
            f"path_to_splits {path_to_splits} doesn't exists or not provided ad requested to download the data"
        for input_target_folder in [input_folder, target_folder]:
            folder_local = os.path.join(path_to_splits, isplit, input_target_folder)
            os.makedirs(folder_local, exist_ok=True)
            basename = os.path.basename(filename)
            file_dest = os.path.join(folder_local, basename)
            if not os.path.isfile(file_dest):
                fs.get_file(filename, file_dest)
                print(f"Downloaded ({idx}/{len(filenames_train_test[isplit][input_folder])}) {filename}")
            filenames_train_test[isplit][input_folder][idx] = file_dest

Line 162: the same filename (i.e., the input image) is downloaded to both input_folder and target_folder. As a result, ground truth masks are not downloaded correctly
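
A minimal sketch of a possible fix, reusing the variables from the snippet above: pair each destination folder with the matching remote file, instead of reusing the input filename for both.

# Possible fix (sketch): download the right remote file for each destination folder.
for input_target_folder, filename_remote in [(input_folder, filename),
                                             (target_folder, filename_target)]:
    folder_local = os.path.join(path_to_splits, isplit, input_target_folder)
    os.makedirs(folder_local, exist_ok=True)
    file_dest = os.path.join(folder_local, os.path.basename(filename_remote))
    if not os.path.isfile(file_dest):
        fs.get_file(filename_remote, file_dest)
    filenames_train_test[isplit][input_target_folder][idx] = file_dest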

[Models] Consider prediction at different pyramid levels

Two options to choose from:

  1. Batch ingested predictions: save predictions as, e.g., COG GeoTIFF, so that we'll have predictions at all levels of the pyramid (a hedged write sketch follows this list).
  2. Live inference: run inference inside the live visualization server, using the image at the currently queried pyramid level as input.
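
A hedged sketch of option 1, assuming a single-band class prediction and georeferencing borrowed from the source image; the paths are examples, and writing with driver="COG" requires GDAL >= 3.1.

import numpy as np
import rasterio

# Borrow georeferencing from the source Sentinel-2 GeoTIFF (example path).
with rasterio.open("S2_scene.tif") as src:
    crs, transform = src.crs, src.transform
    preds = np.zeros((src.height, src.width), dtype="uint8")  # stand-in for model output

# The COG driver tiles the raster and builds the overview pyramid on write,
# so predictions are available at all pyramid levels.
with rasterio.open("prediction_cog.tif", "w", driver="COG",
                   height=preds.shape[0], width=preds.shape[1], count=1,
                   dtype="uint8", crs=crs, transform=transform,
                   compress="deflate") as dst:
    dst.write(preds, 1)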

Misclassification of Cloud Shadows as Water Pixels in Sentinel-2 Scene Inference

Issue Summary:
While running inference on a Sentinel-2 scene of flash flooding in Dubai, using the ml4floods-1.0.1 library and a pre-trained ml4floods model, cloud shadow pixels are misclassified as water. This misclassification is not observed in other scenes from the test images, such as the MSR264_18MIANDRIVAZODETAIL_DEL_v2 flood map from the test set of the WorldFloodsv2 dataset.

Steps to Reproduce:

  1. Load the pre-trained Unet multioutput S2-to-L8 model from the ml4floods Hugging Face repository (a hedged loading sketch follows this list).
  2. Run inference on a Sentinel-2 scene of flash flooding in Dubai (S2A_MSIL1C_20240417T064631_N0510_R020_T40RCN_20240417T091941).
  3. Observe that cloud shadow pixels are classified as water.
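
A minimal, heavily hedged sketch of steps 1-2. The repo id, checkpoint filename, band count, and class encoding below are placeholders, not the library's confirmed layout; only hf_hub_download and standard torch calls are taken as given.

import torch
from huggingface_hub import hf_hub_download

# Placeholder repo id / filename -- substitute the actual ml4floods HF artifacts.
ckpt_path = hf_hub_download(repo_id="<ml4floods-hf-repo>",
                            filename="WF2_unetv2_bgriswirs/model.pt")
model = torch.load(ckpt_path, map_location="cpu")  # assumes a whole pickled model
model.eval()

x = torch.rand(1, 6, 512, 512)  # stand-in for the bgriswirs bands of the Dubai scene
with torch.no_grad():
    pred = model(x).argmax(dim=1)  # assumed: argmax over per-pixel class logits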

Expected Behavior:
The model should accurately distinguish between water pixels and cloud shadows, ensuring that only actual water bodies/inundated regions are classified as such.

Actual Behavior:
Cloud shadow pixels are misclassified as water, leading to inaccurate flood maps.

Additional Information:
Trained model used: Unet multioutput S2-to-L8 in folder models/WF2_unetv2_bgriswirs.
Example image: Sen2_dubai_flood_annotated

[DataPrep] PyTorch torchvision transforms for datamodule

Currently, we're using the albumentations library. I would suggest we change this to torchvision for consistency. It should require minimal API changes, but we need to make sure behaviour stays the same.


Demo w. pytorch

Note: rasterio also uses CxHxW so we don't need to permute the channels in this case.

Source: PyTorch Docs

Example:

from torchvision import transforms

# Stacked Transforms
# Note: PermuteChannels, Rescale and RandomCrop are custom transform classes
# (cf. the PyTorch data-loading tutorial), not torchvision built-ins.
transform_permute = PermuteChannels()
transform_toTensor = transforms.ToTensor()

scale = Rescale(256)
crop = RandomCrop(128)
mega_transform = transforms.Compose([Rescale(256),
                                     RandomCrop(224)])

pt_ds = WorldFloodsDataset(image_files, image_prefix, gt_prefix, transforms=mega_transform)

Demo w. albumentations

Very similar to the torchvision version, except albumentations works on HxWxC arrays instead of CxHxW tensors, so we need a dedicated PermuteChannels() class to get the channels into the right order.

Note: rasterio reads CxHxW, so we do need to permute the channels in this case.

Source: Our Notebook

Example:

# Stacked Transforms (the transform_* instances below are defined earlier in the notebook)
transform_permute = transformations.PermuteChannels()
transform_toTensor = transformations.ToTensor()
transform_oneHotEncoding = transformations.OneHotEncoding(num_classes=3)

mega_transform = transformations.Compose([
    transform_invpermutechannels, 
    transform_resize,
    transform_gaussnoise,
    transform_motionblur,
    transform_rr90,
    transform_flip,
    transform_permute, 
#     transform_toTensor, 
#     transform_oneHotEncoding,
    ])

pt_ds = WorldFloodsDataset(image_files, image_prefix, gt_prefix, transforms=mega_transform)

[DataPrep] Unzip files downloaded from Copernicus EMS and extract them to floodmap and floodmap meta

These are useful and necessary steps for acquiring floodmaps, which can be used for visualization purposes OR for the MLOps. A rough sketch of the unzip-and-index step follows the TODO list.


TODO

  • Search through unzipped files to get shape files (3 shape files)
    • Area of Interest
    • Observed event
    • Hydrography (river); sub categories l/a
  • Build the Copernicus Meta with Filenames of the items
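
A rough sketch of these steps, assuming the EMS zips are already in a local directory; the shapefile name patterns are assumptions based on the three categories above.

import glob
import os
import zipfile

def unzip_and_index(zip_dir: str, out_dir: str) -> dict:
    """Unzip each EMS product and index its shapefiles by category."""
    meta = {}
    for zip_path in glob.glob(os.path.join(zip_dir, "*.zip")):
        product = os.path.splitext(os.path.basename(zip_path))[0]
        dest = os.path.join(out_dir, product)
        os.makedirs(dest, exist_ok=True)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest)
        shps = glob.glob(os.path.join(dest, "*.shp"))
        meta[product] = {  # name patterns are hypothetical
            "area_of_interest": [s for s in shps if "areaOfInterest" in s],
            "observed_event": [s for s in shps if "observedEvent" in s],
            "hydrography": [s for s in shps if "hydrography" in s],
        }
    return meta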

Useful functions

Note: this needs untangling, because some ground-truth processing is mixed inside the file.


Linked Issues

[DataPrep] WorldFloods dataloader 2.0

Todo

  • Create GT given S2 Images (ref #9 )
  • Clean File structure to have the image_prefix, gt_prefix as well as the train, test, val split.
  • Load the selected files (List[str])
  • Cache the dataset
  • Initialize the Loader with some standard preprocessing schemes (normalization, etc).
  • Do a dry run with the dataloader.

[DataPrep] GT Generator 4 WF2.0

We would like to use the new create_gt.py file found in issue #30; it should produce a split of rasters: 1) one raster for land vs. water and 2) one raster for cloud vs. no cloud. A sketch of the band-stacking step follows the ToDos.


ToDos:

  • Save raster with different bands
  • Update meta-data with band info
  • DataLoader which accounts for this
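
A hedged sketch of the band-stacking step, assuming two binary uint8 masks and georeferencing borrowed from the source image; paths and encodings are examples.

import numpy as np
import rasterio

with rasterio.open("S2_scene.tif") as src:  # example source for georeferencing
    crs, transform = src.crs, src.transform
    h, w = src.height, src.width

water_mask = np.zeros((h, w), dtype="uint8")  # 1 = water, 0 = land (assumed encoding)
cloud_mask = np.zeros((h, w), dtype="uint8")  # 1 = cloud, 0 = clear

with rasterio.open("gt.tif", "w", driver="GTiff", height=h, width=w, count=2,
                   dtype="uint8", crs=crs, transform=transform) as dst:
    dst.write(water_mask, 1)               # band 1: land/water
    dst.write(cloud_mask, 2)               # band 2: cloud/no-cloud
    dst.update_tags(1, name="land_water")  # band metadata, per the ToDos
    dst.update_tags(2, name="cloud")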

[Models] Invalid values in images and ground truths

Satellite images (the model inputs) have missing values (encoded as 0 in Sentinel-2).
Ground truth masks (the model outputs) can also have missing values (likewise encoded as the 0 class).

Currently, ground truths are set to invalid wherever the input is invalid: worldfloods_internal/compute_meta_tiff.py

Options for training the model (i.e., how we should deal with invalids in the input/output of the loss function):

  • Option 1: replace invalid in output with land class (inpainting?)
  • Option 2: learn a separate invalid class
  • Option 3: Mask out invalids from loss
  • Option 4: use uncertainty as a proxy to invalids ([.333, .333, .333])
  • Option 5: use a model to inpaint the image (cloud or/and missing removal model)

If more than one option is implemented, a value in the config.yaml file can select which invalid policy is used for training. Option 3 is sketched below.
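
A minimal sketch of Option 3, assuming the class-0-is-invalid encoding above; PyTorch's CrossEntropyLoss supports this directly via ignore_index.

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss(ignore_index=0)   # class 0 == invalid/missing
logits = torch.randn(4, 4, 256, 256)            # (batch, num_classes, H, W)
target = torch.randint(0, 4, (4, 256, 256))     # 0 = invalid; other class ids assumed
loss = loss_fn(logits, target)                  # invalid pixels contribute nothing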

[Viz] Query GEE given Floodmaps

This issue covers most of the preprocessing needed to download the floodmaps to the bucket and then query GEE (e.g., for S2 images), saving the results to the bucket. The Viz team can either query the bucket directly for S2 images and their associated floodmaps, or use the API to query GEE directly. A hedged query sketch follows the TODO list.


TODO

  • Query GEE for S2 (or S1) images
    • query given floodmaps from our bucket
  • Save them to .cog format, data_lake_vizmart bucket
  • Query bucket of already saved floodmaps and images
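
A hedged sketch of the GEE query step using the earthengine-api, assuming a floodmap GeoJSON on disk; the path and date window are examples.

import json
import ee

ee.Initialize()

# Use the floodmap geometry as the area of interest.
with open("EMSR501_AOI01_floodmap.geojson") as f:
    aoi = ee.Geometry(json.load(f)["features"][0]["geometry"])

s2 = (ee.ImageCollection("COPERNICUS/S2")
      .filterBounds(aoi)
      .filterDate("2019-03-01", "2019-03-15")   # example event window
      .sort("CLOUDY_PIXEL_PERCENTAGE"))
image = ee.Image(s2.first()).clip(aoi)          # least-cloudy S2 scene over the floodmap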

Linked Issues

#45

Check notebooks work

Things to check:

  1. Check that the notebooks run.
  2. Check whether they run in Colab.
  3. Check that they point to data on gs://ml4cc_data_lake/2_PROD/ (copy data there if needed, and ask Gonzalo to change permissions so the new data is publicly accessible).
  • jupyterbook/content/ml4ops/HOWTO_Calculate_uncertainty_maps.ipynb
  • jupyterbook/content/ml4ops/HOWTO_performance_metrics_workflow.ipynb
  • jupyterbook/content/ml4ops/HOWTO_Run_Inference_on_new_data.ipynb
  • jupyterbook/content/ml4ops/HOWTO_Train_models.ipynb
  • jupyterbook/content/ml4ops/HOWTO_Run_Inference_multioutput_binary.ipynb
  • jupyterbook/content/prep/demo_pytorch_transforms.ipynb
  • jupyterbook/content/prep/full_data_ingest.ipynb
  • jupyterbook/content/prep/geographic_index_demo.ipynb
  • jupyterbook/content/prep/gt_masks_generation.ipynb

[DataPrep] GT Generator 4 WF1.1

We would like to use the create_gt.py file from the original worldfloods_internal repo, found here. It should work out of the box, but we expect some hiccups. It will be improved later.


ToDos: Water

  • Filter the Land
  • Generate Flood Map
  • Compute Water w. Floodmap

ToDos: Clouds

  • Load Cloud Tiffs
  • Generate Cloud Map

[Models] Notebook with metrics in the WorldFloods test dataset

Things to take into account:

  • Function that, given a pytorch dataset, returns a confusion matrix of shape (len(dataset), num_class, num_class) (a sketch follows this list).
  • Function that takes that tensor and outputs a JSON of average metrics.
  • Consider that eventually we might want to stratify results by permanent/non-permanent water.
  • Consider that eventually we might want to save the output preds for viz.
  • Consider functionality to measure performance on a per-flood-event basis.
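
A sketch of the first two items, assuming integer class masks and a model that returns per-pixel logits; the function names are hypothetical.

import json
import torch

def dataset_confusion(dataset, model, num_class: int) -> torch.Tensor:
    """Per-image confusion matrices: (len(dataset), num_class, num_class)."""
    out = torch.zeros(len(dataset), num_class, num_class, dtype=torch.long)
    for i, (x, y) in enumerate(dataset):
        pred = model(x.unsqueeze(0)).argmax(dim=1).squeeze(0)
        idx = (y.flatten() * num_class + pred.flatten()).long()  # row=true, col=pred
        out[i] = torch.bincount(idx, minlength=num_class ** 2).reshape(num_class, num_class)
    return out

def average_metrics(conf: torch.Tensor) -> str:
    """Aggregate the confusion tensor and emit average metrics as JSON."""
    total = conf.sum(dim=0).float()
    recall = (total.diag() / total.sum(dim=1).clamp(min=1)).tolist()
    return json.dumps({"per_class_recall": recall})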

send file to bucket?

def save_file_to_bucket(target_directory: str, source_directory: str):

Is bork.

src = '/home/lucas/ml4floods/tmp/EMSR501_AOI01_vector.geojson'
dst = 'gs://ml4floods/worldfloods/lk-dev/meta/EMSR501_AOI01_vector.geojson'
utils.save_file_to_bucket(src, dst)

File "/home/lucas/ml4floods/src/data/utils.py", line 401, in parse_gcp_path
bucket_id = str(Path(full_path.split("gs://")[1]).parts[0])
IndexError: list index out of range

[DataPrep] Need better comments...

We can do better with our comments.

Todo

  • src/data/worldfloods/dataset.py
  • src/data/worldfloods/download.py
  • src/data/worldfloods/prepare_data.py

[DataPrep] WorldFloods 1.1 and WorldFloods 2.0 Full Pipeline

The full preprocessing pipeline for WorldFloods1.1 and WorldFloods 2.0:

  1. Query Copernicus EMS
  2. Generate Floodmaps
  3. Query GEE with FloodMaps
  4. Generate GT with floodmaps and S2 images.

Visual Pipeline

(Pipeline diagram: Group SpaceML, WorldFloods 1.1 / 2.0 pipeline.)




Demo

All of the steps below are based on the demo notebook found here:

No .cog file-format considerations (that we know of).


Copernicus Query & Save (ingest.py)

  • Download the Zip Files from Copernicus EMS
  • Unzip the files into appropriate file directory structure

Copernicus Post-Processing (hardutils.py)

These are useful and necessary steps for acquiring floodmaps, which can be used for visualization purposes OR for the MLOps.

  • Search through unzipped files to get shape files (3 shape files)
    • Area of Interest
    • Observed event
    • Hydrography (river); sub categories l/a
  • Build the Copernicus Meta with Filenames of the items

Build FloodMap (softutils.py)

  • query the Copernicus metadata to get the names and shape files
  • open with geopandas (see the sketch after this list)
  • collapse polygons that share a label (e.g. flood, hydro, ...)
  • convert the new shape file to geojson
  • store the geojson in the ml4floods_data_lake_ETL bucket
  • store the floodmap metadata that was queried...? @gonzmg88
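
A hedged geopandas sketch of these steps; the file names and the label column ("obj_type") are hypothetical.

import geopandas as gpd

gdf = gpd.read_file("EMSR501_AOI01_observedEventA.shp")        # open with geopandas
collapsed = gdf.dissolve(by="obj_type")                        # collapse polygons sharing a label
collapsed.to_file("EMSR501_AOI01_floodmap.geojson", driver="GeoJSON")
# uploading the geojson (and its metadata) to the ml4floods_data_lake_ETL bucket would follow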

Sentinel-2 (S2) (ingest.py)

We use the geojson files to get a bounding box with which to query GEE for S2 images.

  • query the database (e.g., data, event, alert) for stored geojson files
  • using the polygons, query the GEE platform
  • download the intersecting S2 tiles to a bucket
  • save as .cog

Note: ee_download.py

Build Ground Truth (softutils.py)

Cloud or No Cloud

  • Query S2 database for images in ROI
  • Last band from S2 image
  • save tiff files to bucket, data_lake_mlmart

Water or No Water

  • Query S2 database for images in ROI
  • Magic... see create_gt.py
  • save tiff files to bucket, data_lake_mlmart

Visualization Territory

  • Query GEE given floodmaps
  • Save them to .cog format, data_lake_vizmart bucket

@Lkruitwagen Any opinion about any intermediate steps?

[DataPrep] Query S2 Images given a floodmap

We use the geojson files to get a bounding box with which to query GEE for S2 images.


TODO

  • query the database (e.g., data, event, alert) for stored geojson files
  • using the polygons, query the GEE platform
  • download the intersecting S2 tiles to a bucket
  • option 1: save a simple .tiff
    • pipe to ML Mart
  • option 2: save as .cog

Note: ee_download.py


Linked Issues

#45

[Data Prep] Sentinel-1 Images

  • Grab sentinel-1 bucket location
  • Create Data object
  • Return json with object in zarr or tif format
  • Match with Sen2
  • Grab a drink
  • Profit

[DataPrep] Remove Dependency on Path for PyTorch Dataset

The current API for the WorldFloods dataset is to give a list of directories where the subdirectories are defined by an image_prefix and a gt_prefix.

Example:

dir1
├── image_prefix
│   ├── *.tiff
│   └── ...
└── gt_prefix
    ├── *.tiff
    └── ...
dir2
├── image_prefix
│   ├── *.tiff
│   └── ...
└── gt_prefix
    ├── *.tiff
    └── ...

We want to abstract this step away from the dataset itself: give it an explicit list of files instead of having it spider down the directories, with some extra helper functions to build that list beforehand (a sketch follows).
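
A sketch of the proposed pre-step, resolving explicit file lists up front; the pairing-by-basename convention is an assumption.

import glob
import os
from typing import List, Tuple

def list_image_gt_pairs(dirs: List[str], image_prefix: str,
                        gt_prefix: str) -> Tuple[List[str], List[str]]:
    """Walk the directories once and return matched (image, gt) file lists."""
    images, gts = [], []
    for d in dirs:
        for img in sorted(glob.glob(os.path.join(d, image_prefix, "*.tiff"))):
            gt = os.path.join(d, gt_prefix, os.path.basename(img))
            if os.path.exists(gt):  # keep only pairs that have a ground truth
                images.append(img)
                gts.append(gt)
    return images, gts

# The dataset would then take the explicit lists instead of spidering down itself.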

[DataPrep] csv file with train/val/test files

We need a csv file with all of the filenames for the images and ground truths. A simple glob will be sufficient.


Format

Some key columns for easy querying (a build sketch follows this list):

  • filename, e.g. the file's base name
  • filepath, e.g. path/to/file
  • bucket, e.g. ml4floods
  • split, e.g. train, test, val
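
A minimal sketch of building the CSV with a glob and pandas; the local directory layout and bucket name are examples.

import glob
import os
import pandas as pd

rows = []
for split in ["train", "val", "test"]:
    for path in glob.glob(f"worldfloods/{split}/**/*.tiff", recursive=True):
        rows.append({"filename": os.path.basename(path),
                     "filepath": path,
                     "bucket": "ml4floods",
                     "split": split})
pd.DataFrame(rows).to_csv("train_test_split.csv", index=False)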

TiffImages

These are the original tiff images that were used before the train-test-split.

Location: Should be in the ml4floods_data_lake directory. The csv file should be at the top of the directory where the tiffimages are located.

  • CSV File

Train/Test/Val

This will be the train/test/val split data.

Location: Should be in the ml4floods_data_lake directory. The csv file should be at the top of the directory where the train/val/split is located.

  • CSV File

[DataPrep] PyTorch-Lightning DataModule

We need a PyTorch-Lightning DataModule which will hold all of the data methods, e.g. prepare_, download_, train_test_split_, etc. A skeleton follows the TODO list.


TODO

  • prepare_data
  • setup
  • train_dataloader
  • val_dataloader
  • test_dataloader
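
A skeleton of the proposed datamodule; the dataset wiring inside setup is left as an assumption.

import pytorch_lightning as pl
from torch.utils.data import DataLoader

class WorldFloodsDataModule(pl.LightningDataModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size

    def prepare_data(self):
        # one-time work (download, unzip, train/test split); no state assignment here
        pass

    def setup(self, stage=None):
        # build self.train_ds / self.val_ds / self.test_ds, e.g. from the split CSV
        pass

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_ds, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_ds, batch_size=self.batch_size)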

Resources to Learn More

Some testing ideas / principles

Ref: PR 53: Pytest to environment.yml

The first thing to say is that you probably don't need to write much more code. The notebooks contain the test principles; tests are just an automated way to run them before you push changes to master etc.

My main motivation for writing tests is that they allow me to catch silent errors, quickly identify what has gone wrong, and jump into arbitrary points in the pipeline by putting in an assert False statement and using pdb (might not be best practice, but I find it fast), e.g.

pytest --pdb .

Key things to test:

  • Shapes of input and output datasets match expectations
  • Missing values (they need to be found and dealt with)
  • Value Ranges are within expected bounds (e.g. probability 0-1)
  • Very small toy datasets (which let the tests run quickly) are really important for passing data through the full pipeline.
  • In the past I have written assertions that the model is "learning", i.e. losses fall after 1 or 2 epochs (but I'm not sure this is best practice, since with SGD there's always a random chance this won't hold; you expect it, but it's not guaranteed).

Things to remember:

  • Deterministically seed the random number generator (see the toy test below)
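
A toy test along these lines (shapes, value ranges, seeding); the softmax output is a stand-in for a real model call.

import numpy as np
import torch

def test_prediction_shape_and_range():
    torch.manual_seed(0)   # deterministic seeding
    np.random.seed(0)
    x = torch.rand(2, 3, 32, 32)                              # tiny toy batch
    probs = torch.softmax(torch.randn(2, 3, 32, 32), dim=1)   # stand-in for model output
    assert probs.shape == x.shape                             # shapes match expectations
    assert 0.0 <= float(probs.min()) and float(probs.max()) <= 1.0  # valid probabilities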

Nice to haves:

  • I think typing is great as a documentation tool: it lets a user know whether a function takes or returns a Tensor, a List, a Dict, etc. Sometimes getting mypy to play nicely and raise no errors is a bit of a faff, so I'd call it a general principle rather than a requirement that all mypy checks pass. Ultimately, however, that is what you want.

EDIT: removed the link to the hypothesis library. Because it hides the inputs (they're generated by the library), it's harder to use the tests as a point of developer understanding of what's going on; it's easier to just generate your own test examples.

[DataPrep] Build Floodmap from Copernicus Post-Processing


TODO

  • query the Copernicus metadata to get the names and shape files
  • open with geopandas
  • collapse polygons that share a label (e.g. flood, hydro, ...)
  • convert the new shape file to geojson
  • store the geojson in the ml4floods_data_lake_ETL bucket
  • store the floodmap metadata that was queried...? @gonzmg88

Linked Issue

#45
