spaceml-org / ml4floods

An ecosystem of data, models and code pipelines to tackle flooding with ML
Home Page: https://spaceml-org.github.io/ml4floods/
License: GNU Lesser General Public License v3.0
There's a bug in the function `process_filename_train_test()`:

ml4floods/ml4floods/models/dataset_setup.py, lines 143 to 164 (commit ce48c67)

At line 162, the same filename (i.e., the input image) is downloaded to both `input_folder` and `target_folder`. As a result, the ground truth masks are not downloaded correctly.

Two options to choose from:
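A minimal sketch of the fix, using hypothetical helper names (`stage_pair`, `download`) rather than the actual code in dataset_setup.py: the point is that the mask's own filename must be used when downloading into `target_folder`.

```python
import os

def stage_pair(input_remote, target_remote, input_folder, target_folder, download):
    """Download an (image, mask) pair into local folders.

    The bug was using the *input* filename for both destinations; here the
    target file keeps its own name. (Names are illustrative, not the repo's.)
    """
    local_image = os.path.join(input_folder, os.path.basename(input_remote))
    # Use target_remote here, not input_remote, so the mask is fetched too.
    local_mask = os.path.join(target_folder, os.path.basename(target_remote))
    download(input_remote, local_image)
    download(target_remote, local_mask)
    return local_image, local_mask
```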
Description:
Issue Summary:
While running inference on a Sentinel-2 scene depicting flash flooding in Dubai, using the ml4floods-1.0.1 library and a pre-trained ml4floods model, cloud shadow pixels are misclassified as water. This misclassification is not observed in other scenes from the test set, such as the MSR264_18MIANDRIVAZODETAIL_DEL_v2 flood map from the WorldFloodsv2 dataset.
Steps to Reproduce:
Expected Behavior:
The model should accurately distinguish between water pixels and cloud shadows, ensuring that only actual water bodies/inundated regions are classified as such.
Actual Behavior:
Cloud shadow pixels are misclassified as water, leading to inaccurate flood maps.
Additional Information:
Trained model used: Unet multioutput S2-to-L8 in folder models/WF2_unetv2_bgriswirs.
Example image: Sen2_dubai_flood_annotated
Tutorial notebooks are going to be a strong component of the repo. Is there a way to add this as a pytest?
It might be helpful to include meta information like the `crs` and the `transform` for the Viz team.

See file src/data/create_gt.py.

For downstream tasks it would be helpful if the `crs` and `transform` were also (optionally?) returned. We need these to be able to rasterize the ndarray to COG.
Currently, we're using the albumentations library. I would suggest we change this to torchvision because it's more consistent with pytorch. It would require minimal API changes, but we need to ensure it's consistent.

Note: rasterio also reads arrays as CxHxW, so we don't need to permute the channels in this case.
Source: PyTorch Docs
Example:
from torchvision import transforms

# Stacked transforms. Rescale and RandomCrop here are the custom transform
# classes from the PyTorch data-loading tutorial, not torchvision built-ins.
mega_transform = transforms.Compose([Rescale(256),
                                     RandomCrop(224)])

pt_ds = WorldFloodsDataset(image_files, image_prefix, gt_prefix, transforms=mega_transform)
albumentations

Very similar to the torchvision case, except the expected shape is HxWxC instead of CxHxW. So we need a dedicated PermuteChannels() class to ensure the channels are in an order that makes sense.

Note: rasterio reads arrays as CxHxW, so we do need to permute the channels in this case.
Source: Our Notebook
Example:
# Stacked Transforms
transform_permute = transformations.PermuteChannels()
transform_toTensor = transformations.ToTensor()
transform_oneHotEncoding = transformations.OneHotEncoding(num_classes=3)

mega_transform = transformations.Compose([
    transform_invpermutechannels,
    transform_resize,
    transform_gaussnoise,
    transform_motionblur,
    transform_rr90,
    transform_flip,
    transform_permute,
    # transform_toTensor,
    # transform_oneHotEncoding,
])

pt_ds = WorldFloodsDataset(image_files, image_prefix, gt_prefix, transforms=mega_transform)
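The dedicated `PermuteChannels()` class could be sketched as follows (a minimal version assuming numpy array inputs; the real transform may also need to handle the mask):

```python
import numpy as np

class PermuteChannels:
    """Move channels from first to last axis: CxHxW -> HxWxC.

    rasterio reads rasters as CxHxW, while albumentations expects HxWxC,
    so this runs before the albumentations transforms (and the inverse
    permute runs after).
    """
    def __call__(self, image: np.ndarray) -> np.ndarray:
        return np.transpose(image, (1, 2, 0))

class InvPermuteChannels:
    """Inverse: HxWxC -> CxHxW, e.g. before handing the array to PyTorch."""
    def __call__(self, image: np.ndarray) -> np.ndarray:
        return np.transpose(image, (2, 0, 1))
```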
Ingest events from Hum Data and fetch surrounding sat images, etc.
These are useful and necessary steps to acquire floodmaps. These can be used for visualization purposes OR for the MLOPs.
src/data/copernicusEMS/activations.py does almost everything. Note: it needs to be unravelled, because some ground truth processing is mixed inside the file.
Todo

- Take the `image_prefix`, `gt_prefix` as well as the `train`, `test`, `val` split (`List[str]`).
- Handle preprocessing options (`normalization`, etc).
- We would like to use the new `create_gt.py` file found in issue #30, which should do a split of rasters: 1) one raster will be land or water and 2) one raster will be cloud or no cloud.
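The raster split described above could look like this sketch (the class encoding {0: invalid, 1: land, 2: water, 3: cloud} is an assumption, not necessarily what the `create_gt.py` in #30 uses):

```python
import numpy as np

def split_gt_rasters(gt: np.ndarray):
    """Split a single multi-class mask into two binary rasters:
    1) land/water and 2) cloud/no-cloud. Class encoding is assumed."""
    water = (gt == 2).astype(np.uint8)  # 1 = water, 0 = otherwise
    cloud = (gt == 3).astype(np.uint8)  # 1 = cloud, 0 = otherwise
    return water, cloud
```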
ToDos:
This will be useful for running inference with an already trained model. We are assuming the new image looks (more or less) the same as the training and test images. This is for S2, WorldFloods 2.0.
Could be useful for issue #9
ToDo
Ideas
Pair coded with @satyarth934 @nadia-eecs @sambuddinc
The google.cloud API is clunky: it forces us to parse strings for specifics (e.g., bucket, path, file), and we have a lot of unnecessary wrappers everywhere in our utils section.

I suggest we use cloudpathlib. It is very simple, with an API similar to Pathlib.
Take the previous script compute_meta_tiff.py and reproduce this script, which creates the `gt` labels necessary for training.

Steps:

- Create the `gt` rasters.
- Recreate the function files_train_test.py, which will take a list of files and do the directory partition into train, test, val.
- Pair the `S2` images and `gt`.

Question @gonzmg88: should the tiling already be done?
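A sketch of the proposed files_train_test.py partition (the fractions and the seeded shuffle are assumptions, not the repo's actual defaults):

```python
import random

def files_train_test(files, test_frac=0.1, val_frac=0.1, seed=42):
    """Partition a list of files into train/test/val splits."""
    files = sorted(files)               # deterministic base order
    random.Random(seed).shuffle(files)  # reproducible shuffle
    n = len(files)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return {
        "test": files[:n_test],
        "val": files[n_test:n_test + n_val],
        "train": files[n_test + n_val:],
    }
```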
Satellite images (input to the models) have missing values (in Sentinel-2 encoded as 0).

Ground truth masks (output of the models) could also have missing values (also encoded as the 0 class).

Currently ground truths are set to invalid when the input is invalid (worldfloods_internal/compute_meta_tiff.py).

Options for training the model (how we should deal with invalids in the input/output in the loss function):

- e.g. assign invalid pixels a uniform class probability (`[.333, .333, .333]`).

If more than one option is implemented, we can have a value in the config.yaml file that selects which invalid policy is used for training.
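One candidate policy, masking invalids out of the loss, can be sketched framework-agnostically (numpy here; in PyTorch the same effect comes from `CrossEntropyLoss(ignore_index=0)`). The function name and signature are illustrative:

```python
import numpy as np

def masked_loss(per_pixel_loss: np.ndarray, target: np.ndarray,
                invalid_class: int = 0) -> float:
    """Average the per-pixel loss over valid pixels only, so that
    invalid pixels (encoded as class 0) contribute no training signal."""
    valid = target != invalid_class
    if not valid.any():
        return 0.0
    return float(per_pixel_loss[valid].mean())
```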
This issue will do most of the preprocessing necessary to download the floodmaps to the bucket and the subsequent querying from GEE (e.g. S2 images) to be saved to the bucket. The Viz team can either query the bucket directly for S2 images and associated floodmaps or use the API to query GEE directly.

Save in `.cog` format to the `data_lake_vizmart` bucket.

Things to check:

gs://ml4cc_data_lake/2_PROD/ (copy data there if needed and ask Gonzalo to change permissions to make these new data publicly accessible).

jupyterbook/content/ml4ops/HOWTO_Calculate_uncertainty_maps.ipynb
jupyterbook/content/ml4ops/HOWTO_performance_metrics_workflow.ipynb
jupyterbook/content/ml4ops/HOWTO_Run_Inference_on_new_data.ipynb
jupyterbook/content/ml4ops/HOWTO_Train_models.ipynb
jupyterbook/content/ml4ops/HOWTO_Run_Inference_multioutput_binary.ipynb
jupyterbook/content/prep/demo_pytorch_transforms.ipynb
jupyterbook/content/prep/full_data_ingest.ipynb
jupyterbook/content/prep/geographic_index_demo.ipynb
jupyterbook/content/prep/gt_masks_generation.ipynb
We would like to use the `create_gt.py` file found in the original `worldfloods_internal` repo found here. It should work out of the box, but we expect some hiccups. Later this will be improved.

ToDos: Water

ToDos: Clouds

Nice to have:

- Run `create_gt.py` and upload the `tiff` files to the bucket, `data_lake_mlmart`.
This task involves querying the Copernicus EMS for flood events. We will download the zip files and save them to the ml4floods_data_lake
bucket (TBD).
Related to #13; this is a next step as discussed in our data backlog.

Requested as a high-priority item by @Lkruitwagen in the viz team.
Things to take into account:
Line 218 in ca80df1 is broken. Running:

```python
src = '/home/lucas/ml4floods/tmp/EMSR501_AOI01_vector.geojson'
dst = 'gs://ml4floods/worldfloods/lk-dev/meta/EMSR501_AOI01_vector.geojson'
utils.save_file_to_bucket(src, dst)
```

fails with:

```
File "/home/lucas/ml4floods/src/data/utils.py", line 401, in parse_gcp_path
    bucket_id = str(Path(full_path.split("gs://")[1]).parts[0])
IndexError: list index out of range
```
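One way to make this parse robust, whether or not cloudpathlib is adopted, is to validate the scheme instead of blindly splitting on "gs://". A sketch of a replacement (the real `parse_gcp_path` may return something different):

```python
from urllib.parse import urlparse

def parse_gcp_path(full_path: str):
    """Split a gs:// URI into (bucket_id, blob_path), failing loudly on
    local paths instead of raising IndexError."""
    parsed = urlparse(full_path)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Expected a gs:// URI, got: {full_path!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```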
We can do better with the comments and docstrings in these files:
Todo
src/data/worldfloods/dataset.py
src/data/worldfloods/download.py
src/data/worldfloods/prepare_data.py
The full preprocessing pipeline for WorldFloods 1.1 and WorldFloods 2.0.

All of the steps below are based on the demo notebook found here. No `.cog` file format considerations (that we know of...).

- Ingest the Copernicus EMS activations (`ingest.py`).
- Process the raw floodmaps (`hardutils.py`). These are useful and necessary steps to acquire floodmaps; they can be used for visualization purposes OR for the MLOps.
- Convert the floodmaps to `geojson` (`softutils.py`).
- Upload the `geojson` files to the `ml4floods_data_lake_ETL` bucket (`ingest.py`).
- Use the `geojson` files to get a bounding box to query GEE for S2 images; save the downloads as `.cog`. Note: `ee_download.py` (`softutils.py`).
- Run `create_gt.py` and upload the `tiff` files to the bucket, `data_lake_mlmart`.
- Convert to `.cog` format for the `data_lake_vizmart` bucket.

@Lkruitwagen Any opinion about any intermediate steps?
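The bounding-box step (geojson to a bbox for the GEE S2 query) can be sketched in pure Python. This assumes a standard GeoJSON FeatureCollection with nested coordinate arrays; the real code in ee_download.py may differ:

```python
import json

def geojson_bbox(path: str):
    """Return (min_x, min_y, max_x, max_y) over all geometry coordinates."""
    with open(path) as f:
        gj = json.load(f)
    xs, ys = [], []

    def walk(coords):
        # Recurse through nested coordinate arrays down to (x, y) pairs.
        if isinstance(coords[0], (int, float)):
            xs.append(coords[0])
            ys.append(coords[1])
        else:
            for c in coords:
                walk(c)

    for feature in gj["features"]:
        walk(feature["geometry"]["coordinates"])
    return min(xs), min(ys), max(xs), max(ys)
```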
The current API for the WorldFloods dataset is to give a list of directories where the subdirectories are defined by an `image_prefix` and a `gt_prefix`.
Example:
dir1
├───image_prefix
│   │   *.tiff
│   │   ...
└───gt_prefix
    │   *.tiff
    │   ...
dir2
├───image_prefix
│   │   *.tiff
│   │   ...
└───gt_prefix
    │   *.tiff
    │   ...
We want to abstract this step from the dataset itself: give it an explicit list of files instead of spidering down the directories. We can add some extra functions to do this beforehand.
We need a csv file with all of the filenames for the images and ground truths. A simple glob will be sufficient.
Some key columns for easy querying.
These are the original tiff images that were used before the train-test split.

Location: should be in the ml4floods_data_lake directory. The csv file should be at the top of the directory where the tiff images are located.

This will be the train/test/val split data.

Location: should be in the ml4floods_data_lake directory. The csv file should be at the top of the directory where the train/val/test split is located.
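A glob-based sketch of building that csv index (the column names, the `.tiff` suffix, and the default prefixes are assumptions):

```python
import csv
from pathlib import Path

def build_file_index(root: str, image_prefix: str = "S2", gt_prefix: str = "gt"):
    """Glob dir*/image_prefix/*.tiff and pair each image with its ground
    truth by filename; returns rows ready for csv.DictWriter."""
    rows = []
    for img in sorted(Path(root).glob(f"*/{image_prefix}/*.tiff")):
        gt = img.parent.parent / gt_prefix / img.name
        rows.append({
            "image": str(img),
            "gt": str(gt),
            "has_gt": gt.exists(),  # easy querying: drop rows without masks
        })
    return rows

def write_index(rows, csv_path: str):
    """Write the index to csv at the top of the directory tree."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["image", "gt", "has_gt"])
        writer.writeheader()
        writer.writerows(rows)
```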
Ref: PR 53: Pytest to environment.yml
First thing to say: you probably don't need to write much more code. The notebooks contain the test principles; a test suite is just an automated way for you to run them before you push changes to master etc.

My main motivation for writing tests is that they allow me to catch silent errors, quickly identify what has gone wrong, and jump into arbitrary points in the pipeline by putting in an `assert False` statement and using `pdb` (might not be best practice but I find it's fast), e.g.

pytest --pdb .
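A minimal example of what such a test could look like (the normalization under test is a stand-in, not the repo's actual code):

```python
import numpy as np

def normalize(image: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Stand-in per-band normalization: (x - mean) / std along the band axis."""
    return (image - mean[:, None, None]) / std[:, None, None]

def test_normalize_zero_mean_unit_std():
    # Statistical checks like this catch silent broadcasting errors early.
    rng = np.random.default_rng(0)
    x = rng.normal(5.0, 2.0, size=(3, 32, 32))
    mean = x.mean(axis=(1, 2))
    std = x.std(axis=(1, 2))
    out = normalize(x, mean, std)
    assert np.allclose(out.mean(axis=(1, 2)), 0.0, atol=1e-8)
    assert np.allclose(out.std(axis=(1, 2)), 1.0, atol=1e-8)
```

Running `pytest --pdb` on a file like this drops you into the debugger at the first failing assertion, which is the workflow described above.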
Key things to test:
Things to remember:
Nice to haves:
- Type hints, so it's clear whether something is a `Tensor` or a `List` or a `Dict` etc. Sometimes getting `mypy` to play nicely and not give any errors is a bit of a faff, so I would say it is more of a general principle than necessarily having all mypy checks passing. Ultimately, however, that is what you want.

EDIT: removed the link to the hypothesis library because, by hiding the inputs (they're generated by the library), it's harder to use the tests as a point for developer understanding of what's going on; it's easier to just generate your own test examples.