spaceml-org / ml4floods

An ecosystem of data, models and code pipelines to tackle flooding with ML
Home Page: https://spaceml-org.github.io/ml4floods/
License: GNU Lesser General Public License v3.0
There's a bug in the function `process_filename_train_test()`:

ml4floods/ml4floods/models/dataset_setup.py, lines 143 to 164 (commit ce48c67)

At line 162, the same filename (i.e., the input image) is downloaded to both `input_folder` and `target_folder`. As a result, the ground truth masks are not downloaded correctly.

Two options to choose from:
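A minimal sketch of the fix, using hypothetical helper names (`stage_pair`, `download`) rather than the actual code in dataset_setup.py: the point is that the mask's own filename must be used when downloading into `target_folder`.

```python
import os

def stage_pair(input_remote, target_remote, input_folder, target_folder, download):
    """Download an (image, mask) pair into local folders.

    The bug was using the *input* filename for both destinations; here the
    target file keeps its own name. (Names are illustrative, not the repo's.)
    """
    local_image = os.path.join(input_folder, os.path.basename(input_remote))
    # Use target_remote here, not input_remote, so the mask is fetched too.
    local_mask = os.path.join(target_folder, os.path.basename(target_remote))
    download(input_remote, local_image)
    download(target_remote, local_mask)
    return local_image, local_mask
```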
Description:
Issue Summary:
While running inference on a Sentinel-2 scene depicting flash flooding in Dubai, using the ml4floods-1.0.1 library and a pre-trained ml4floods model, cloud shadow pixels are misclassified as water. This misclassification is not observed in other scenes from the test set, such as the MSR264_18MIANDRIVAZODETAIL_DEL_v2 flood map from the WorldFloodsv2 dataset.
Steps to Reproduce:
Expected Behavior:
The model should accurately distinguish between water pixels and cloud shadows, ensuring that only actual water bodies/inundated regions are classified as such.
Actual Behavior:
Cloud shadow pixels are misclassified as water, leading to inaccurate flood maps.
Additional Information:
Trained model used: Unet multioutput S2-to-L8 in folder models/WF2_unetv2_bgriswirs.
Example image: Sen2_dubai_flood_annotated
Tutorial notebooks are going to be a strong component of the repo. Is there a way to add this as a pytest?
It might be helpful to include meta information like the `crs` and the `transform` for the Viz team.

See file src/data/create_gt.py.

For downstream tasks it would be helpful if the `crs` and `transform` were also (optionally?) returned. We need these to be able to rasterize the ndarray to COG.
Currently, we're using the albumentations library. I would suggest we change this to torchvision because it's more consistent with pytorch. It would require minimal API changes, but we need to ensure it's consistent.

Note: rasterio also reads arrays as CxHxW, so we don't need to permute the channels in this case.
Source: PyTorch Docs
Example:
from torchvision import transforms

# Stacked transforms. Rescale and RandomCrop here are the custom transform
# classes from the PyTorch data-loading tutorial, not torchvision built-ins.
mega_transform = transforms.Compose([Rescale(256),
                                     RandomCrop(224)])

pt_ds = WorldFloodsDataset(image_files, image_prefix, gt_prefix, transforms=mega_transform)
albumentations

Very similar to the torchvision case, except the expected shape is HxWxC instead of CxHxW. So we need a dedicated PermuteChannels() class to ensure the channels are in an order that makes sense.

Note: rasterio reads arrays as CxHxW, so we do need to permute the channels in this case.
Source: Our Notebook
Example:
# Stacked Transforms
transform_permute = transformations.PermuteChannels()
transform_toTensor = transformations.ToTensor()
transform_oneHotEncoding = transformations.OneHotEncoding(num_classes=3)

mega_transform = transformations.Compose([
    transform_invpermutechannels,
    transform_resize,
    transform_gaussnoise,
    transform_motionblur,
    transform_rr90,
    transform_flip,
    transform_permute,
    # transform_toTensor,
    # transform_oneHotEncoding,
])

pt_ds = WorldFloodsDataset(image_files, image_prefix, gt_prefix, transforms=mega_transform)
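The dedicated `PermuteChannels()` class could be sketched as follows (a minimal version assuming numpy array inputs; the real transform may also need to handle the mask):

```python
import numpy as np

class PermuteChannels:
    """Move channels from first to last axis: CxHxW -> HxWxC.

    rasterio reads rasters as CxHxW, while albumentations expects HxWxC,
    so this runs before the albumentations transforms (and the inverse
    permute runs after).
    """
    def __call__(self, image: np.ndarray) -> np.ndarray:
        return np.transpose(image, (1, 2, 0))

class InvPermuteChannels:
    """Inverse: HxWxC -> CxHxW, e.g. before handing the array to PyTorch."""
    def __call__(self, image: np.ndarray) -> np.ndarray:
        return np.transpose(image, (2, 0, 1))
```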
Ingest events from Hum Data and fetch surrounding sat images, etc.
These are useful and necessary steps to acquire floodmaps. These can be used for visualization purposes OR for the MLOPs.
src/data/copernicusEMS/activations.py does almost everything. Note: it needs to be unravelled, because some ground truth processing is mixed inside the file.
Todo

- Take the `image_prefix`, `gt_prefix` as well as the `train`, `test`, `val` split (`List[str]`).
- Handle preprocessing options (`normalization`, etc).
- We would like to use the new `create_gt.py` file found in issue #30, which should do a split of rasters: 1) one raster will be land or water and 2) one raster will be cloud or no cloud.
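The raster split described above could look like this sketch (the class encoding {0: invalid, 1: land, 2: water, 3: cloud} is an assumption, not necessarily what the `create_gt.py` in #30 uses):

```python
import numpy as np

def split_gt_rasters(gt: np.ndarray):
    """Split a single multi-class mask into two binary rasters:
    1) land/water and 2) cloud/no-cloud. Class encoding is assumed."""
    water = (gt == 2).astype(np.uint8)  # 1 = water, 0 = otherwise
    cloud = (gt == 3).astype(np.uint8)  # 1 = cloud, 0 = otherwise
    return water, cloud
```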
ToDos:
This will be useful for running inference with an already trained model. We are assuming the new image looks (more or less) the same as the training and test images. This is for S2, WorldFloods 2.0.
Could be useful for issue #9
ToDo
Ideas
Pair coded with @satyarth934 @nadia-eecs @sambuddinc
The google.cloud API is clunky: it forces us to parse strings for specifics (e.g., bucket, path, file), and we have a lot of unnecessary wrappers everywhere in our utils section.

I suggest we use cloudpathlib. It is very simple, with an API similar to Pathlib.
Take the previous script compute_meta_tiff.py and reproduce this script, which creates the `gt` labels necessary for training.

Steps:

- Create the `gt` rasters.
- Recreate the function files_train_test.py, which will take a list of files and do the directory partition into train, test, val.
- Pair the `S2` images and `gt`.

Question @gonzmg88: should the tiling already be done?
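A sketch of the proposed files_train_test.py partition (the fractions and the seeded shuffle are assumptions, not the repo's actual defaults):

```python
import random

def files_train_test(files, test_frac=0.1, val_frac=0.1, seed=42):
    """Partition a list of files into train/test/val splits."""
    files = sorted(files)               # deterministic base order
    random.Random(seed).shuffle(files)  # reproducible shuffle
    n = len(files)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return {
        "test": files[:n_test],
        "val": files[n_test:n_test + n_val],
        "train": files[n_test + n_val:],
    }
```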
Satellite images (input to the models) have missing values (in Sentinel-2 encoded as 0).

Ground truth masks (output of the models) could also have missing values (also encoded as the 0 class).

Currently ground truths are set to invalid when the input is invalid (worldfloods_internal/compute_meta_tiff.py).

Options for training the model (how we should deal with invalids in the input/output in the loss function):

- e.g. assign invalid pixels a uniform class probability (`[.333, .333, .333]`).

If more than one option is implemented, we can have a value in the config.yaml file that selects which invalid policy is used for training.
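One candidate policy, masking invalids out of the loss, can be sketched framework-agnostically (numpy here; in PyTorch the same effect comes from `CrossEntropyLoss(ignore_index=0)`). The function name and signature are illustrative:

```python
import numpy as np

def masked_loss(per_pixel_loss: np.ndarray, target: np.ndarray,
                invalid_class: int = 0) -> float:
    """Average the per-pixel loss over valid pixels only, so that
    invalid pixels (encoded as class 0) contribute no training signal."""
    valid = target != invalid_class
    if not valid.any():
        return 0.0
    return float(per_pixel_loss[valid].mean())
```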
This issue will do most of the preprocessing necessary to download the floodmaps to the bucket and the subsequent querying from GEE (e.g. S2 images) to be saved to the bucket. The Viz team can either query the bucket directly for S2 images and associated floodmaps or use the API to query GEE directly.

Save in `.cog` format to the `data_lake_vizmart` bucket.

Things to check:

gs://ml4cc_data_lake/2_PROD/ (copy data there if needed and ask Gonzalo to change permissions to make these new data publicly accessible).

jupyterbook/content/ml4ops/HOWTO_Calculate_uncertainty_maps.ipynb
jupyterbook/content/ml4ops/HOWTO_performance_metrics_workflow.ipynb
jupyterbook/content/ml4ops/HOWTO_Run_Inference_on_new_data.ipynb
jupyterbook/content/ml4ops/HOWTO_Train_models.ipynb
jupyterbook/content/ml4ops/HOWTO_Run_Inference_multioutput_binary.ipynb
jupyterbook/content/prep/demo_pytorch_transforms.ipynb
jupyterbook/content/prep/full_data_ingest.ipynb
jupyterbook/content/prep/geographic_index_demo.ipynb
jupyterbook/content/prep/gt_masks_generation.ipynb
We would like to use the `create_gt.py` file found in the original `worldfloods_internal` repo found here. It should work out of the box, but we expect some hiccups. Later this will be improved.

ToDos: Water

ToDos: Clouds

Nice to have:

- Run `create_gt.py` and upload the `tiff` files to the bucket, `data_lake_mlmart`.
This task involves querying the Copernicus EMS for flood events. We will download the zip files and save them to the ml4floods_data_lake
bucket (TBD).
Related to #13; this is a next step as discussed in our data backlog.

Requested as a high-priority item by @Lkruitwagen in the viz team.
Things to take into account:
Line 218 in ca80df1 is broken. Running:

```python
src = '/home/lucas/ml4floods/tmp/EMSR501_AOI01_vector.geojson'
dst = 'gs://ml4floods/worldfloods/lk-dev/meta/EMSR501_AOI01_vector.geojson'
utils.save_file_to_bucket(src, dst)
```

fails with:

```
File "/home/lucas/ml4floods/src/data/utils.py", line 401, in parse_gcp_path
    bucket_id = str(Path(full_path.split("gs://")[1]).parts[0])
IndexError: list index out of range
```
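One way to make this parse robust, whether or not cloudpathlib is adopted, is to validate the scheme instead of blindly splitting on "gs://". A sketch of a replacement (the real `parse_gcp_path` may return something different):

```python
from urllib.parse import urlparse

def parse_gcp_path(full_path: str):
    """Split a gs:// URI into (bucket_id, blob_path), failing loudly on
    local paths instead of raising IndexError."""
    parsed = urlparse(full_path)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Expected a gs:// URI, got: {full_path!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```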
We can do better with the comments and docstrings in these files:
Todo
src/data/worldfloods/dataset.py
src/data/worldfloods/download.py
src/data/worldfloods/prepare_data.py
The full preprocessing pipeline for WorldFloods 1.1 and WorldFloods 2.0.

All of the steps below are based on the demo notebook found here. No `.cog` file format considerations (that we know of...).

- Ingest the Copernicus EMS activations (`ingest.py`).
- Process the raw floodmaps (`hardutils.py`). These are useful and necessary steps to acquire floodmaps; they can be used for visualization purposes OR for the MLOps.
- Convert the floodmaps to `geojson` (`softutils.py`).
- Upload the `geojson` files to the `ml4floods_data_lake_ETL` bucket (`ingest.py`).
- Use the `geojson` files to get a bounding box to query GEE for S2 images; save the downloads as `.cog`. Note: `ee_download.py` (`softutils.py`).
- Run `create_gt.py` and upload the `tiff` files to the bucket, `data_lake_mlmart`.
- Convert to `.cog` format for the `data_lake_vizmart` bucket.

@Lkruitwagen Any opinion about any intermediate steps?
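The bounding-box step (geojson to a bbox for the GEE S2 query) can be sketched in pure Python. This assumes a standard GeoJSON FeatureCollection with nested coordinate arrays; the real code in ee_download.py may differ:

```python
import json

def geojson_bbox(path: str):
    """Return (min_x, min_y, max_x, max_y) over all geometry coordinates."""
    with open(path) as f:
        gj = json.load(f)
    xs, ys = [], []

    def walk(coords):
        # Recurse through nested coordinate arrays down to (x, y) pairs.
        if isinstance(coords[0], (int, float)):
            xs.append(coords[0])
            ys.append(coords[1])
        else:
            for c in coords:
                walk(c)

    for feature in gj["features"]:
        walk(feature["geometry"]["coordinates"])
    return min(xs), min(ys), max(xs), max(ys)
```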
The current API for the WorldFloods dataset is to give a list of directories where the subdirectories are defined by an `image_prefix` and a `gt_prefix`.
Example:
dir1
├───image_prefix
│   │   *.tiff
│   │   ...
└───gt_prefix
    │   *.tiff
    │   ...
dir2
├───image_prefix
│   │   *.tiff
│   │   ...
└───gt_prefix
    │   *.tiff
    │   ...
We want to abstract this step from the dataset itself: give it an explicit list of files instead of spidering down the directories. We can add some extra functions to do this beforehand.
We need a csv file with all of the filenames for the images and ground truths. A simple glob will be sufficient.
Some key columns for easy querying.
These are the original tiff images that were used before the train-test split.

Location: should be in the ml4floods_data_lake directory. The csv file should be at the top of the directory where the tiff images are located.

This will be the train/test/val split data.

Location: should be in the ml4floods_data_lake directory. The csv file should be at the top of the directory where the train/val/test split is located.
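A glob-based sketch of building that csv index (the column names, the `.tiff` suffix, and the default prefixes are assumptions):

```python
import csv
from pathlib import Path

def build_file_index(root: str, image_prefix: str = "S2", gt_prefix: str = "gt"):
    """Glob dir*/image_prefix/*.tiff and pair each image with its ground
    truth by filename; returns rows ready for csv.DictWriter."""
    rows = []
    for img in sorted(Path(root).glob(f"*/{image_prefix}/*.tiff")):
        gt = img.parent.parent / gt_prefix / img.name
        rows.append({
            "image": str(img),
            "gt": str(gt),
            "has_gt": gt.exists(),  # easy querying: drop rows without masks
        })
    return rows

def write_index(rows, csv_path: str):
    """Write the index to csv at the top of the directory tree."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["image", "gt", "has_gt"])
        writer.writeheader()
        writer.writerows(rows)
```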
Ref: PR 53: Pytest to environment.yml
First thing to say: you probably don't need to write much more code. The notebooks contain the test principles; a test suite is just an automated way for you to run them before you push changes to master etc.

My main motivation for writing tests is that they allow me to catch silent errors, quickly identify what has gone wrong, and jump into arbitrary points in the pipeline by putting in an `assert False` statement and using `pdb` (might not be best practice but I find it's fast), e.g.

pytest --pdb .
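A minimal example of what such a test could look like (the normalization under test is a stand-in, not the repo's actual code):

```python
import numpy as np

def normalize(image: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Stand-in per-band normalization: (x - mean) / std along the band axis."""
    return (image - mean[:, None, None]) / std[:, None, None]

def test_normalize_zero_mean_unit_std():
    # Statistical checks like this catch silent broadcasting errors early.
    rng = np.random.default_rng(0)
    x = rng.normal(5.0, 2.0, size=(3, 32, 32))
    mean = x.mean(axis=(1, 2))
    std = x.std(axis=(1, 2))
    out = normalize(x, mean, std)
    assert np.allclose(out.mean(axis=(1, 2)), 0.0, atol=1e-8)
    assert np.allclose(out.std(axis=(1, 2)), 1.0, atol=1e-8)
```

Running `pytest --pdb` on a file like this drops you into the debugger at the first failing assertion, which is the workflow described above.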
Key things to test:
Things to remember:
Nice to haves:
- Type hints, so it's clear whether something is a `Tensor` or a `List` or a `Dict` etc. Sometimes getting `mypy` to play nicely and not give any errors is a bit of a faff, so I would say it is more of a general principle than necessarily having all mypy checks passing. Ultimately, however, that is what you want.

EDIT: removed the link to the hypothesis library because, by hiding the inputs (they're generated by the library), it's harder to use the tests as a point for developer understanding of what's going on; it's easier to just generate your own test examples.