PEARL ML Training Pipeline
This repo contains scripts to manage training data, workflow to create Azure ML stack and train new models that are compatible to be run on the PEARL Platform. It is based on the work on Caleb Robinson of Microsoft.
Training
-
Monitor experiments and training runs on Azure ML
-
Training Repo
-
DeepLabv3Plus Architecture + focal loss seems most promising approach
Evaluation
- We run the model over the test data set, and use the per class
SEED Data
How/Why we create Seed Data
- We have seed data for each model so during retraining the user doesn’t have to add samples for each class, so we can use the weights/biases from the retraining logistic regression sklearn model to update the weights/biases of the deep learning model and then run inference on the GPU
- The retraining seed data should have same class distribution ratios as the original training data (ie 10% water, 50% trees ect)
- I’ve been generating retraining data using the GPU enabled Azure notebooks (these should ideally be converted into scripts)
- Seed Data Creation Script
Training Dataset Creation
There are two options to create the training dataset.
Option 1. Feed LULC labels data in GeoTiff format.
naip-label-align.py and NAIPTileIndex.py provided functions on how to:
Notes:
- Install libspatialindex (dep of
rtree
which is not installed automaticaly)brew install spatialindex
- align given LULC labels to available NAIP imagery tiles on Azure public Blob;
- filter out nodata tiles;
- create name conventions;
- write it to CSVs for train, validation and test dataset by 70:20:10.
- Script will write the tiled label geoTIFF into out_dir. These files can then be uploaded to Azure blob storage
These CSVs can be deployed to AML for model training direction. Instruction will be given in the following section.
python naip-label-align.py
--label_tif_path sample.tif
--out_dir <dir-name>/
--threshold [0.0 to 1.0]
--aoi <aoi-name>
--group <group-name>
Option 2. LULC labels available as GeoJSON (vector) files, and rasterization is required.
-
Firstly, NAIP imagery that overlap with LULC label data is needed to be downloaded before the rasterization task. naip_download_pc.ipynb provides script and documentation on how you can download NAIP imagery to your AOI from MS Plentary Computer.
-
Secondly, LULC label rasterization functions and steps provided in label_rasterize.ipynb The rasterization in the order of (tree canopy on the top of the lulc layer or burn last, other_impervious on the bottom or it should be rasterized first in the order)
tree_canopy
building
water
bare_soil
roads_railroads
grass_shrub
other_impervious
Details see the notebook.
Model Training on Azure ML(AML)
If you are going to use AML to train LULC models for the first time, please go through these steps.
Configure environment
This code was tested using python 3.6.5
Create a conda environment using .pytorch-env.yaml
file and execute the scripts from the created environment.
You will need to set the following variables in your .env
bash
AZ_TENANT_ID=XXX #az account show --output table
AZ_SUB_ID=XXX #az account list --output table
AZ_WORKSPACE_NAME=XXX #User set
AZ_RESOURCE_GROUP=XXX #User set
AZ_REGION=XXX #User set
AZ_GPU_CLUSTER_NAME=XXX #User set
AZ_CPU_CLUSTER_NAME=XXX #User set
Then export all variables to your environment:
export $(cat .env);
Create Your Workspace on AML
train_azure/create_workspace.py after export your Azure credentials, this script will create AML workspace.
Create GPU Compute
This script will create GPU compute resources to your workspace on AML.
(Optional) Create CPU Compute
This script will create GPU compute resources to your workspace on AML.
Train LULC Model on AML
We have three PyTorch based Semantic Segmenation models ready for LULC model trainings, FCN, UNet and DeepLabV3+.
To train a model on AML, you will need to define or parse a few crucial parameters to the script, for instance:
TODO: Will we be providing sample csv
config = ScriptRunConfig(
source_directory="./src",
script="train.py",
compute_target=AZ_GPU_CLUSTER_NAME,
arguments=[
"--input_fn",
"sample_data/indianapolis_train.csv",
"--input_fn_val",
"sample_data/indianapolis_val.csv",
"--output_dir",
"./outputs",
"--save_most_recent",
"--num_epochs",
20,
"--num_chips",
200,
"--num_classes",
7,
"--label_transform",
"uvm",
"--model",
"deeplabv3plus",
],
)
These parameters are to be configure by the user. input_fn_X
paths should be provided by the user, and are the outputs of the data generation step (NAIP Label Algin) described above.
python train_azure/run_model.py
Evaluate the Trained Model
To compute Global F1, and class base F1 scores (written in CSV) from a trained model over latest dataset. You can use this eval script as an example.
python train_azure/run_eval.py
Seed Data Creation for PEARL
After a best performing model is selected, seed dataseed need to be created to serve PEARL. Seed Data is the model embedding layers from the trained model that is used together with users inputs training data in PEARL retraining session.
run_seeddata_creation.py will config AML and use the main seeddata creation script to create seed data for the trained best performing model.
python train_azure/run_seeddata_creation.py
(Optional) Classes Distribution
LULC Class distribution is a graph showing the porpotion of LULC pixel numbers for a trained model on PEARL. See the bar chart bellow.
train_azure/run_cls_distrib.py will guide you how to compute the classes distribution from the training dataset for the model.
python train_azure/run_cls_distrib.py