jannishoch / copro

(ML) model for computing conflict risk from climate, environmental, and societal drivers.

Home Page: https://copro.readthedocs.io/en/latest/

License: MIT License

Languages: Python 7.98%, Shell 0.03%, Jupyter Notebook 91.99%

Topics: climate, conflict, risk, projection, security, environment

copro's Introduction

CoPro

Welcome to CoPro, a machine-learning tool for conflict risk projections based on climate, environmental, and societal drivers.

(Badges: Travis CI build status, Read the Docs documentation status, latest GitHub release, open-source badge.)

Model purpose

As primary model output, CoPro provides maps of conflict risk.

To that end, it employs observed conflicts as target data together with (user-provided) socio-economic and environmental sample data to train different classifiers (RFClassifier, kNearestClassifier, and Support Vector Classifier). While the sample data keep their original units, the target data are converted to a Boolean value, where 0 indicates no conflict occurrence and 1 indicates occurrence. To capture the geographical variability of conflict and socio-environmental drivers, the model is spatially explicit and calculates conflict risk at a (user-specified) aggregation level. This way, the model can also capture the relevant sub-national variability of conflict and conflict drivers. Model robustness is determined using a split-sample test, where one part of the data is used to train the model and the other part to evaluate the outcome. Throughout this process, the geographical unit is tracked so that the resulting conflict risk can be mapped to the correct areas.
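
A minimal sketch of this split-sample approach, using scikit-learn with placeholder data (the variable names are illustrative, not CoPro's API):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# X: one row per polygon and year with socio-economic and environmental samples;
# y: Boolean conflict occurrence (0 = no conflict, 1 = conflict).
X = np.random.rand(1000, 5)                    # placeholder sample data
y = (np.random.rand(1000) > 0.8).astype(int)   # placeholder target data

# Split-sample test: one part trains the model, the other evaluates it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))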

In addition to calculating conflict risk, the model can, for instance, be used to make scenario projections, evaluate relative feature importances, or benchmark different datasets.

All in all, CoPro supports the mapping of current and future areas at risk of conflict, while also facilitating obtaining a better understanding of the underlying processes.

Installation

To install copro, first clone the code from GitHub. It is advised to create a dedicated Python environment first; you can then install the model package into this environment.

To do so, you need to have Anaconda or Miniconda installed. For installation guidelines, see the Anaconda documentation.

$ git clone https://github.com/JannisHoch/copro.git
$ cd path/to/copro
$ conda env create -f environment.yml
$ conda activate copro

To install CoPro in editable mode in this environment, run this command next in the CoPro-folder:

$ pip install -e .

When using Jupyter Notebook, it can be handy to have the copro environment available as a kernel. It can be registered with Jupyter Notebook using the following command:

$ python -m ipykernel install --name=copro

Command-line script

To be able to run the model, the conda environment has to be activated first.

$ conda activate copro

To run the model from command line, a command line script is provided. The usage of the script is as follows:

Usage: copro_runner [OPTIONS] CFG

Main command line script to execute the model.
All settings are read from the cfg-file.
One cfg-file is a required argument to train, test, and evaluate the model.
Multiple classifiers are trained based on different train-test data combinations.
Additional cfg-files for multiple projections can be provided as optional arguments, whereby each file corresponds to one projection to be made.
Per projection, each classifier is used to create separate projection outcomes per time step (year).
All outcomes are combined after each time step to obtain the common projection outcome.

Args:     CFG (str): (relative) path to cfg-file

Options:
-plt, --make_plots        add additional output plots
-v, --verbose             command line switch to turn on verbose mode

This help information can be also accessed with

$ copro_runner --help

All data and settings are retrieved from the settings-file (cfg-file), which needs to be provided as a command-line argument.

In case issues occur, updating setuptools may be required.

$ pip3 install --upgrade pip setuptools

Example data

Example data for demonstration purposes can be downloaded from Zenodo. To facilitate this process, the bash-script download_example_data.sh can be called in the example folder under /_scripts.

With this (or other) data, the provided configuration-files (cfg-files) can be used to perform a reference run or a projection run. All output is stored in the output directory specified in the cfg-files. In the output directory, two folders are created: one named _REF for output from the reference run, and one named _PROJ for output from the projection runs.

Jupyter notebooks

There are multiple Jupyter notebooks available to guide you through the model application process step-by-step.

It is possible to execute the notebooks cell-by-cell and explore the full range of possibilities. Note that in this case the notebooks need to be run in the correct order, as some intermediate results are saved to file in one notebook and loaded in another! This is due to the re-initialization of the model at the beginning of each notebook and the resulting deletion of all files in existing output folders.

The notebooks are also used to exemplify the Workflow of CoPro.

Command-line

While the notebooks are great for exploring, the command line script is the envisaged way to use CoPro.

To test the model for the reference situation and one projection, the cfg-file for the reference run is the required argument; this cfg-file in turn needs to point to the cfg-file of the projection.

$ cd path/to/copro/example
$ copro_runner example_settings.cfg

Alternatively, the same commands can be executed using a bash-file.

$ cd path/to/copro/example/_scripts
$ sh run_command_line_script.sh

Validation

The reference model makes use of the UCDP Georeferenced Event Dataset for observed conflict. The selected classifier is trained and validated against this data.

Main validation metrics are the ROC-AUC score as well as accuracy, precision, and recall. All metrics are reported and written to file per model evaluation.

With the example data downloadable from Zenodo, a ROC-AUC score of above 0.8 can be obtained. Note that with additional and more explanatory sample data, the score will most likely increase.
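
For illustration, these metrics can be computed with scikit-learn as follows; the arrays below are placeholders, not CoPro output:

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

y_test  = np.array([0, 1, 0, 1, 1, 0, 0, 1])           # observed conflict (Boolean)
y_pred  = np.array([0, 1, 0, 0, 1, 0, 1, 1])           # predicted class
y_score = np.array([.1, .9, .2, .4, .8, .3, .6, .7])   # predicted probability of class 1

print('accuracy :', accuracy_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred))
print('ROC-AUC  :', roc_auc_score(y_test, y_score))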

(Figure: example ROC curve, docs/_static/roc_curve.png)

Additional ways to validate the model are showcased in the Workflow.

Documentation

Extensive model documentation including full model API description can be found at http://copro.rtfd.io/

Code of conduct and Contributing

The project welcomes contributions from everyone! To make collaborations as pleasant as possible, we expect contributors to the project to abide by the Code of Conduct.

License

CoPro is released under the MIT license.

Authors

  • Jannis M. Hoch (Utrecht University)
  • Sophie de Bruin (Utrecht University, PBL)
  • Niko Wanders (Utrecht University)

Corresponding author: Jannis M. Hoch ([email protected])


copro's Issues

indicate plots better

indicate better in the plots of the leave-one-out validation which variable was left out or whether all data is used (add to plot title)

add tests

for JOSS and Travis-CI, we need some test functions

joss/editing

"The main goal of this model is to apply machine learning techniques to make projections of future areas at risk." -> expand to convey what risk means here, how it is measured, etc.

I couldn't easily find how the model is validated, any results we may see, etc. I would recommend explicitly linking a bunch of that via the readme.rst

consider all polygons, and what to do if one polygon appears more than once in y_test

due to the random sampling of data points from the X and Y arrays, it can happen that not all polygons are represented in the test sample X_test/y_test. This is not so much a problem if we look only at aggregate evaluation criteria like the ROC value and such, but if we plot the polygons, some may stay empty. That's not good.

Once #61 is solved, it should be ensured that all polygons are represented.

Besides, even with n=1 executions of the model, one polygon can appear multiple times in X_test/y_test. Each time, the prediction can be right or wrong; most likely it is neither always correct nor always wrong, but varies between predictions. It is therefore necessary to create an overall value based on the (average?) accuracy of the prediction per polygon. Or something else - think!
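
A minimal sketch of such a per-polygon aggregation with pandas (hypothetical column names):

import pandas as pd

# Hypothetical per-prediction log: polygon ID and whether the prediction was correct.
df = pd.DataFrame({
    'polygon_id': [1, 1, 2, 2, 2, 3],
    'correct':    [1, 0, 1, 1, 0, 1],
})

# Average prediction accuracy per polygon across all its test occurrences.
per_polygon = df.groupby('polygon_id')['correct'].agg(
    n_predictions='count', average_hit='mean')
print(per_polygon)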

model via pip

would also be cool to get the model via pip or even conda

only sub-set of conflicts if boolean operation is applied

with the current set-up, only those polygons where conflict took place remain in the dataframe. Polygons without conflict don't show up anymore, thus yielding a scattered figure of polygons. This continues in subsequent functions, i.e. the zonal statistics of the nc-file.

It would be better to keep all polygons in the dataframe, including those which are assigned a 0 in the boolean operation.

gamma and C values in SVC

test sensitivity towards higher C and gamma values in SVC; also test poly kernel and assess sensitivity of degree values
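
Such a sensitivity test could be set up as a grid search, for instance (a sketch with illustrative parameter ranges, not a prescribed configuration):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=42)  # placeholder data

# Test higher C and gamma values for the rbf kernel, and degree values for poly.
param_grid = [
    {'kernel': ['rbf'],  'C': [1, 10, 100], 'gamma': ['scale', 0.1, 1.0]},
    {'kernel': ['poly'], 'C': [1, 10, 100], 'degree': [2, 3, 4]},
]
search = GridSearchCV(SVC(), param_grid, scoring='roc_auc', cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)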

fix output paths

the output folders with sub-sub-sub directories are leftovers from the past model structure.

Remove these and just use one global output directory, the one specified in the cfg-file.

more click scripts via setuptools

make all scripts executable from command line, maybe even with click groups.

for example, have something like 'copro download_example', 'copro run' etc.
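
A minimal sketch of what such a click group could look like (command names and bodies are hypothetical):

import click

@click.group()
def cli():
    """CoPro command-line interface (sketch)."""

@cli.command('download_example')
def download_example():
    """Download the example data (placeholder)."""
    click.echo('downloading example data ...')

@cli.command('run')
@click.argument('cfg', type=click.Path(exists=True))
def run(cfg):
    """Run the model with settings from CFG (placeholder)."""
    click.echo(f'running model with {cfg}')

if __name__ == '__main__':
    cli()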

single variable model

after having checked how well the model(s) would perform if we only took randomly sampled Ys, how well would the model(s) perform if we used only one variable to predict?

Is it possible to identify one (or more) variables that are really key for predicting Y? This would help identify, in combination with the LOO analysis, which variables are really driving conflict in our model(s) and which are not.

support non-selection of climate zones

the current set-up requires a selection of conflicts and water provinces for one or more Koeppen-Geiger climate zones.

It may be interesting to turn off this selection.

Thus, add code to make this work, e.g. by allowing 'None' in the climate section of the cfg-file.

separate analysis for conflict predictions

right now, we analyse model accuracy for predicted 0s and 1s.

since predicting 1s is much harder, it would be interesting to see how good the model is if we only select those entries in y_test that contain a 1 and compare them with y_pred/y_score.

add post-processing

to make visualization etc. more straightforward, add a script to facilitate post-processing.

add conflict at t-1 as sample data

the history of conflict is important: if there was conflict in the previous year, it is more likely that conflict will occur in the current year as well.
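
A sketch of how such a lagged feature could be derived with pandas (hypothetical column names):

import pandas as pd

# Hypothetical long-format frame: one row per polygon per year.
df = pd.DataFrame({
    'polygon_id': [1, 1, 1, 2, 2, 2],
    'year':       [2000, 2001, 2002, 2000, 2001, 2002],
    'conflict':   [0, 1, 1, 0, 0, 1],
})

# Conflict at t-1: shift the conflict column by one year within each polygon.
df = df.sort_values(['polygon_id', 'year'])
df['conflict_t-1'] = df.groupby('polygon_id')['conflict'].shift(1)
print(df)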

define output files

what should be output:

  • gdf with all data points
  • gdf with data per polygon
  • evaluation dict
  • ROC curve
  • ?

software installation/download --- JOSS

  1. I am surprised that the package is 200 MB. It is unacceptably large for a package that doesn't technically ship any models but only the code for how to estimate and validate. I would recommend finding ways to trim this.

  2. python setup.py develop doesn't work on Windows

Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    from setuptools import setup, find_packages
ImportError: cannot import name 'setup'

https://stackoverflow.com/questions/32380587/importerror-cannot-import-name-setup

apply model per country

In the cfg-file, the conflicts can also be filtered for an individual country. This would reduce the number of conflicts drastically.

However, the number of water provinces is not reduced to the given country, thus introducing an immense imbalance in the model.

As a solution, remove the country option from the cfg-file.

The model can then be run for an individual country by simply providing a shp-file for this given country. Conflicts are then clipped with geopandas to the extent of the shp-file.
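
A sketch of such a clip with geopandas (the file paths are hypothetical):

import geopandas as gpd

conflicts = gpd.read_file('conflicts.shp')   # hypothetical input files
country = gpd.read_file('country.shp')

# Keep only the conflicts falling within the extent of the country shp-file.
conflicts_clipped = gpd.clip(conflicts, country)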

add extra aggregation level

to compute the probability of detection and the false alarm rate over the entire time series, add an aggregation level (to be provided with a shp-file) to the model. All data points are then aggregated to this level before performing the computations.

reproducibility --- JOSS

generally, to make model outputs for random forests etc. completely reproducible, you need to set the same seed. I would recommend adding a default seed plus letting people pick a particular seed.
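
A minimal sketch of such seeding with scikit-learn (the seed value and data are placeholders):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # a default seed; could be made user-configurable via the cfg-file

rng = np.random.default_rng(SEED)
X = rng.random((100, 5))                   # placeholder data
y = (rng.random(100) > 0.8).astype(int)

# Passing the same seed to the split and the classifier makes runs reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED)
clf = RandomForestClassifier(random_state=SEED)
clf.fit(X_train, y_train)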

assess model quality re. conf

we know after n model repetitions how many predictions were made per polygon and how often the model prediction was correct, both for conflict and non-conflict combined and for conflict only.

this gives an overall good impression of model performance.

however, how do we show which polygons are now predicted to be 'at risk'? With multiple predictions made per polygon, not every prediction is only 'conflict' or only 'non-conflict', so we will have to deal with a melange of predictions. What are good means to visualize this per polygon?

execute model n times

depending on which data points are selected for the training and test samples (this happens randomly!), the model output differs.

To account for this, the model would have to be run n times (e.g. 1000) and the outputs averaged to get a solid result.
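
A sketch of such repeated execution with averaged output (placeholder data, illustrative repetition count):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X = np.random.rand(500, 5)                    # placeholder data
y = (np.random.rand(500) > 0.8).astype(int)

scores = []
for _ in range(100):  # n repetitions, each with a different random split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    clf = RandomForestClassifier().fit(X_train, y_train)
    scores.append(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

print('mean ROC-AUC over repetitions:', np.mean(scores))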

update analysis

  • remove the scatter plot of 'total_hits' and 'average_hit' because it's a no-brainer... clearly, the more correct predictions you have, the higher the chance that the fraction of correct predictions is high too.
  • classify the polygons along two axes, 'average_hit' and 'nr_of_test_confl', and categorize the polygons in 4 groups: high accuracy and many conflict samples, high accuracy and few conflict samples, low accuracy and many conflict samples, and low accuracy and few conflict samples (see the sketch below). This could bring insights into where the model results are more robust than elsewhere...
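
A sketch of that two-axis classification with pandas (the thresholds are illustrative):

import numpy as np
import pandas as pd

# Hypothetical per-polygon summary (see the aggregation sketch above).
df = pd.DataFrame({
    'average_hit':      [0.9, 0.4, 0.8, 0.3],
    'nr_of_test_confl': [20, 25, 3, 2],
})

# Split both axes at illustrative thresholds and combine into four groups.
acc = np.where(df['average_hit'] >= 0.5, 'high accuracy', 'low accuracy')
num = np.where(df['nr_of_test_confl'] >= 10, 'many conflict samples', 'few conflict samples')
df['group'] = [f'{a}, {b}' for a, b in zip(acc, num)]
print(df)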

support more conflict types

right now, it is only possible to specify one type of conflict in the cfg-file. Change that such that multiple values can be provided and the selection procedure is adapted accordingly.

automatically load and loop through vars

instead of having a function call per input variable in the script, it would be more efficient to have a loop over all values listed in the config-file. The user could then just specify an arbitrary number of entries with file paths there, and the model would go through all of them.
https://stackoverflow.com/questions/22068050/iterate-over-sections-in-a-config-file

this would also need to include a detection of how the time variable in the nc-file is defined, to call the right function per input file.
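
A minimal sketch of such a loop with configparser (the section and file names are hypothetical):

from configparser import ConfigParser

cfg = ConfigParser()
cfg.read('example_settings.cfg')  # hypothetical settings file

# Loop over all variables listed in a hypothetical [data] section,
# instead of one hard-coded function call per input variable.
for name, path in cfg.items('data'):
    print(f'processing variable {name} from {path}')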

relative paths

Make the input and output paths in the settings file relative so they don't need to be updated per user.

use pickle or similar to load pre-computed XY

right now, a lot of waiting time is needed to produce the XY array. If the overall settings do not change, however, this array should not change between runs.

It would save a lot of time if we could import pre-computed XY data and start from there.

Also, this could be useful for demonstration purposes where time is limited and no one wants to wait for the looping through years and input data.
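
A sketch of such caching with pickle (the file name and array are placeholders):

import os
import pickle
import numpy as np

XY_FILE = 'XY.pkl'  # hypothetical cache file

if os.path.isfile(XY_FILE):
    # Start from the pre-computed array and skip the expensive looping.
    with open(XY_FILE, 'rb') as f:
        XY = pickle.load(f)
else:
    XY = np.random.rand(1000, 6)  # placeholder for the expensive computation
    with open(XY_FILE, 'wb') as f:
        pickle.dump(XY, f)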

make sampling of data for ML model more generic

Currently, we create a big dataframe first with one new column per year per step. For the ML model, however, we do not need them in separate columns, but in one column per variable with entries for all years. This can perfectly be done outside of the main dataframe containing the geometry information, as the geometry information can be dropped for the ML model.

Thus, update all functions to return only one column with the data input needed for ML, and make this all more generic too, i.e. less dependent on column names.

assess relative importance/sensitivity of individual variables

how can I see what the relative influence of an input variable is in predicting our target variable? Is it possible to somehow get a list with the ordered relative importance of the variables? Or simply test relative importance by 'leaving one out', i.e. running n runs (n = number of variables) with each run using n-1 variables?
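
Random forests expose such an ordered list directly; a minimal sketch with placeholder data and illustrative variable names:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = pd.DataFrame(np.random.rand(500, 4),
                 columns=['precipitation', 'temperature', 'gdp', 'population'])
y = (np.random.rand(500) > 0.8).astype(int)

clf = RandomForestClassifier().fit(X, y)

# Impurity-based importances, ordered from most to least influential.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))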

dubbelsteenmodel (dice model)

run the model with Y containing only 0s, or with Y containing a randomly distributed percentage of 1s (the % of conflict in all data). Do not change the selection criteria for conflicts then!

make rasterstats function more generic

Find a way to make the stat_func argument in the module 'get_var_from_nc' settable by the user, for instance via the cfg-file. Now it can in principle be changed, but there is no handle to access it.
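
A sketch of how the statistic could be read from the cfg-file and passed on to rasterstats (the section name and file paths are hypothetical):

from configparser import ConfigParser
from rasterstats import zonal_stats

cfg = ConfigParser()
cfg.read('example_settings.cfg')  # hypothetical settings file

# Read the statistic from the cfg-file instead of hard-coding it,
# falling back to 'mean' when nothing is specified.
stat_func = cfg.get('general', 'zonal_stat', fallback='mean')

stats = zonal_stats('polygons.shp', 'variable.tif', stats=stat_func)  # hypothetical paths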
