open-aims / adria-synthetic-data

Repository for the creation of synthetic input data layers for ADRIA.

License: MIT License

Python 100.00%

adria-synthetic-data's Introduction

ADRIA-synthetic-data

Set-up

Create the environment by running:

conda env create -f ADRIA_synth_data_env.yml

This environment can then be selected in your Python editor of choice.

Add the original data package you want to base the synthetic data on to the original_data folder.

Creating site data

Synthetic site data can be generated from the site-data-generation.py file in the examples folder. At the top of the file, set the name of your chosen original data package as orig_data_package = "name of file", then adjust the parameters N1, N2 and N3 as desired. N1 is the number of unconditioned samples to generate. N2 is the final number of spatially conditioned sites to generate. N3 is the number of nodes around which the final site positions are generated at randomised radii. Running the site data model automatically creates the synthetic data package, and the package name is given in the model outputs as synth_site_data_fn.
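To make the roles of N2 and N3 more concrete, here is a minimal sketch of scattering sites at randomised radii around a handful of nodes. It is not the repository's implementation: the function place_sites_around_nodes, the max_radius and extent parameters, and the uniform sampling are all assumptions made for illustration.

```python
import math
import random


def place_sites_around_nodes(n_sites, n_nodes, max_radius, extent=100.0, seed=None):
    """Scatter n_sites points at random radii around n_nodes random node centres."""
    rng = random.Random(seed)
    # Hypothetical node centres, drawn uniformly inside a square of side `extent`.
    nodes = [(rng.uniform(0, extent), rng.uniform(0, extent)) for _ in range(n_nodes)]
    sites = []
    for _ in range(n_sites):
        cx, cy = rng.choice(nodes)           # node this site clusters around
        r = rng.uniform(0, max_radius)       # randomised radius
        theta = rng.uniform(0, 2 * math.pi)  # random direction
        sites.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return nodes, sites


N2, N3 = 20, 3  # final number of sites, number of nodes
nodes, sites = place_sites_around_nodes(N2, N3, max_radius=5.0, seed=0)
```

Each returned site lies within max_radius of at least one node, so a small N3 gives a few tight clusters and a larger N3 spreads the sites more evenly.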

Creating initial coral cover data

Synthetic initial coral cover data can be generated from the coral-cover-generation.py file in the examples folder. At the top of the file, set the name of your chosen original data package as orig_data_package = "name of file" and the name of the synthetic site data file you want to base the cover data on, e.g. root_site_data_synth = "synth_2023-7-24_152038.csv".

Creating environmental data

Synthetic wave and DHW data can be generated from the env-data-generation.py file in the examples folder. Several inputs at the beginning of the file can be changed to adjust the output of the data model. An example is shown below:

layer = "Ub"
rcp = "45"
root_original_file = "name of file"
root_site_data_synth = "synth_2023-7-24_152038.csv"
nsamples = 10
nreplicates = 5

The layer variable designates the type of data to generate: dhw for DHW data and Ub for wave data. rcp is the RCP scenario in the original data file to generate data from. nsamples is the number of samples to generate from each climate replicate. nreplicates is the number of climate replicates to use from the original dataset. In the example above, the final dataset will have 10*5 = 50 replicates based on 5 replicates from the original dataset.
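The replicate count can be checked with a short sketch. The labelling of each synthetic replicate by an (original replicate, sample) pair is an assumption for illustration and not necessarily how env-data-generation.py orders its output.

```python
nsamples = 10
nreplicates = 5

# Label each synthetic replicate by (original replicate index, sample index).
synthetic_replicates = [
    (rep, sample) for rep in range(nreplicates) for sample in range(nsamples)
]
print(len(synthetic_replicates))  # 50
```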

Creating connectivity data

Synthetic connectivity data can be generated from the connectivity-generation.py file in the examples folder. Several inputs at the beginning of the file can be changed to adjust the output of the data model. An example is shown below:

root_original_file = "name of file"
root_site_data_synth = "synth_2023-7-24_152038.csv"
years = ["2015", "2016", "2017"]  # connectivity data years to use
num = ["1", "2", "3"]  # connectivity data sample number to use
model_type = "GAN"  # "GaussianCopula"

years designates which years of connectivity data to base the synthetic connectivity dataset on, with their average being used to generate the final dataset. num designates which replicates (sample numbers) to use in each year; these are also averaged over. model_type designates whether to use the GAN model from TensorFlow/Keras (slower but better quality) or the "GaussianCopula" model from SDV (faster but lower quality).
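The averaging over years and sample numbers can be pictured with the sketch below. The helper mean_matrix and the loaded dictionary of toy 2x2 matrices are hypothetical stand-ins; the real script reads connectivity files from the original data package.

```python
def mean_matrix(matrices):
    """Element-wise mean of equally sized matrices given as lists of rows."""
    count = len(matrices)
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(m[i][j] for m in matrices) / count for j in range(n_cols)]
        for i in range(n_rows)
    ]


years = ["2015", "2016"]
num = ["1"]

# Toy matrices standing in for connectivity files keyed by (year, sample number).
loaded = {
    ("2015", "1"): [[0.0, 0.25], [0.5, 0.0]],
    ("2016", "1"): [[0.0, 0.75], [1.0, 0.0]],
}

selected = [loaded[(y, s)] for y in years for s in num]
base_connectivity = mean_matrix(selected)
print(base_connectivity)  # [[0.0, 0.5], [0.75, 0.0]]
```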

Creating a synthetic JSON for the data package

A synthetic JSON file can be added to the synthetic data package using the file data-package-json-creation.py.

Creating the whole data package

The whole data package can be created by running data-package-creation.py. The same inputs need to be adjusted in this file as in the files described above, such as the sample-number parameters and the original dataset file name.

adria-synthetic-data's Issues

New site polygons overlap

Currently, synthetic site polygons (which are created as circles around site centroids satisfying the synthetically generated areas) may overlap. This should be fixed to avoid double counting when it comes to coral cover and site areas.
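One possible way to detect the problem is sketched below, assuming each site is a circle whose radius follows from its area (r = sqrt(area / pi)) and that coordinates and areas share consistent units. The function names are hypothetical, and the sketch only finds overlaps; it does not resolve them.

```python
import math
from itertools import combinations


def radius_from_area(area):
    """Radius of a circle with the given area."""
    return math.sqrt(area / math.pi)


def find_overlaps(sites):
    """Return index pairs of sites whose circles overlap; sites are (x, y, area)."""
    overlaps = []
    for (i, a), (j, b) in combinations(enumerate(sites), 2):
        dist = math.hypot(a[0] - b[0], a[1] - b[1])
        # Two circles overlap when their centres are closer than the sum of radii.
        if dist < radius_from_area(a[2]) + radius_from_area(b[2]):
            overlaps.append((i, j))
    return overlaps


sites = [(0.0, 0.0, math.pi), (1.5, 0.0, math.pi), (10.0, 0.0, math.pi)]
print(find_overlaps(sites))  # [(0, 1)]
```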

Increase range for random translation when anonymising synthetic site data positions

Currently, synthetic site data positions are translated by a random distance within a 2000 km by 2000 km box. Increase the size of this box for better anonymity.
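A minimal sketch of such a translation follows. It assumes projected coordinates in metres and a single shared offset for all sites (so relative positions are preserved); both are assumptions, and translate_sites and box_size are hypothetical names.

```python
import random


def translate_sites(sites, box_size, seed=None):
    """Shift all (x, y) sites by one random offset drawn from a square box."""
    rng = random.Random(seed)
    half = box_size / 2
    dx = rng.uniform(-half, half)
    dy = rng.uniform(-half, half)
    return [(x + dx, y + dy) for x, y in sites], (dx, dy)


sites = [(0.0, 0.0), (1000.0, 500.0)]
# 2000 km expressed in metres; a larger box_size widens the anonymisation range.
moved, offset = translate_sites(sites, box_size=2_000_000.0, seed=1)
```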

Compartmentalise data pre-processing and post-processing into functions

Synthetic data models require data in a certain form.
Currently, transformations are done in the data-fitting files, but they should be separated into their own functions for reuse.

datapackage.json should be constructed, not made from an altered original data file

Currently the datapackage.json file which accompanies each data package is made by altering a copy of the original data package. On discussion it was agreed it is better to create one from scratch to avoid back tracking.

Create functions to automate visual comparisons of synthetic and original datasets

Currently there are no function wrappers for visualisations comparing synthetic and original datasets.
Create functions which allow easy comparisons between synthetic and original data for tabular, spatial and temporal datasets.

Create better Readme.md for this repo

README should have quick set up and examples added.

Test models for creation of synthetic wave data

Currently, models have been created for synthetic site data, connectivity and DHW data.
Wave data is also a necessary input layer for the ADRIA model.
First try the temporal model from the SDV package, then try models from the Deep Echo package if that is unsuccessful.

Compare current connectivity model to simple probabilistic model

Currently, the GAN model for connectivity has a reasonably long run time and is prone to mode collapse. Compare this with a simple probabilistic model: fit Gaussians over several layers of connectivity data, run for multiple iterations and take the median of the outcomes.
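A rough sketch of what such a baseline might look like is given below. It fits an independent Gaussian to each matrix entry across layers, which is one reading of the proposal rather than an agreed design; sample_connectivity, n_iter and the clipping at zero are assumptions.

```python
import random
import statistics


def sample_connectivity(layers, n_iter=25, seed=None):
    """Per-entry Gaussian fit over layers, then the median of n_iter draws."""
    rng = random.Random(seed)
    n_rows, n_cols = len(layers[0]), len(layers[0][0])
    result = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            values = [layer[i][j] for layer in layers]
            mu = statistics.mean(values)
            sigma = statistics.pstdev(values)
            # Clip at zero (an assumption) so connection strengths stay non-negative.
            draws = [max(0.0, rng.gauss(mu, sigma)) for _ in range(n_iter)]
            row.append(statistics.median(draws))
        result.append(row)
    return result


layers = [
    [[0.0, 0.2], [0.4, 0.0]],
    [[0.0, 0.3], [0.5, 0.0]],
    [[0.0, 0.4], [0.6, 0.0]],
]
synthetic = sample_connectivity(layers, n_iter=51, seed=42)
```

Entries that are identical in every layer (here the zero diagonal) have zero variance and are reproduced exactly.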

Site data in ADRIA is now in a geopackage format

Synthetic site data cannot easily be created with geometries (synthetic data packages cannot easily create the geometry column).
This could perhaps be resolved with randomised placeholder geometry.

Add k-neighbours median when doing nearest neighbours interpolation

Currently the first nearest neighbour is used for spatial interpolation of connectivity and DHW data, but the median of the k nearest neighbours could be used for better interpolation.
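The difference between the two approaches can be sketched as follows. The brute-force search and the function name knn_median are assumptions for illustration; a spatial index would be preferable for large site counts.

```python
import math
import statistics


def knn_median(target, points, values, k=3):
    """Median of the values at the k points nearest to target."""
    order = sorted(range(len(points)), key=lambda i: math.dist(target, points[i]))
    return statistics.median(values[i] for i in order[:k])


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
values = [1.0, 2.0, 10.0, 100.0]
target = (1.6, 0.0)

print(knn_median(target, points, values, k=1))  # 10.0 (first nearest neighbour)
print(knn_median(target, points, values, k=3))  # 2.0 (median of three nearest)
```

With k=1 the result is set entirely by the single closest point, while the k=3 median is less sensitive to one unusual neighbour.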

Create function which automates creation of ADRIA data package

Currently DHW, connectivity and site data are created separately.
ADRIA requires input data in a particular datapackage format.
Create a function which automatically organises input data files into this structure, given the locations of the created synthetic data files as input.
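One shape such a function could take is sketched below. The layout mapping, the subfolder name "DHWs", the file name synth_dhw.nc and the function organise_package are all hypothetical; the actual folder structure ADRIA expects should be taken from the ADRIA documentation.

```python
import shutil
import tempfile
from pathlib import Path


def organise_package(package_dir, layout):
    """Copy files into package_dir/<subfolder>/ following layout.

    layout maps a subfolder name to a list of source file paths.
    """
    package_dir = Path(package_dir)
    copied = []
    for subfolder, sources in layout.items():
        target = package_dir / subfolder
        target.mkdir(parents=True, exist_ok=True)
        for src in sources:
            copied.append(Path(shutil.copy2(src, target)))
    return copied


# Demonstration with a placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src = tmp / "synth_dhw.nc"  # hypothetical file name
    src.write_text("placeholder")
    copied = organise_package(tmp / "synth_package", {"DHWs": [src]})
    demo_names = [p.name for p in copied]
```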