An R workflow for curation of Philippine Atmospheric, Geophysical, and Astronomical Services Administration (PAGASA) datasets

This repository is a Docker-containerised, {targets}-based, {renv}-enabled R workflow for the retrieval, processing, and curation of various publicly available datasets from the Philippine Atmospheric, Geophysical, and Astronomical Services Administration (PAGASA).

Why `paglaom`?

The word paglaom (pronounced /paɡˈlaʔom/, [pʌɡˈl̪a.ʔɔm]) is Bisaya for hope. Bisaya is one of up to 187 languages spoken in the Philippines in addition to Filipino, the national language, and English, the language of instruction in the country. PAGASA, the national meteorological and hydrological services agency of the Philippines, draws its name from the Filipino word pag-asa, which also means hope. The repository name is therefore a play on these words and a way to showcase the richness and diversity that exists in the Philippines.

The paglaom project aims to maintain a database of curated datasets on various atmospheric, geophysical, and astronomical phenomena that PAGASA makes publicly available on its website. These datasets tend to be summaries of the much larger volume of data that PAGASA collects at high frequency. They also tend to be in formats that are not machine-readable (e.g., PDF, PNG, HTML), meant for reporting to the Philippine population rather than for use as datasets in academic or professional research. PAGASA does provide more granular and expansive datasets for research purposes through a specific data request process. The paglaom project doesn’t aim to perform research on the summarised datasets provided publicly on the PAGASA website. Rather, the project aims to showcase publicly available data that can be used for educational purposes, for example:

for students who need to make a report on topics covered by PAGASA’s summarised data for a school assignment or project;
for individuals who have a specific interest in one of the natural phenomena that PAGASA monitors and would like to get raw summarised data in a format that is usable and transferable into other formats;
for data visualisation learners and aficionados who want to try working with data on the various natural phenomena available from PAGASA and create unique and interesting plots and graphics.

The broader, more blue-sky vision of the paglaom project is to contribute to the growing interest in science, technology, engineering, and mathematics (STEM), particularly in the Philippines, with a collection that showcases topics and data that are homegrown and embedded in the fabric of Philippine life.

Whilst the paglaom project, by its name and the nature of the data it curates, has an inherent Filipino audience, it is hoped that those outside of the Philippines will also find the information within useful in contexts similar to those described above.

Repository Structure

The project repository is structured as follows:

paglaom
    |-- .github/
    |-- data/
    |-- data-raw/
    |-- outputs/
    |-- R/
    |-- reports/
    |-- renv/
    |-- renv.lock
    |-- .Rprofile
    |-- packages.R
    |-- _targets_climate.R
    |-- _targets_cyclones.R
    |-- _targets_heat.R
    |-- _targets_setup.R
    |-- _targets.R
    |-- _targets.yaml

.github/ contains workflows for project testing and automated deployment of outputs via continuous integration and continuous deployment (CI/CD) using GitHub Actions.
data/ contains intermediate and final data outputs produced by the workflow.
data-raw/ contains raw datasets, usually either downloaded from source or added manually, that are used in the project. This directory is empty given that the raw datasets used in this project are restricted and are only distributed to eligible members of the project. This directory is kept here to maintain reproducibility of project directory structure and ensure that the workflow runs as expected.
outputs/ contains compiled reports and figures produced by the workflow.
R/ contains functions developed/created specifically for use in this workflow.
reports/ contains literate code for R Markdown reports rendered in the workflow.
renv/ contains renv-specific files and directories used by the package for maintaining R package dependencies within the project. The directory renv/library is a library that contains all packages currently used by the project. This directory, and all files and sub-directories within it, are generated and managed by the renv package. Users should not edit these manually.
renv.lock file is the renv lockfile which records enough metadata about every package used in this project that it can be re-installed on a new machine. This file is generated by the renv package and should not be changed/edited manually.
.Rprofile file is a project R profile generated when initiating renv for the first time. This file is run automatically every time R is run within this project, and renv uses it to configure the R session to use the renv project library.
packages.R file lists out all R package dependencies required by the workflow.
_targets*.R files define the steps in the workflow’s data ingest, data processing, data analysis, and reporting pipelines.
_targets.yaml file defines the different targets sub-projects within this project.
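
As a rough sketch of how one of these sub-projects might be run from an R session at the repository root: the targets package picks a project from its YAML configuration file through the TAR_PROJECT environment variable. The helper name run_subproject() and the project name "cyclones" below are assumptions for illustration only; check _targets.yaml for the project names actually defined.

```r
# Hypothetical helper (not part of the repository): run one targets sub-project.
run_subproject <- function(project, config = "_targets.yaml") {
  # Fail early when the configuration file cannot be found,
  # e.g. when not called from the repository root
  if (!file.exists(config)) {
    stop("Cannot find ", config, "; run this from the repository root.")
  }
  # targets reads TAR_CONFIG for the configuration file path and
  # TAR_PROJECT for the project to use within that file
  Sys.setenv(TAR_CONFIG = config, TAR_PROJECT = project)
  targets::tar_make()
}

# Usage (the project name "cyclones" is an assumption):
# run_subproject("cyclones")
```

Setting the environment variables affects the rest of the R session, so later calls to targets functions will keep using the same sub-project until the variables are changed.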

The workflow

Currently, the project curates the following datasets:

Tropical cyclone data for cyclones entering the Philippine Area of Responsibility since 2017;
Daily heat index data from various data collection points in the Philippines;
Climatological extremes and normals data over time; and,
Daily dam water level data.

Reproducibility

R package dependencies

This project was built using R 4.3.3. This project uses the renv framework to record R package dependencies and versions. Packages and versions used are recorded in renv.lock and code used to manage dependencies is in renv/ and other files in the root project directory. On starting an R session in the working directory, run renv::restore() to install R package dependencies.
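
A minimal sketch of that restore step, with an added check that the running R version matches the one the project was built with. The helper name check_r_version() is hypothetical and not part of the repository.

```r
# Hypothetical helper: compare the running R version with the expected one.
check_r_version <- function(expected = "4.3.3") {
  matches <- getRversion() == package_version(expected)
  if (!matches) {
    message(
      "Project was built with R ", expected,
      " but this session runs R ", as.character(getRversion()), "."
    )
  }
  invisible(matches)
}

check_r_version()

# Then install the package versions recorded in renv.lock
# (run from the project root, where .Rprofile activates renv):
# renv::restore()
```

A version mismatch does not necessarily stop the workflow from running, which is why the check only reports the difference rather than raising an error.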

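Reading PNG files and extracting data

Some PAGASA data are published in image formats such as PNG. One possible approach is to run optical character recognition (OCR) on the image with the magick and tesseract packages, something like this: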

library(magick)
#> Linking to ImageMagick 7.1.0.31
#> Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp, x11
#> Disabled features: fftw, ghostscript
#> Using 12 threads
library(tesseract)
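# Read the image, then preprocess it to make it easier to OCR
input <- image_read("https://i.stack.imgur.com/JxGHc.png") %>%
  image_convert(type = 'Grayscale') %>%  # drop colour information
  image_deskew() %>%                     # straighten rotated text
  image_resize("2000x") %>%              # upscale to 2000 pixels wide
  ocr()                                  # return the recognised text as a string

# Parse the recognised text as a delimited table
df <- data.table::fread(text = input)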
#> Warning in data.table::fread(text = input): Detected 11 column names but the
#> data has 12 columns (i.e. invalid file). Added 1 extra default column name for
#> the first column which is guessed to be row names or an index. Use setnames()
#> afterwards if this guess is not correct, or fix the file write command that
#> created the file to create a valid file.
df
#>     V1     info    tmax ACREAGE                               GLOBALID
#>  1:  1 PRISM_tm 30.3976  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  2:  2 PRISM_tm 26.0226  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  3:  3 PRISM_tm 27.1775  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  4:  4 PRISM_tm  24,164  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  5:  5 PRISM_tm  24.458  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  6:  6 PRISM_tm  26.118  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  7:  7 PRISM_tm  27.259  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  8:  8 PRISM_tm  30.105  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>  9:  9 PRISM_tm  30.697  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 10: 10 PRISM_tm   32949  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 11: 11 PRISM_tm  32,966  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 12: 12 PRISM_tm  32.081  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 13: 13 PRISM_tm  29.847  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 14: 14 PRISM_tm  27.576  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 15: 15 PRISM_tm  24.671  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 16: 16 PRISM_tm  24.382  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 17: 17 PRISM_tm  24.382  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 18: 18 PRISM_tm  26.365  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 19: 19 PRISM_tm  29.246  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 20: 20 PRISM_tm  30.737  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 21: 21 PRISM_tm  31.658  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 22: 22 PRISM_tm  31.386  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 23: 23 PRISM_tm   32457  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 24: 24 PRISM_tm  32.093  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 25: 25 PRISM_tm  30.303  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 26: 26 PRISM_tm  26.231  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#> 27: 27 PRISM_tm  25.956  783805 {257865E5-DA82-41F8-B679-169C60B2BB4D}
#>     V1     info    tmax ACREAGE                               GLOBALID
#>     datasource variable    datatype resolutior    Date year month
#>  1:      PRISM     tmax provisional      4kmM3 2021-10 2021    10
#>  2:      PRISM     tmax provisional      4kmM3 2021-11 2021    11
#>  3:      PRISM     tmax provisional      4kmM3 2021-12 2021    12
#>  4:      PRISM     tmax      stable      4kmM3 2005-01 2005     1
#>  5:      PRISM     tmax      stable      4kmM3 2005-02 2005     2
#>  6:      PRISM     tmax      stable      4kmM3 2005-03 2005     3
#>  7:      PRISM     tmax      stable      4kmM3 2005-04 2005     4
#>  8:      PRISM     tmax      stable      4kmM3 2005-05 2005     5
#>  9:      PRISM     tmax      stable      4kmM3 2005-06 2005     6
#> 10:      PRISM     tmax      stable      4kmM3 2005-07 2005     7
#> 11:      PRISM     tmax      stable      4kmM3 2005-08 2005     8
#> 12:      PRISM     tmax      stable      4kmM3 2005-09 2005     9
#> 13:      PRISM     tmax      stable      4kmM3 2005-10 2005    10
#> 14:      PRISM     tmax      stable      4kmM3 2005-11 2005    11
#> 15:      PRISM     tmax      stable      4kmM3 2005-12 2005    12
#> 16:      PRISM     tmax      stable      4kmM3 2006-01 2006     1
#> 17:      PRISM     tmax      stable      4kmM3 2006-02 2006     2
#> 18:      PRISM     tmax      stable      4kmM3 2006-03 2006     3
#> 19:      PRISM     tmax      stable      4kmM3 2006-04 2006     4
#> 20:      PRISM     tmax      stable      4kmM3 2006-05 2006     5
#> 21:      PRISM     tmax      stable      4kmM3 2006-06 2006     6
#> 22:      PRISM     tmax      stable      4kmM3 2006-07 2006     7
#> 23:      PRISM     tmax      stable      4kmM3 2006-08 2006     8
#> 24:      PRISM     tmax      stable      4kmM3 2006-09 2006     9
#> 25:      PRISM     tmax      stable      4kmM3 2006-10 2006    10
#> 26:      PRISM     tmax      stable      4kmM3 2006-11 2006    11
#> 27:      PRISM     tmax      stable      4kmM3 2006-12 2006    12
#>     datasource variable    datatype resolutior    Date year month

Source: https://stackoverflow.com/questions/73238598/r-extract-text-from-image-and-export-it-as-a-csv
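
The OCR output above is not clean: some tmax values come back with a comma in place of the decimal point (24,164) or with the decimal point dropped altogether (32949). A minimal base R sketch of one way to repair such values is given here. The helper name clean_ocr_numeric() is hypothetical, and the rule it applies assumes that every value has two digits before the decimal point, which holds for the temperatures shown but would need adjusting for other variables.

```r
# Hypothetical helper: repair common OCR slips in a numeric column.
# Assumes each value has `int_digits` digits before the decimal point.
clean_ocr_numeric <- function(x, int_digits = 2) {
  x <- trimws(as.character(x))
  # A comma read in place of the decimal point becomes a point
  x <- gsub(",", ".", x, fixed = TRUE)
  # Where the decimal point was lost, reinsert it after `int_digits` digits
  no_point <- !grepl(".", x, fixed = TRUE) & nchar(x) > int_digits
  x[no_point] <- paste0(
    substr(x[no_point], 1, int_digits), ".",
    substring(x[no_point], int_digits + 1)
  )
  as.numeric(x)
}

# Values taken from the tmax column of the output above
tmax_raw <- c("30.3976", "24,164", "32949", "32,966", "32457")
clean_ocr_numeric(tmax_raw)
```

A rule like this cannot detect misread digits, so cleaned values are best spot-checked against the source image.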
