
The analytics scripts used to calculate and generate all time series (etc.) graphs

Home Page: https://www.gbif.org/analytics/global

License: Apache License 2.0


Analytics

This is the source repository for the site https://www.gbif.org/analytics.


What are the analytics?

GBIF captures various metrics to enable monitoring of data trends.

The development is being done in an open manner, to enable others to verify procedures, contribute, or fork the project for their own purposes. The results are visible at https://www.gbif.org/analytics/global and show global and country-specific charts illustrating the changes observed in the GBIF index since 2007.

Please note that all samples of the index have been reprocessed with consistent quality control and against the same taxonomic backbone to enable comparisons over time. This is the first time such an analysis has been possible, thanks to the adoption of the Hadoop environment at GBIF, which enables large-scale analysis. In total, approximately 32 billion records (to January 2021) are analysed for these reports.

Project structure

The project is divided into several parts:

  • Hive and Sqoop scripts which are responsible for importing historical data from archived MySQL database dumps
  • Hive scripts that snapshotted data from the message-based real-time indexing system which served GBIF between late 2013 and Q3 2019
  • Hive scripts that snapshot recent data from the latest GBIF infrastructure (the real time indexing system currently serving GBIF)
  • Hive scripts that process all data to the same quality control and taxonomic backbone
  • Hive scripts that digest the data into specific views suitable for download from Hadoop and further processing
  • R and Python scripts that process the data into views per country
  • R and Python scripts that produce the static charts for each country

Setup

These steps are required for a new environment. It is probably easiest to use the Docker image.

  • Install the yum packages R, cairo and cairo-devel
  • Run Rscript R/install-packages.R (it may be necessary to set the R_LIBS_USER environment variable); a sketch of both steps is shown below
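
A minimal sketch of the setup, assuming a yum-based host and a user-writable R library path (the R_LIBS_USER path below is illustrative):

sudo yum install -y R cairo cairo-devel
export R_LIBS_USER="$HOME/R/library"   # illustrative path; only needed if the default library is not writable
mkdir -p "$R_LIBS_USER"
Rscript R/install-packages.R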

Steps for adding a new snapshot and then re-running the processing

  • This will only work on a Cloudera Manager managed gateway, such as c5gateway-vh, on which you should be able to sudo -i -u hdfs and find the code in /home/hdfs/analytics/ (run git pull to update it)
  • Make sure Hadoop libraries and binaries (e.g. hive) are on your path
  • The snapshot name will be the date in YYYYMMDD form, e.g. 20140923.
  • Create a new "raw" table from the HDFS table using hive/import/hdfs/create_new_snapshot.sh. Pass in the snapshot database, snapshot name, source Hive database and source Hive table, e.g.:
cd hive/import/hdfs; ./create_new_snapshot.sh snapshot $(date +%Y%m%d) prod_h occurrence
  • Tell Matt he can run the backup script, which exports these snapshots to external storage.
  • Add the new snapshot name to the hdfs_v1_snapshots array in the hive/normalize/build_raw_scripts.sh script. If the HDFS schema has changed you'll have to add a new array (e.g. hdfs_v2_snapshots) and add logic (another loop) at the bottom of the script to process it.
  • Add the new snapshot name to hive/normalize/create_occurrence_tables.sh in the same way as above.
  • Add the new snapshot name to hive/process/build_prepare_script.sh in the same way as above.
  • Replace the last element of temporalFacetSnapshots in R/graph/utils.R with your new snapshot, following the formatting in use, e.g. 2015-01-19. (A quick grep check of these edits is sketched after this list.)
  • Make sure the EPSG version used in the latest occurrence project pom.xml matches the one fetched by hive/normalize/create_tmp_interp_tables.sh. Check the geotools.version in the pom.xml (hopefully still at https://github.com/gbif/occurrence/blob/master/pom.xml); it should match what is in the shell script (at the time of writing geotools.version was 20.5 and the script line was curl -L 'http://download.osgeo.org/webdav/geotools/org/geotools/gt-epsg-hsql/20.5/gt-epsg-hsql-20.5.jar' > /tmp/gt-epsg-hsql.jar).
  • Set up additional geocode services (e.g. using UAT or Dev, or duplicates running in prod). There need to be as many backend connections available as there will be tasks running in YARN.
  • From the root (analytics) directory you can now run the build.sh script, which runs all the HBase and Hive table building, builds the master CSV files (in turn processed down to per-country/region CSVs and GeoTIFFs), and then generates the maps and figures needed for the website and the country reports. Note that this will take up to 48 hours and is unfortunately error prone, so all steps can also be run individually. In any case it's probably best to run all parts of this script on a machine in the secretariat, ideally in a "screen" session. To run it all do:
screen -L -S analytics
./build.sh -interpretSnapshots -summarizeSnapshots -downloadCsvs -processCsvs -makeFigures

(Detach from the screen with "^A d", reattach with screen -x.)

  • rsync the CSVs, GeoTIFFs, figures and maps to [email protected]:/var/www/html/analytics-files/ and check the result (this server is also used for gbif-dev.org; the -n flag makes these dry runs):
rsync -avn report/ [email protected]:/var/www/html/analytics-files/
rsync -avn registry-report/ [email protected]:/var/www/html/analytics-files/registry/
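
A quick sanity check of the bookkeeping edits above, run from the analytics root, is sketched here (it assumes the snapshot is named from today's date; adjust if not):

# Hedged check: confirm the new snapshot name was added to each script edited above.
SNAPSHOT=$(date +%Y%m%d)            # e.g. 20140923
SNAPSHOT_DASHED=$(date +%Y-%m-%d)   # the 2015-01-19 style used in R/graph/utils.R
grep -c "$SNAPSHOT" hive/normalize/build_raw_scripts.sh \
                    hive/normalize/create_occurrence_tables.sh \
                    hive/process/build_prepare_script.sh
grep -c "$SNAPSHOT_DASHED" R/graph/utils.R

# Hedged check: compare the geotools.version in the occurrence pom.xml with the
# gt-epsg-hsql version fetched by hive/normalize/create_tmp_interp_tables.sh.
curl -s https://raw.githubusercontent.com/gbif/occurrence/master/pom.xml | grep '<geotools.version>'
grep 'gt-epsg-hsql' hive/normalize/create_tmp_interp_tables.sh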

Steps to build country reports after the R part is done

Steps to deploy to production

  • rsync the CSVs, GeoTIFFs, figures and maps to [email protected]:/var/www/html/analytics-files/:
rsync -avn report/ [email protected]:/var/www/html/analytics-files/
rsync -avn registry-report/ [email protected]:/var/www/html/analytics-files/registry/
  • rsync the reports to [email protected]:/var/www/html/analytics-files/:
rsync -av country-report/ [email protected]:/var/www/html/analytics-files/country/
  • Check https://www.gbif.org/analytics, write an email to [email protected] giving a heads-up on the new data, and accept the many accolades due your outstanding achievement in the field of excellence!
  • Archive the new analytics. The old analytics files have been used several times by the communications team:
cd /var/www/html/
tar -cvJf /mnt/auto/analytics/archives/gbif_analytics_$(date +%Y-%m-01).tar.xz --exclude favicon.ico --exclude '*.pdf' analytics-files/[a-z]*
# or at the start of the year, when the country reports have been generated:
tar -cvJf /mnt/auto/analytics/archives/gbif_analytics_$(date +%Y-%m-01).tar.xz --exclude favicon.ico analytics-files/[a-z]*

Then upload this file to Box.

  • Copy only the CSVs and GeoTIFFs to the public, web archive:
rsync -rtv /var/www/html/analytics-files/[a-z]* /mnt/auto/analytics/files/$(date +%Y-%m-01) --exclude figure --exclude map --exclude '*.pdf' --exclude favicon.ico
cd /var/www/html/analytics-files
ln -s /mnt/auto/analytics/files/$(date +%Y-%m-01) .

Acknowledgements

The work presented here is not new, and builds on ideas already published. In particular the work of Javier Otegui, Arturo H. Ariño, María A. Encinas, Francisco Pando (https://doi.org/10.1371/journal.pone.0055144) was used as inspiration during the first development iteration, and Javier Otegui kindly provided a crash course in R to kickstart the development.


Contributors

fmendezh, jlegind, kcopas, mattblissett, mdoering, omeyn, timrobertson100


Issues

Generate SVG graphs

In comparison to the new occurrence search charts, these PNGs look a bit fuzzy when scaled.

Possibly this is as simple as adding device="svg" to utils.R (or reworking to keep the PNGs as well as the SVGs).

Alternatively, SVGs could be made in the portal from the CSVs, but it's probably better to keep all the logic in the R scripts.

Date interpretation depends on the current date, not the snapshot date

Collection/event date interpretation has an upper bound of tomorrow (to take account of timezones) during normal interpretation: https://github.com/gbif/occurrence/blob/occurrence-0.154/occurrence-processor/src/main/java/org/gbif/occurrence/processor/interpreting/TemporalInterpreter.java#L65

"Tomorrow" is still used when reinterpreting snapshots. The snapshot from 2013-12-11 has this data:

Raw (year month day)    As reinterpreted on 2021-01-15 (year month day)
2021    1       1       2021    1       1
2021    1       7       2021    1       7
2021    1       9       2021    1       9
2021    1       14      2021    1       14
2021    1       18      NULL    NULL    NULL
2021    1       27      NULL    NULL    NULL
2021    1       27      NULL    NULL    NULL
2021    1       29      NULL    NULL    NULL
2021    1       30      NULL    NULL    NULL

These dates are all bad (the occurrences cannot have been collected/observed 8 years in the future), but we will reinterpret them as correct in the April analytics run.

This is not a unique case; most (all?) snapshots have at least one occurrence with an event year of 2021 (columns: snapshot date, event year, record count):

"2007-12-19",2021,1
"2008-04-01",2021,1
"2008-06-27",2021,1
"2008-10-10",2021,1
"2008-12-17",2021,1
"2009-04-06",2021,1
"2009-06-17",2021,1
"2009-09-25",2021,1
"2009-12-16",2021,1
"2010-04-01",2021,1
"2010-07-26",2021,1
"2010-11-17",2021,1
"2011-02-21",2021,2
"2011-06-10",2021,1
"2011-09-05",2021,1
"2012-01-18",2021,1
"2012-03-26",2021,1
"2012-07-13",2021,1
"2012-10-31",2021,144
"2012-12-11",2021,148
"2013-02-20",2021,143
"2013-05-21",2021,14
"2013-07-09",2021,12
"2013-09-10",2021,12
"2013-12-20",2021,12
"2014-03-28",2021,12
"2014-09-08",2021,12
"2015-01-19",2021,12
"2015-04-09",2021,12
"2015-07-03",2021,3
"2015-10-01",2021,3
"2016-01-04",2021,3
"2016-04-05",2021,2
"2016-07-04",2021,3
"2016-10-07",2021,2
"2016-12-27",2021,2
"2017-04-12",2021,3
"2017-07-24",2021,3
"2017-10-12",2021,14
"2017-12-22",2021,14
"2018-04-09",2021,23
"2018-07-11",2021,23
"2018-09-28",2021,20
"2019-01-01",2021,20
"2019-04-06",2021,23
"2019-07-01",2021,22
"2019-10-09",2021,22
"2020-01-01",2021,14
"2020-04-01",2021,15
"2020-07-01",2021,11
"2020-10-01",2021,11
"2021-01-01",2021,120

This prevents rerunning the analytics and getting the same figures.

Fix shading/alignment on Table 4

Comment by Mélianie Raymond, 9 Feb 2016

The shading on the table on the last page has gone out of alignment – maybe as a result of countries having gone up or down in their placement?

vega-lite could be a promising direction for this page

vega-lite
https://vega.github.io/vega-lite/

I made this graph directly using a csv provided by the analytics output:
occ_kingdom_basisOfRecord.csv

https://blockbuilder.org/jhnwllr/c9461b61f2dcd938d38909bab91c5344

All that was necessary was this JSON specification file:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "A simple bar chart with embedded data.",
  "width": 400,
  "height": 300,
  "data": {
    "url": "https://gist.githubusercontent.com/jhnwllr/f9eee906345f1cb53b8931347d57fa22/raw/2198298239da88b8db40aa98cc0a7ce47613f4d6/occ_kingdom_basisOfRecord.csv",
    "format": {"type": "csv"}
  },
  "mark": "bar",
  "encoding": {
    "x": {"field": "snapshot", "type": "ordinal"},
    "y": {"field": "occurrenceCount", "type": "quantitative"},
    "tooltip": {"field": "basisOfRecord", "type": "nominal"},
    "color": {
      "field": "basisOfRecord",
      "type": "nominal",
      "scale": {
        "domain": [
          "PRESERVED_SPECIMEN",
          "UNKNOWN",
          "FOSSIL_SPECIMEN",
          "LIVING_SPECIMEN",
          "HUMAN_OBSERVATION",
          "OBSERVATION",
          "LITERATURE",
          "MACHINE_OBSERVATION",
          "MATERIAL_SAMPLE"
        ],
        "range": ["#c00000", "#ff4f00", "#fcff00", "#0c8900", "#0085ff", "#aab8ab", "#5da9b7"]
      }
    },
    "order": {"aggregate": "sum", "field": "occurrenceCount", "type": "quantitative"}
  }
}

Enrich species counts in analytics

From Christoph H.:

It would be helpful in country reports to have metrics on the number of species having occurrences for my country

We do track a metric per kingdom per country, but presumably the desire is to allow, e.g., a browsable taxonomy with counts down to the family level or so.

This may be something for the analytics if we want to track that over time (seems useful to demonstrate growth in coverage) or perhaps better for the country pages.

Edited to add: example of the metrics we track (from the Germany country page):

[screenshot: per-kingdom metrics table from the Germany country page]

GBIF Sweden gap analysis - enable analysis of latency in data reporting

I wonder what would be the best way to explore the latency with which data is reported. If I plot, for example, the observations of grey squirrel in the UK for recent years, it seems likely that either the data for the last two years is still incomplete, monitoring intensity has decreased, or the grey squirrel population has decreased dramatically. I am pretty sure it is the first, and it would be an interesting angle to explore for other species as well.

Is there something like a "firstUploaded" field in the dataset?

Adapt analytics to use the newly added multivalue fields in pipelines

The issue gbif/pipelines#665 brought some new interpreted fields and changed the typeStatus from string to array.

Some of the newly added fields were previously used as strings because they were carried over from the verbatim values, but now they are interpreted fields in the basic record.

You can see the changes done in the avro schemas here.

The analytics need to be adapted to these changes, either to use the arrays directly or to convert them into strings (one option is sketched below).
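
A minimal sketch of the string-conversion option, assuming the Hive CLI is available and that the snapshot occurrence table has a typestatus array column (database, table and column names here are illustrative, not confirmed from the new schemas):

# Hedged sketch: flatten an array column to a single delimited string with Hive's concat_ws.
hive -e "SELECT gbifid, concat_ws(';', typestatus) AS typestatus FROM snapshot.occurrence LIMIT 10"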

Add regional rollups to the analytics

The current reports run at global and country level.

Since GBIF runs regional strategies and holds regional meetings, for which we regularly run ad hoc analytics, we should aggregate the country counts to regional level based on the GBIF regions defined in the directory and available in the GBIF country enum API.

The result will be regional roll-ups of the existing country-level charts.
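
A hedged sketch of pulling the country-to-region mapping from the public API (it assumes the enumeration response items expose iso2 and gbifRegion fields, and that jq is installed):

# Hedged sketch: dump an ISO-3166 country code -> GBIF region mapping as TSV.
curl -s https://api.gbif.org/v1/enumeration/country | jq -r '.[] | [.iso2, .gbifRegion] | @tsv'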

Move simple growth and usage stats to this project

For some time @jlegind has privately maintained python scripts capturing basic statistics on https://jlegind.github.io/

These scripts should be moved into this project and executed on the same schedule as the analytics run, with the output stored in https://analytics-files.gbif.org, noting that the directory structure for that is to be changed.

I suggest that in exploring this, we consider the feasibility of porting this to the Hive/R approach that this project uses to keep the implementation simple. If that is not practical, then we should introduce the python dependencies.

Fix x-axis labels on Figure 3

Comment from Siro Masinde, 9 Feb 2016

Access and Usage, Fig. 3: on the x-axis you have Jan, Feb, but one would expect the labelling to go on to August and then December, and also to see clear vertical lines at the Aug and Dec positions.

Indicate 'insufficient data' for blank graphs

Comment from Siro Masinde, 9 Feb 2016

Is there a way of indicating “none” in spaces that lack data to generate a graph? Especially under data mobilization, Figures 11 and 12: when there is no data and the spaces are empty, it starts looking as if there is an error/bug. Another way to clarify this may be to add a note somewhere in the legend.

Refactor for non-MR environment

The current scripts expect the Hive shell on a Cloudera gateway machine which currently launches MapReduce jobs on Yarn.
That won't be available in the K8s environment.

I foresee we could do one of these - there may be more options:

  • Explore the state of Hive on Spark (we gather that may have been removed)
  • Explore using Trino as the execution engine
  • Port it to Spark SQL

My intuition is that a Trino solution is the least invasive (perhaps we can replace hive -f ... with trino -f ...), and is likely the best option. It will likely require some UDF changes.
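
A minimal sketch of the least-invasive option, assuming the Trino CLI is configured against a Hive-compatible catalog (the server, catalog, schema and script path below are illustrative):

# Hedged sketch: run an existing script through the Trino CLI instead of hive -f.
# UDFs and Hive-specific syntax may still need porting.
trino --server http://trino-coordinator:8080 --catalog hive --schema snapshot -f hive/process/some_script.q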
