
The analytics scripts used to calculate and generate all time series (etc.) graphs

Home Page: https://www.gbif.org/analytics/global

License: Apache License 2.0


Analytics

This is the source repository for the site https://www.gbif.org/analytics.


What are the analytics?

GBIF captures various metrics to enable monitoring of data trends.

The development is being done in an open manner, to enable others to verify procedures, contribute, or fork the project for their own purposes. The results are visible at https://www.gbif.org/analytics/global and show global and country-specific charts illustrating the changes observed in the GBIF index since 2007.

Please note that all samples of the index have been reprocessed with consistent quality control and against the same taxonomic backbone to enable comparisons over time. This is the first time such an analysis has been possible, thanks to the adoption of the Hadoop environment at GBIF, which enables large-scale analysis. In total, approximately 32 billion records (to January 2021) are analysed for these reports.

Project structure

The project is divided into several parts:

  • Hive and Sqoop scripts which are responsible for importing historical data from archived MySQL database dumps
  • Hive scripts that snapshotted data from the message-based real-time indexing system which served GBIF between late 2013 and Q3 2019
  • Hive scripts that snapshot recent data from the latest GBIF infrastructure (the real time indexing system currently serving GBIF)
  • Hive scripts that process all data to the same quality control and taxonomic backbone
  • Hive scripts that digest the data into specific views suitable for download from Hadoop and further processing
  • R and Python scripts that process the data into views per country
  • R and Python scripts that produce the static charts for each country

Setup

These steps are required for a new environment. It is probably easiest to use the Docker image.

  • Install the yum packages R, cairo and cairo-devel
  • Run Rscript R/install-packages.R (it may be necessary to set the R_LIBS_USER environment variable); a sketch of both steps is shown below
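
A minimal sketch of the setup, assuming a yum-based host and a user-writable R library path (the R_LIBS_USER path below is illustrative):

sudo yum install -y R cairo cairo-devel
export R_LIBS_USER="$HOME/R/library"   # illustrative path; only needed if the default library is not writable
mkdir -p "$R_LIBS_USER"
Rscript R/install-packages.R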

Steps for adding a new snapshot and then re-running the processing

  • This will only work on a Cloudera Manager managed gateway, such as c5gateway-vh, on which you should be able to sudo -i -u hdfs and find the code in /home/hdfs/analytics/ (run git pull to update it)
  • Make sure Hadoop libraries and binaries (e.g. hive) are on your path
  • The snapshot name will be the date in YYYYMMDD form, e.g. 20140923.
  • Create a new "raw" table from the HDFS table using hive/import/hdfs/create_new_snapshot.sh. Pass in the snapshot database, snapshot name, source Hive database and source Hive table, e.g.:
cd hive/import/hdfs; ./create_new_snapshot.sh snapshot $(date +%Y%m%d) prod_h occurrence
  • Tell Matt he can run the backup script, which exports these snapshots to external storage.
  • Add the new snapshot name to the hdfs_v1_snapshots array in the hive/normalize/build_raw_scripts.sh script. If the HDFS schema has changed you'll have to add a new array (e.g. hdfs_v2_snapshots) and add logic (another loop) at the bottom of the script to process it.
  • Add the new snapshot name to hive/normalize/create_occurrence_tables.sh in the same way as above.
  • Add the new snapshot name to hive/process/build_prepare_script.sh in the same way as above.
  • Replace the last element of temporalFacetSnapshots in R/graph/utils.R with your new snapshot, following the formatting in use, e.g. 2015-01-19. (A quick grep check of these edits is sketched after this list.)
  • Make sure the EPSG version used in the latest occurrence project pom.xml matches the one fetched by hive/normalize/create_tmp_interp_tables.sh. Check the geotools.version in the pom.xml (hopefully still at https://github.com/gbif/occurrence/blob/master/pom.xml); it should match what is in the shell script (at the time of writing geotools.version was 20.5 and the script line was curl -L 'http://download.osgeo.org/webdav/geotools/org/geotools/gt-epsg-hsql/20.5/gt-epsg-hsql-20.5.jar' > /tmp/gt-epsg-hsql.jar).
  • Set up additional geocode services (e.g. using UAT or Dev, or duplicates running in prod). There need to be as many backend connections available as there will be tasks running in YARN.
  • From the root (analytics) directory you can now run the build.sh script, which runs all the HBase and Hive table building, builds the master CSV files (in turn processed down to per-country/region CSVs and GeoTIFFs), and then generates the maps and figures needed for the website and the country reports. Note that this will take up to 48 hours and is unfortunately error prone, so all steps can also be run individually. In any case it's probably best to run all parts of this script on a machine in the secretariat, ideally in a "screen" session. To run it all do:
screen -L -S analytics
./build.sh -interpretSnapshots -summarizeSnapshots -downloadCsvs -processCsvs -makeFigures

(Detach from the screen with "^A d", reattach with screen -x.)

  • rsync the CSVs, GeoTIFFs, figures and maps to [email protected]:/var/www/html/analytics-files/ and check the result (this server is also used for gbif-dev.org; the -n flag makes these dry runs):
rsync -avn report/ [email protected]:/var/www/html/analytics-files/
rsync -avn registry-report/ [email protected]:/var/www/html/analytics-files/registry/
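
A quick sanity check of the bookkeeping edits above, run from the analytics root, is sketched here (it assumes the snapshot is named from today's date; adjust if not):

# Hedged check: confirm the new snapshot name was added to each script edited above.
SNAPSHOT=$(date +%Y%m%d)            # e.g. 20140923
SNAPSHOT_DASHED=$(date +%Y-%m-%d)   # the 2015-01-19 style used in R/graph/utils.R
grep -c "$SNAPSHOT" hive/normalize/build_raw_scripts.sh \
                    hive/normalize/create_occurrence_tables.sh \
                    hive/process/build_prepare_script.sh
grep -c "$SNAPSHOT_DASHED" R/graph/utils.R

# Hedged check: compare the geotools.version in the occurrence pom.xml with the
# gt-epsg-hsql version fetched by hive/normalize/create_tmp_interp_tables.sh.
curl -s https://raw.githubusercontent.com/gbif/occurrence/master/pom.xml | grep '<geotools.version>'
grep 'gt-epsg-hsql' hive/normalize/create_tmp_interp_tables.sh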

Steps to build country reports after the R part is done

Steps to deploy to production

  • rsync the CSVs, GeoTIFFs, figures and maps to [email protected]:/var/www/html/analytics-files/:
rsync -avn report/ [email protected]:/var/www/html/analytics-files/
rsync -avn registry-report/ [email protected]:/var/www/html/analytics-files/registry/
  • rsync the reports to [email protected]:/var/www/html/analytics-files/:
rsync -av country-report/ [email protected]:/var/www/html/analytics-files/country/
  • Check https://www.gbif.org/analytics, write an email to [email protected] giving a heads-up on the new data, and accept the many accolades due your outstanding achievement in the field of excellence!
  • Archive the new analytics. The old analytics files have been used several times by the communications team:
cd /var/www/html/
tar -cvJf /mnt/auto/analytics/archives/gbif_analytics_$(date +%Y-%m-01).tar.xz --exclude favicon.ico --exclude '*.pdf' analytics-files/[a-z]*
# or at the start of the year, when the country reports have been generated:
tar -cvJf /mnt/auto/analytics/archives/gbif_analytics_$(date +%Y-%m-01).tar.xz --exclude favicon.ico analytics-files/[a-z]*

Then upload this file to Box.

  • Copy only the CSVs and GeoTIFFs to the public, web archive:
rsync -rtv /var/www/html/analytics-files/[a-z]* /mnt/auto/analytics/files/$(date +%Y-%m-01) --exclude figure --exclude map --exclude '*.pdf' --exclude favicon.ico
cd /var/www/html/analytics-files
ln -s /mnt/auto/analytics/files/$(date +%Y-%m-01) .

Acknowledgements

The work presented here is not new, and builds on ideas already published. In particular the work of Javier Otegui, Arturo H. Ariño, María A. Encinas, Francisco Pando (https://doi.org/10.1371/journal.pone.0055144) was used as inspiration during the first development iteration, and Javier Otegui kindly provided a crash course in R to kickstart the development.


Contributors

fmendezh, jlegind, kcopas, mattblissett, mdoering, omeyn, timrobertson100


Issues

Generate SVG graphs

In comparison to the new occurrence search charts, these PNGs look a bit fuzzy when scaled.

Possibly this is as simple as adding device="svg" to utils.R (or reworking to keep the PNGs as well as the SVGs).

Alternatively, SVGs could be made in the portal from the CSVs, but it's probably better to keep all the logic in the R scripts.

Date interpretation depends on the current date, not the snapshot date

Collection/event date interpretation has an upper bound of tomorrow (to take account of timezones) during normal interpretation: https://github.com/gbif/occurrence/blob/occurrence-0.154/occurrence-processor/src/main/java/org/gbif/occurrence/processor/interpreting/TemporalInterpreter.java#L65

"Tomorrow" is still used when reinterpreting snapshots. The snapshot from 2013-12-11 has this data:

Raw (year month day)    As reinterpreted on 2021-01-15 (year month day)
2021    1       1       2021    1       1
2021    1       7       2021    1       7
2021    1       9       2021    1       9
2021    1       14      2021    1       14
2021    1       18      NULL    NULL    NULL
2021    1       27      NULL    NULL    NULL
2021    1       27      NULL    NULL    NULL
2021    1       29      NULL    NULL    NULL
2021    1       30      NULL    NULL    NULL

These dates are all bad (the occurrences cannot have been collected/observed 8 years in the future), but we will reinterpret them as correct in the April analytics run.

This is not a unique case; most (all?) snapshots have at least one occurrence with an event year of 2021 (columns: snapshot date, event year, record count):

"2007-12-19",2021,1
"2008-04-01",2021,1
"2008-06-27",2021,1
"2008-10-10",2021,1
"2008-12-17",2021,1
"2009-04-06",2021,1
"2009-06-17",2021,1
"2009-09-25",2021,1
"2009-12-16",2021,1
"2010-04-01",2021,1
"2010-07-26",2021,1
"2010-11-17",2021,1
"2011-02-21",2021,2
"2011-06-10",2021,1
"2011-09-05",2021,1
"2012-01-18",2021,1
"2012-03-26",2021,1
"2012-07-13",2021,1
"2012-10-31",2021,144
"2012-12-11",2021,148
"2013-02-20",2021,143
"2013-05-21",2021,14
"2013-07-09",2021,12
"2013-09-10",2021,12
"2013-12-20",2021,12
"2014-03-28",2021,12
"2014-09-08",2021,12
"2015-01-19",2021,12
"2015-04-09",2021,12
"2015-07-03",2021,3
"2015-10-01",2021,3
"2016-01-04",2021,3
"2016-04-05",2021,2
"2016-07-04",2021,3
"2016-10-07",2021,2
"2016-12-27",2021,2
"2017-04-12",2021,3
"2017-07-24",2021,3
"2017-10-12",2021,14
"2017-12-22",2021,14
"2018-04-09",2021,23
"2018-07-11",2021,23
"2018-09-28",2021,20
"2019-01-01",2021,20
"2019-04-06",2021,23
"2019-07-01",2021,22
"2019-10-09",2021,22
"2020-01-01",2021,14
"2020-04-01",2021,15
"2020-07-01",2021,11
"2020-10-01",2021,11
"2021-01-01",2021,120

This prevents rerunning the analytics and getting the same figures.

Fix shading/alignment on Table 4

Comment by Mélianie Raymond, 9 Feb 2016

The shading on the table on the last page has gone out of alignment – maybe as a result of countries having gone up or down in their placement?

vega-lite could be a promising direction for this page

vega-lite
https://vega.github.io/vega-lite/

I made this graph directly using a csv provided by the analytics output:
occ_kingdom_basisOfRecord.csv

https://blockbuilder.org/jhnwllr/c9461b61f2dcd938d38909bab91c5344

All that was necessary was this JSON specification file:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "A simple bar chart with embedded data.",
  "width": 400,
  "height": 300,
  "data": {
    "url": "https://gist.githubusercontent.com/jhnwllr/f9eee906345f1cb53b8931347d57fa22/raw/2198298239da88b8db40aa98cc0a7ce47613f4d6/occ_kingdom_basisOfRecord.csv",
    "format": {"type": "csv"}
  },
  "mark": "bar",
  "encoding": {
    "x": {"field": "snapshot", "type": "ordinal"},
    "y": {"field": "occurrenceCount", "type": "quantitative"},
    "tooltip": {"field": "basisOfRecord", "type": "nominal"},
    "color": {
      "field": "basisOfRecord",
      "type": "nominal",
      "scale": {
        "domain": [
          "PRESERVED_SPECIMEN",
          "UNKNOWN",
          "FOSSIL_SPECIMEN",
          "LIVING_SPECIMEN",
          "HUMAN_OBSERVATION",
          "OBSERVATION",
          "LITERATURE",
          "MACHINE_OBSERVATION",
          "MATERIAL_SAMPLE"
        ],
        "range": ["#c00000", "#ff4f00", "#fcff00", "#0c8900", "#0085ff", "#aab8ab", "#5da9b7"]
      }
    },
    "order": {"aggregate": "sum", "field": "occurrenceCount", "type": "quantitative"}
  }
}

Enrich species counts in analytics

From Christoph H.:

It would be helpful in country reports to have metrics on the number of species having occurrences for my country

We do track a metric per kingdom per country, but presumably the desire is to allow, e.g., a browsable taxonomy with counts down to the family level or so.

This may be something for the analytics if we want to track that over time (seems useful to demonstrate growth in coverage) or perhaps better for the country pages.

Edited to add: example of the metrics we track (from the Germany country page):

[screenshot: per-kingdom metrics table from the Germany country page]

GBIF Sweden gap analysis - enable analysis of latency in data reporting

I wonder what would be the best way to explore the latency with which data is reported. If I plot, for example, the observations of grey squirrel in the UK for recent years, it seems likely that either the data for the last two years is still incomplete, monitoring intensity has decreased, or the grey squirrel population has decreased dramatically. I am pretty sure it is the first, and it would be an interesting angle to explore for other species as well.

Is there something like a "firstUploaded" field in the dataset?

Adapt analytics to use the newly added multivalue fields in pipelines

The issue gbif/pipelines#665 brought some new interpreted fields and changed the typeStatus from string to array.

Some of the newly added fields were previously used as strings because they were carried over from the verbatim values, but now they are interpreted fields in the basic record.

You can see the changes done in the avro schemas here.

The analytics need to be adapted to these changes, either to use the arrays directly or to convert them into strings (one option is sketched below).
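
A minimal sketch of the string-conversion option, assuming the Hive CLI is available and that the snapshot occurrence table has a typestatus array column (database, table and column names here are illustrative, not confirmed from the new schemas):

# Hedged sketch: flatten an array column to a single delimited string with Hive's concat_ws.
hive -e "SELECT gbifid, concat_ws(';', typestatus) AS typestatus FROM snapshot.occurrence LIMIT 10"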

Add regional rollups to the analytics

The current reports run at global and country level.

Since GBIF runs regional strategies and holds regional meetings, for which we regularly run ad hoc analytics, we should aggregate the country counts to regional level based on the GBIF regions defined in the directory and available in the GBIF country enum API.

The result will be regional roll-ups of the existing country-level charts.
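
A hedged sketch of pulling the country-to-region mapping from the public API (it assumes the enumeration response items expose iso2 and gbifRegion fields, and that jq is installed):

# Hedged sketch: dump an ISO-3166 country code -> GBIF region mapping as TSV.
curl -s https://api.gbif.org/v1/enumeration/country | jq -r '.[] | [.iso2, .gbifRegion] | @tsv'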

Move simple growth and usage stats to this project

For some time @jlegind has privately maintained python scripts capturing basic statistics on https://jlegind.github.io/

These scripts should be moved into this project and executed on the same schedule as the analytics run, with the output stored in https://analytics-files.gbif.org, noting that the directory structure for that is to be changed.

I suggest that in exploring this, we consider the feasibility of porting this to the Hive/R approach that this project uses to keep the implementation simple. If that is not practical, then we should introduce the python dependencies.

Fix x-axis labels on Figure 3

Comment from Siro Masinde, 9 Feb 2016

Access and Usage, Fig. 3: on the x-axis you have Jan, Feb, but one would expect the labelling to go on to August and then December, and also to see clear vertical lines at the Aug and Dec positions.

Indicate 'insufficient data' for blank graphs

Comment from Siro Masinde, 9 Feb 2016

Is there a way of indicating “none” in spaces that lack data to generate a graph? Especially under data mobilization, Figures 11 and 12: when there is no data and the spaces are empty, it starts looking as if there is an error/bug. Another way to clarify this may be to add a note somewhere in the legend.

Refactor for non-MR environment

The current scripts expect the Hive shell on a Cloudera gateway machine which currently launches MapReduce jobs on Yarn.
That won't be available in the K8s environment.

I foresee we could do one of these - there may be more options:

  • Explore the state of Hive on Spark (we gather that may have been removed)
  • Explore using Trino as the execution engine
  • Port it to Spark SQL

My intuition is that a Trino solution is the least invasive (perhaps we can replace hive -f ... with trino -f ...), and is likely the best option. It will likely require some UDF changes.
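
A minimal sketch of the least-invasive option, assuming the Trino CLI is configured against a Hive-compatible catalog (the server, catalog, schema and script path below are illustrative):

# Hedged sketch: run an existing script through the Trino CLI instead of hive -f.
# UDFs and Hive-specific syntax may still need porting.
trino --server http://trino-coordinator:8080 --catalog hive --schema snapshot -f hive/process/some_script.q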
