datafable / gbif-dataset-metrics

Get insights in GBIF-mediated datasets with charts and metrics.

Home Page: https://chrome.google.com/webstore/detail/gbif-dataset-metrics/kcianglkepodpjdiebgidhdghoaeefba

License: MIT License

Python 66.75% CSS 5.82% JavaScript 22.33% HTML 5.11%

gbif-dataset-metrics's Introduction

GBIF dataset metrics

Rationale

The Global Biodiversity Information Facility (GBIF) facilitates access to over 13,233 species occurrence datasets, collectively holding more than 570 million records. GBIF dataset pages are important access points to GBIF-mediated data (e.g. via DOIs) and currently show dataset metadata, a map of georeferenced occurrences, some basic statistics, and a paged table of download events. If a user wants to know more about the occurrences a dataset contains, he/she has to filter/page through a table of occurrences or download the data. Neither is a convenient way to get quick insights or assess fitness for use.

Result

For the 2015 GBIF Ebbe Nielsen challenge, we developed a proof of concept for enhancing GBIF dataset pages with aggregated occurrence metrics. These metrics are visualized as stacked bar charts - showing the occurrence distribution for basis of record, coordinates, multimedia, and taxa matched with the GBIF backbone - as well as an interactive taxonomy partition and a recent downloads chart. Metrics that score particularly well are highlighted as achievements. Collectively these features not only inform the user what a dataset contains and if it is fit for use, but also help data publishers discover what aspects could be improved.

Screenshot

The proof of concept consists of two parts: 1) an extraction and aggregation module to process GBIF occurrence downloads and calculate, aggregate, and store the metrics for each dataset and 2) a Google Chrome extension, allowing you to view these metrics in context on the GBIF website.

For the 2015 GBIF Ebbe Nielsen Challenge - Round 2, we added a sample of the images referenced in (the occurrences of) a dataset. Together with the multimedia bar and achievement, it highlights the currently undervalued multimedia richness of some datasets. We also improved our extraction and aggregation module to process all GBIF occurrences on the Amazon EC2 infrastructure and are now able to provide metrics for all GBIF occurrence datasets. We strongly believe, however, that the functionality of our proof of concept - if considered useful - should be implemented on the GBIF infrastructure. For our motivation on this, including its challenges and opportunities, see our feedback to the jury comments.

Installation

Install the Google Chrome Extension and visit a GBIF dataset page.

How it works


Limitations

  • The metrics are processed using a download of all occurrences on September 1, 2015. It contains 13,221 occurrence datasets, covering 570,238,726 occurrences. If a dataset has been published or republished since then, it respectively won't have metrics or those might be out of date. If so, a message will be shown on the dataset page. If you want us to reprocess a specific dataset, submit an issue.

Follow @Datafable to be notified of new metrics or improvements.

Contributors

Developed by Datafable: bartaelterman, niconoe, peterdesmet.

License

LICENSE

gbif-dataset-metrics's People

Contributors

bartaelterman, niconoe, peterdesmet


gbif-dataset-metrics's Issues

Good fossil specimen datasets

Search

  1. Searched and downloaded all records with basisOfRecord=FOSSIL_SPECIMEN (see the API sketch after this list)
  2. Created a list of datasets for which some of the geologicalContext fields are populated
  3. Indicated for each field how many occurrences have it populated.
  4. This resulted in 96 datasets, only 18 of which have geologicalContext info.
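A minimal sketch of step 1, assuming the public GBIF occurrence search API and its facet parameters are used (the actual analysis was based on a full occurrence download):

import requests

API = "https://api.gbif.org/v1/occurrence/search"

def fossil_specimen_count():
    """Total number of records with basisOfRecord=FOSSIL_SPECIMEN."""
    response = requests.get(API, params={"basisOfRecord": "FOSSIL_SPECIMEN", "limit": 0})
    response.raise_for_status()
    return response.json()["count"]

def fossil_specimen_datasets(facet_limit=100):
    """Per-dataset record counts via the datasetKey facet (first facet_limit datasets)."""
    params = {
        "basisOfRecord": "FOSSIL_SPECIMEN",
        "limit": 0,
        "facet": "datasetKey",
        "facetLimit": facet_limit,
    }
    response = requests.get(API, params=params)
    response.raise_for_status()
    counts = response.json()["facets"][0]["counts"]
    return {c["name"]: c["count"] for c in counts}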

Result

datasetKey remark eon era period epoch age
58d0f326-2e85-4d0a-a744-571461220f00 Animals excl., Devonian to recent 120159 / 120159 214661 / 214661 208114 / 208114 345 / 345
854bc2fc-f762-11e1-a439-00145eb45e9a Mixed taxa, full spectrum, levels missing 291896 / 27417 / 71292 /
9654e4d4-f762-11e1-a439-00145eb45e9a Animals, describes spectrum 144074 / 32143 / 20139 /
5f1568db-5d5b-4597-9c66-dfbd0d7ddd7b Animals excl., Mesozoic, Cenozoic 26994 / 26994 26977 / 26977 26857 / 26857
fa87d03e-4959-451d-865f-ff03bb798339 Animals, Cenozoic 27403/ 27403 27403 / 27403 / 3 27381 / 180 19809 / 4049
9643f840-f762-11e1-a439-00145eb45e9a 66448 / 61360 / 21605 /
9642d0e6-f762-11e1-a439-00145eb45e9a 62178 / 41970 / 25615 /
bec73359-a65f-4402-9692-f498ef0e71cb 59175 / 54189 / 5770 /
7a28b1b1-9d4c-4aeb-b239-38865417b5ea 9137/ 9137 9137 / 9137 9137 / 9137 9136 / 2944 1151 / 14
7c67da7f-490e-4e8d-8848-7dc152dd4734 7865/ 7865 7865 / 7865 7817 / 7817 7865 / 236 4147 /
bea28c6b-4282-4e0e-894d-7c65d050ffa9 10001 / 10001 9921 / 9921 8899 / 8899
5fbfc32a-2eee-4bdd-8ca4-e6fca275a7a8 5450/ 944 5427 / 944 5299 / 939 5223 / 907 4935 / 902
bd2feca8-ec39-4480-9dad-e353ab6a506d 12555 / 12555
93082fdb-1a18-40de-a25c-5d1b594a370d 7348/
47881e45-febd-4622-b7a1-6efbce4fd7b3 1306 / 1306 1305 / 1305 866 / 866
d538ab33-a853-471a-8e1d-be808dd7b922 1387 / 1387
81880628-c616-417e-8d1a-d519577c2087
7bb2d451-5ffa-4d58-bc7f-19ea7aecb201 71/ 71 71 / 71 71 / 71 70 / 2 66 / 66

Get coordinate quality categories

Description

For a given dataset, I want to know how many records have coordinates. I also want to know how many of those are useful, have issues, and maybe what their precision is.

Outcome

dataset_key
coordinates_not_provided // Coordinates not provided
coordinates_major_issues // Coordinates with major issues
coordinates_minor_issues // Coordinates with minor issues
coordinates_valid  // Valid coordinates (all in WGS84)

Terms we need

decimalLatitude
decimalLongitude
issue

Questions

  • PRESUMED_SWAPPED_COORDINATE, PRESUMED_NEGATED_LATITUDE, and PRESUMED_NEGATED_LONGITUDE could be flagged to the provider as minor issues, but since GBIF corrects them, the resulting coordinates are quite valuable to the user. Where would you group them?

Process

IF issue CONTAINS (
        COORDINATE_INVALID /* Can appear for invalid verbatim => no decimal coordinates */
        COORDINATE_OUT_OF_RANGE /* Can appear for invalid verbatim => no decimal coordinates */
        ZERO_COORDINATE
        COUNTRY_COORDINATE_MISMATCH
        /* I consider COUNTRY_COORDINATE_MISMATCH as a major issue,
           since it looks like GBIF only applies this when there are no country issues, 
           such as COUNTRY_INVALID */
    )
    THEN category = "coordinates_major_issues"
ELSEIF issue CONTAINS (
        GEODETIC_DATUM_INVALID /* Always followed by GEODETIC_DATUM_ASSUMED_WGS84,
            but it does indicate that the provider wanted to indicate the datum. */
        COORDINATE_REPROJECTION_FAILED /* Then GBIF just uses the original ones */
        COORDINATE_REPROJECTION_SUSPICIOUS /* Indicates successful coordinate reprojection
            according to provided datum, but which results in a datum shift larger 
            than 0.1 decimal degrees.*/
    )
    THEN category = "coordinates_minor_issues"
ELSEIF decimalLatitude = "" OR decimalLongitude = ""
    /* Not sure if we need to test for isNumber(), I think GBIF transforms those already */
    /* Also, this ELSEIF could appear between major and minor issues, as minor issues will always 
        have coordinates. I placed it here to have all issue checking first. */
    THEN category = "coordinates_not_provided"
ELSE category = "coordinates_valid"
    /* This can include issues like:
         GEODETIC_DATUM_ASSUMED_WGS84
         COORDINATE_REPROJECTED
         COORDINATE_ROUNDED (to 5 decimals)
         PRESUMED_SWAPPED_COORDINATE
         PRESUMED_NEGATED_LATITUDE
         PRESUMED_NEGATED_LONGITUDE
       Although these are issues, they are all corrected by GBIF and result in valid WGS84 coordinates. */
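A hedged Python version of the pseudocode above. It assumes each record is a dict of occurrence download columns (decimalLatitude, decimalLongitude, issue) and that the issue column is a semicolon-separated list of issue names; adjust the parsing if the download format differs:

MAJOR_COORDINATE_ISSUES = {
    "COORDINATE_INVALID",
    "COORDINATE_OUT_OF_RANGE",
    "ZERO_COORDINATE",
    "COUNTRY_COORDINATE_MISMATCH",
}
MINOR_COORDINATE_ISSUES = {
    "GEODETIC_DATUM_INVALID",
    "COORDINATE_REPROJECTION_FAILED",
    "COORDINATE_REPROJECTION_SUSPICIOUS",
}

def coordinate_category(record):
    # issue is assumed to be a semicolon-separated string, e.g. "ZERO_COORDINATE;COUNTRY_INVALID"
    issues = set(filter(None, record.get("issue", "").split(";")))
    if issues & MAJOR_COORDINATE_ISSUES:
        return "coordinates_major_issues"
    if issues & MINOR_COORDINATE_ISSUES:
        return "coordinates_minor_issues"
    if not record.get("decimalLatitude") or not record.get("decimalLongitude"):
        return "coordinates_not_provided"
    return "coordinates_valid"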

Get media type categories

Description

For a given dataset, I want to know how many records have associated media. I also want to know the types and if there are any issues.

Output

datasetKey
media_not_provided
media_url_invalid
media_audio
media_video
media_image

Terms we need

mediaType
issues

Process

IF issues CONTAINS ( MULTIMEDIA_URI_INVALID )
    /* Most records with this issue have no mediaType, but 890 have mediaType=STILLIMAGE.
       We want usable media, so we check this issue first */
    THEN category = "media_url_invalid"
ELSEIF mediaType CONTAINS ( MOVINGIMAGE )
    /* The mediaType categories are not mutually exclusive: it seems 25 records have more than 1
        (possible via extension). To get mutually exclusive categories, we process them in order. */ 
    THEN category = "media_video"
ELSEIF mediaType CONTAINS ( AUDIO )
    THEN category = "media_audio"
ELSEIF mediaType CONTAINS ( STILLIMAGE )
    THEN category = "media_image"
ELSE category = "media_not_provided"

Time interpretation

Time provided in eventDate

  • Verbatim: 2014-01-08T13:17:36Z
  • Website: Jan 8, 2014 1:17:36 PM Correct
  • API: 2014-01-08T12:17:36.000+0000 Correct
  • Download: ? Haven't tested yet.

Time provided in eventTime

  • Verbatim: 2014-04-18 + 09:45
  • Website: Apr 18, 2014 12:00:00 AM Time not included.
  • API: 2014-04-17T22:00:00.000+0000 Why minus two hours from midnight? The location (Colombia) is 5 hours behind GMT, so that can't be it.
  • Download: 2014-04-18T00:00Z Time not included.

Do more calls for more download data

This download function currently makes only one call to the GBIF API, so if dayRange is high and pageLimit is low, it might not retrieve all downloads. It would be better if more calls could be triggered, as sketched below.
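A sketch of how the paging could work, assuming the downloads for a dataset are listed at the registry endpoint occurrence/download/dataset/{datasetKey} with GBIF's usual limit/offset paging and endOfRecords flag (the extension's actual function is JavaScript; this only illustrates the loop):

import requests

def get_all_downloads(dataset_key, page_limit=20, max_pages=50):
    """Collect download records for a dataset by paging until endOfRecords."""
    url = "https://api.gbif.org/v1/occurrence/download/dataset/%s" % dataset_key
    downloads, offset = [], 0
    for _ in range(max_pages):
        page = requests.get(url, params={"limit": page_limit, "offset": offset}).json()
        downloads.extend(page.get("results", []))
        if page.get("endOfRecords", True):
            break
        offset += page_limit
    return downloads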

Taxonomy visualization: alternatives

Different alternatives for showing the taxonomy. @niconoe, @bartaelterman, preferences?

Zoomable horizontal partition

http://bl.ocks.org/mbostock/1005873

  • + Shows all levels
  • + Compact
  • - No horizontal place for labels (hard to predict)
  • - Labels do not depend on zoom level, but we could make use of something like http://bl.ocks.org/jczaplew/7546689

Zoomable vertical partition

http://mbostock.github.io/d3/talk/20111018/partition.html

  • + Place for labels
  • + Label depends on zoom level
  • + Shows all levels
  • + Can show nodes with no children (e.g. up to genus)
  • - Will not use full width if not all levels are shown, but width could be dynamic

Zoomable circle packing

http://bl.ocks.org/mbostock/7607535

  • + Place for labels (but often a bit hard to see)
  • + Level depends on zoom level
  • + Shows most levels
  • - Requires space
  • - Difficult to assess distribution of size

Zoomable treemap

http://bost.ocks.org/mike/treemap/

  • + Place for labels
  • + Level depends on zoom level
  • - Only shows 2 levels
  • - Difficult to assess distribution of size

Show sample of images

Example dataset

I'm going to use Australia's Virtual Herbarium as an example dataset for testing. datasetKey = 4ce8e3f9-2546-4af1-b28d-e2eadf05dfd4

  • It has multiple issues
  • It has multiple basis of records
  • It has multiple kingdoms
  • It has over 5 million records, so it's a good test for performance

Show taxonomy

@niconoe, @bartaelterman: preference, feedback on colour?

1

Neutral grey, just like the kingdom table. Doesn't really invite clicking.

(screenshot: grey)

2

Text = link colour = interaction. Background = same colour as sometimes used on the home page. Grey border is not the best, as you can't quickly see the smaller groups (as you can in 3 and 4, which use white).

(screenshot: light blue)

3

Background colour = link colour = interaction. Children grey. Too bright?

(screenshot: blue)

4

Idem, but with grey unknowns.

(screenshot: blue with grey unknowns)

If you hover over the children, the cursor changes to a zoom-out symbol, which is what the viz does when you click on a child.

Approach, Architecture and technical proposals

High-level architecture

  1. Crawler: On a regular basis, retrieve data from GBIF, run analyses and fill a database with configurable metrics.
  2. Webservices: Expose the content of the database with a friendly API
  3. Client: consume the webservices and present results in a beautiful way
  4. Packaging: embed the client so GBIF pages are enriched in place (instead of a separate website)

Technology proposal

These are the choices I'd make if I had to implement all of this myself. They were selected for two main reasons: familiarity and fitness for use. I'm open to all criticism since 1) many other tools would work well too, and 2) familiarity is important and related to the person in charge of each module.

I'd also propose to adhere to the KISS principle and avoid jumping on every cool kid's tool before it's clear that their technical benefits outweigh the cost of use (learning curve / hidden complexity / maintenance cost).

As a first step, I think the best solution is to implement it transversally (a minimal working prototype of each component), then iterate on each in parallel. That gives maximum flexibility, giving plenty of opportunities to refine and fine-tune the technological choices and interfaces between modules.

Backend: Crawler + webservices

Django (exposing JSON) + PostgreSQL:

  • Very proven solution.
  • Using Django for the first two modules will provide good facilities and an integrated solution: for example using the ORM and helpers from both the crawler (django commands run by cron) and the webservices.
  • Plenty of available extensions for Django on every topic, including REST/webservices (django-tastypie, django-rest-framework, ...), although I'm not sure they will be needed at all.
  • Super easy to add an admin interface and additional web pages if necessary.

Alternative solution: lighter tools glued together (Flask + external crawler scripts + Postgres + ... )

TODO: design the basis of the data model.
TODO: major question: get data from API or Darwin Core Archive.

Note: if the crawler consumes GBIF web services, I recently developed a (currently tiny, quick and dirty) package to use them. It is currently embedded in another project. To avoid reinventing the wheel, I'd like to take time to extract it into a proper (documented, tested and PyPI-distributed) Python package. Opinions?

Note: if we consume DwCA: python-dwca-reader.

Frontend: Client

D3.js + jQuery + optional client MVC framework

Frontend: Packaging

Chrome extension? Greasemonkey? Both? Additional web pages?

Workload division

  • A critical and urgent question, IMHO.
  • May have an impact on the technological choices.
  • I think we can basically divide it into 4 work packages that follow the architecture/modules. We will need a good rough idea of the interfaces between these 4 modules. We may also have to add one or two "utility" work packages: sysadmin-deployment/project management/...
  • I (Nico) am primarily interested in module 1) Crawler and, if time allows, 2) Webservices
  • I (Nico) am willing to soon create a rough prototype of 1) and 2) that will allow us (after also creating a quick prototype of 3 and 4) to validate the whole dataflow/architecture/technology choices.

Hosting

  • Also a decision that could impact the technological choices.
  • Options: using a server we already have access to / VPS / Cloud-based solution
  • At first look, I prefer the VPS solution: it's cheap, we are totally independent and we have full flexibility (root access). See for example https://www.ovh.com/fr/vps/vps-classic.xml. I generally love working with Heroku but I've recently been surprised by all the hidden costs that appear once you need a few options (background processes, static file hosting, redis-queue, mail sending service, ...)

Next steps

  • Discuss all of the above
  • Agree on the workload division
  • Brainstorm on the basic interfaces (top importance: between webservices and client, but also the database that acts as an interface between crawler and webservices and the consistency of the whole dataflow.)
  • Code!

Get basis of record categories

Description

For a given dataset, I want to know how many records have a certain basis of record. I also want to know how many of those are invalid. I envision this as a bar chart, where the records are grouped in categories based on basis of record.

Outcome

dataset_key
bor_preserved_specimen  // Preserved specimens
bor_fossil_specimen     // Fossil specimens
bor_living_specimen     // Living specimens
bor_material_sample     // Material samples
bor_observation         // Observations
bor_human_observation   // Human observations
bor_machine_observation // Machine observations
bor_literature          // Literature occurrences
bor_unknown             // Unknown

Terms we need

basisOfRecord

Ideally, use null if count is 0.

Questions

  • BASIS_OF_RECORD_INVALID returns 0 results. Most likely covered by "Unknown evidence" and can thus be ignored.
  • The basisOfRecord categories that GBIF provides are mutually exclusive.

Process

/* Map basisOfRecord to categories */
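A minimal sketch of that mapping; the category keys are the ones listed under Outcome, and the upper-case basisOfRecord enum names are an assumption worth double-checking against the download:

BOR_CATEGORIES = {
    "PRESERVED_SPECIMEN": "bor_preserved_specimen",
    "FOSSIL_SPECIMEN": "bor_fossil_specimen",
    "LIVING_SPECIMEN": "bor_living_specimen",
    "MATERIAL_SAMPLE": "bor_material_sample",
    "OBSERVATION": "bor_observation",
    "HUMAN_OBSERVATION": "bor_human_observation",
    "MACHINE_OBSERVATION": "bor_machine_observation",
    "LITERATURE": "bor_literature",
    "UNKNOWN": "bor_unknown",
}

def bor_category(record):
    # Anything missing or unexpected falls back to the unknown bucket.
    return BOR_CATEGORIES.get(record.get("basisOfRecord", "UNKNOWN"), "bor_unknown")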

hasCoordinate is always false

In the download file, the field hasCoordinate is provided, which is defined as:

Boolean indicating that a valid latitude and longitude exists. Even if existing it might still have issues, see hasGeospatialIssues and issue.

In the downloads we tested, however, that field was always false, even when valid coordinates were available. In the API, it works as it should.

Get taxon match categories

Description

For a given dataset, I want to know how many records provide a taxon. I also want to know how many of those match the GBIF taxonomy and if there are any issues.

Outcome

dataset_key
taxon_not_provided
taxon_match_none
taxon_match_higherrank
taxon_match_fuzzy
taxon_match_complete

Terms we need

scientificName
genus
issues

Process

IF scientificName = "" OR genus = ""
    /* If scientificName is empty, GBIF builds a name with genus, specificEpithet, etc, see
       https://github.com/gbif/occurrence/blob/master/occurrence-processor/src/main/java/org/gbif/occurrence/processor/interpreting/TaxonomyInterpreter.java#L34
       If scientificName is empty, we can check for genus (no need to check other atomized fields)
       Note: TAXON_MATCH_NONE is applied for empty taxa (unless record was indexed before that
       issue was applied). */
    THEN category = "taxon_not_provided"
ELSEIF issues CONTAINS (TAXON_MATCH_NONE)
    THEN category ="taxon_match_none"
ELSEIF issues CONTAINS (TAXON_MATCH_HIGHERRANK)
    THEN category = "taxon_match_higherrank"
ELSEIF issues CONTAINS (TAXON_MATCH_FUZZY)
    THEN category = "taxon_match_fuzzy"
ELSE category = "taxon_match_complete"

Overview of metrics we could show

Stats

Occurrences completeness (using issues logged by GBIF)

  • Completeness percentage for each Darwin Core term
  • Completeness of scientific name + taxon issues
  • Completeness of coordinates + coordinate issues
  • Completeness of higher geography + country/continent issues
  • Completeness of recorded date + recorded date issues
  • Completeness of elevation + elevation issues
  • Completeness of depth + depth issues
  • Completeness of basis of record + basis of record issues
  • Completeness of multimedia (sound, images, videos)

Precision

  • Precision of scientific name
  • Precision of coordinates

Occurrence range

  • Taxonomic range (already done on stats page): krona, treemap, etc.
  • Location range (already done on map): map is best way
  • Decade range (already in analytics)
  • Day of year (already in analytics): results not that interesting
  • Elevation range
  • Depth range
  • Last modified range: better to aggregate this into one metric
  • Type status metrics (already done on stats page) + type status issues
  • Random 20 pictures to have something visual

Ranking (requires full GBIF data, not for POC)

  • Most downloads
  • Widest range (taxonomic, etc.)
  • Recentness

Dataset type

  • Detect dataset type
  • Citizen science datasets with images
  • Fossil datasets
  • Tracking datasets
  • Material samples
  • Vegetation plots
  • Other machine observations
  • Trawls
  • Type specimens

Metadata

Gauge meter for metadata, based on something we need to define.

  • Completeness of sections
  • Word count
  • ...

Fix header display issue

(screenshot: header display issue, 2015-02-05)

The extension causes a display issue on home and stats pages (not on activity). Need to figure out how to solve this.

eventDate can be set blank with no issue thrown

Provided eventDates such as:

Are set to blank (in download files and API), with no issue such as RECORDED_DATE_INVALID thrown. You would expect some issue like RECORDED_DATE_NOT_INTERPRETED to identify those.

List of all issues GBIF provides

  • BASIS_OF_RECORD_INVALID ignored in #28
  • CONTINENT_COUNTRY_MISMATCH
  • CONTINENT_DERIVED_FROM_COORDINATES
  • CONTINENT_INVALID
  • COORDINATE_INVALID major in #23
  • COORDINATE_OUT_OF_RANGE #23
  • COORDINATE_REPROJECTED ignored in #23
  • COORDINATE_REPROJECTION_FAILED minor in #23
  • COORDINATE_REPROJECTION_SUSPICIOUS minor in #23
  • COORDINATE_ROUNDED ignored in #23
  • COUNTRY_COORDINATE_MISMATCH major in #23
  • COUNTRY_DERIVED_FROM_COORDINATES
  • COUNTRY_INVALID
  • COUNTRY_MISMATCH
  • DEPTH_MIN_MAX_SWAPPED
  • DEPTH_NON_NUMERIC
  • DEPTH_NOT_METRIC
  • DEPTH_UNLIKELY
  • ELEVATION_MIN_MAX_SWAPPED
  • ELEVATION_NON_NUMERIC
  • ELEVATION_NOT_METRIC
  • ELEVATION_UNLIKELY
  • GEODETIC_DATUM_ASSUMED_WGS84 ignored in #23
  • GEODETIC_DATUM_INVALID minor in #23
  • IDENTIFIED_DATE_INVALID
  • IDENTIFIED_DATE_UNLIKELY
  • MODIFIED_DATE_INVALID
  • MODIFIED_DATE_UNLIKELY
  • MULTIMEDIA_DATE_INVALID ignored in #43
  • MULTIMEDIA_URI_INVALID used in #43
  • PRESUMED_NEGATED_LATITUDE ignored in #23
  • PRESUMED_NEGATED_LONGITUDE ignored in #23
  • PRESUMED_SWAPPED_COORDINATE ignored in #23
  • RECORDED_DATE_INVALID
  • RECORDED_DATE_MISMATCH
  • RECORDED_DATE_UNLIKELY
  • REFERENCES_URI_INVALID
  • TAXON_MATCH_FUZZY used in #39
  • TAXON_MATCH_HIGHERRANK used in #39
  • TAXON_MATCH_NONE used in #39
  • TYPE_STATUS_INVALID
  • ZERO_COORDINATE major in #23

Cleanup/harmonize directory layout

@peterdesmet and I used specific directories for our code (extension, data_extension_module, ...) while @bartaelterman created a standard python package layout (src, bin, ...) at the repository root. We should agree on one approach and harmonize this before it becomes a mess.

Create downloads chart with timeseries x-axis

The downloads x-axis is currently based on an array with the number of days to be shown (from 0 to e.g. 50). It would be more flexible (and nicer for labels) if this x-axis was a time series:

  • No need to take care of dates with no downloads
  • Hover would show the date instead of a nondescript number
  • One could more easily load more data in the chart (e.g. moving back and forth in time)


Quick code:

var chart2 = c3.generate({
    bindto: "#downloadsChart2",
    data: {
        x: "x",
        columns: [
            ["x","2014-12-22","2014-12-23","2014-12-25","2014-12-27","2014-12-28","2014-12-29"],
            ["downloads",1,2,5,1,1,3]
        ],
        type: "bar"
    },
    axis: {
        x: {
            type: "timeseries",
            tick: {
                format: "%Y-%m-%d"
            }
        }
    }
});

Get taxonomy

I currently get the taxonomy data as:

{
    "Plantae": {
        "Pteridophyta": {
            "Polypodiopsida": {
                "Cyatheales": {
                    "Lophosoriaceae": 2,
                    "Cyatheaceae": 130,
                    "Dicksoniaceae": 216
                }
            }
        }
    }
}

That is really concise. For the visualization, though, I need:

{
    "name": "Plantae",
    "children": [
    {
        "name": "Pteridophyta",
        "children": [
        {
            "name": "Polypodiopsida",
            "children": [
            {
                "name": "Cyatheales",
                "children": [
                    { "name": "Lophosoriaceae", "size": 2 },
                    { "name": "Cyatheaceae", "size": 130 },
                    { "name": "Dicksoniaceae", "size": 216 }
                ]
            }]
        }]
    }]
}

Any suggestions on how to transform this? One option is sketched below.
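One possible answer, as a small Python sketch (the same recursion would work just as well in JavaScript on the client): walk the nested dict and wrap each level in a name/children object, turning leaf counts into size.

def to_d3_tree(name, node):
    if isinstance(node, dict):
        return {"name": name, "children": [to_d3_tree(child, value) for child, value in node.items()]}
    return {"name": name, "size": node}

taxonomy = {
    "Plantae": {
        "Pteridophyta": {
            "Polypodiopsida": {
                "Cyatheales": {
                    "Lophosoriaceae": 2,
                    "Cyatheaceae": 130,
                    "Dicksoniaceae": 216,
                }
            }
        }
    }
}

# This example has a single kingdom at the root; with several kingdoms you'd wrap
# them in an artificial root node instead.
kingdom, subtree = next(iter(taxonomy.items()))
d3_data = to_d3_tree(kingdom, subtree)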

Show number of records with valid coordinates

This metric is a combination of:

  1. decimalLatitude populated
  2. decimalLongitude populated
  3. No serious coordinate issues with either
  4. Optional: coordinates are in WGS84

Available terms

API Download Remarks
decimalLatitude yes yes
decimalLongitude yes yes
issues yes yes Controlled vocabulary, not limited to coordinate issues
geodeticDatum yes no
hasCoordinate no yes Boolean indicating that a valid latitude and longitude exists. Even if existing it might still have issues, see hasGeospatialIssues and issue. source
hasGeospatialIssues no yes Boolean indicating that some spatial validation rule has not passed. Primarily used to indicate that the record should not be displayed on a map. source

Proposal

Using hasCoordinate = true and hasGeospatialIssues = false seems like the best solution, even though it is a bit unclear which issues are ignored. Unfortunately, for the download I tested (with plenty of valid coordinates), hasCoordinate was always false (reported as a bug in #21). So we'd have to test ourselves for decimalLatitude != empty AND decimalLongitude != empty.

hasGeospatialIssues can still be used: it uses true and false. I'll figure out which issues are matched.

Create date quality categories

Description

For a given dataset, I want to know how many records have dates. I also want to know how many of those are useful, have issues, and maybe what their precision is. I envision this as a bar chart, where the records are grouped in categories based on the quality of the dates.

Categories (in order of increasing data quality)

  • Date not provided
  • Date with major issues
  • Date with minor issues
  • Valuable date (all in ISO8601)

Questions

  • GBIF sets many dates to blank without throwing an issue (see #27).
  • GBIF makes no attempt at signaling dubious dates: in fact, it throws RECORDED_DATE_MISMATCH if day, year, month are correctly provided.
  • RECORDED_DATE_UNLIKELY also matches invalid dates: 99 XXX 9999
  • So the 3 relevant issues RECORDED_DATE_INVALID, RECORDED_DATE_MISMATCH and RECORDED_DATE_UNLIKELY are of limited use for indicating data quality. One approach to provide much more relevant date information is to use the Canadensys Narwhal Processor.
  • If no eventDate is provided, GBIF doesn't seem to look in verbatimEventDate or year, month, day. The literal values of those fields are shown on the website though.

Terms we need

eventDate
issue
eventDate from verbatim.txt
verbatimEventDate
year
month
day

Process

IF eventDate != "" AND issue DOES NOT CONTAIN (
        RECORDED_DATE_MISMATCH
    )
    THEN category = "Valuable date (all in ISO8601)" /* Well, MM-DD-YYYY are still in there */
ELSEIF issue CONTAINS (
        RECORDED_DATE_MISMATCH /* The only issue that keeps eventDate populated */
        )
        OR verbatim.txt.eventDate != "" /* Since GBIF empties eventDate (see #27) in occurrence.txt, 
            we'd have to look in verbatim.txt :( */
        OR verbatimEventDate != ""
        OR year != ""
        OR (year != "" AND month != "")
        OR (year != "" AND month != "" AND day !="")
    /* A date was provided */
    THEN category = "Date provided, but not interpreted by GBIF"
ELSE
    category = "Date not provided"
