datafable / gbif-dataset-metrics

Get insights in GBIF-mediated datasets with charts and metrics.

Home Page: https://chrome.google.com/webstore/detail/gbif-dataset-metrics/kcianglkepodpjdiebgidhdghoaeefba

License: MIT License

Python 66.75% CSS 5.82% JavaScript 22.33% HTML 5.11%

gbif-dataset-metrics's Introduction

GBIF dataset metrics

Rationale

The Global Biodiversity Information Facility (GBIF) facilitates access to over 13,233 species occurrence datasets, collectively holding more than 570 million records. GBIF dataset pages are important access points to GBIF-mediated data (e.g. via DOIs) and currently show dataset metadata, a map of georeferenced occurrences, some basic statistics, and a paged table of download events. If a user wants to know more about the occurrences a dataset contains, he/she has to filter/page through a table of occurrences or download the data. Neither is a convenient way to get quick insights or assess fitness for use.

Result

For the 2015 GBIF Ebbe Nielsen challenge, we developed a proof of concept for enhancing GBIF dataset pages with aggregated occurrence metrics. These metrics are visualized as stacked bar charts - showing the occurrence distribution for basis of record, coordinates, multimedia, and taxa matched with the GBIF backbone - as well as an interactive taxonomy partition and a recent downloads chart. Metrics that score particularly well are highlighted as achievements. Collectively these features not only inform the user what a dataset contains and if it is fit for use, but also help data publishers discover what aspects could be improved.

Screenshot

The proof of concept consists of two parts: 1) an extraction and aggregation module to process GBIF occurrence downloads and calculate, aggregate, and store the metrics for each dataset and 2) a Google Chrome extension, allowing you to view these metrics in context on the GBIF website.

For the 2015 GBIF Ebbe Nielsen Challenge - Round 2, we added a sample of the images referenced in (the occurrences of) a dataset. Together with the multimedia bar and achievement, it highlights the currently undervalued multimedia richness of some datasets. We also improved our extraction and aggregation module to process all GBIF occurrences on the Amazon EC2 infrastructure and are now able to provide metrics for all GBIF occurrence datasets. We strongly believe, however, that the functionality of our proof of concept - if considered useful - should be implemented on the GBIF infrastructure. For our motivation on this, including its challenges and opportunities, see our feedback to the jury comments.

Installation

Install the Google Chrome Extension and visit a GBIF dataset page.

How it works


Limitations

  • The metrics are processed using a download of all occurrences on September 1, 2015. It contains 13,221 occurrence datasets, covering 570,238,726 occurrences. If a dataset has been published or republished since then, it respectively won't have metrics or those might be out of date. If so, a message will be shown on the dataset page. If you want us to reprocess a specific dataset, submit an issue.

Follow @Datafable to be notified of new metrics or improvements.

Contributors

Developed by Datafable: bartaelterman, niconoe, peterdesmet.

License

LICENSE

gbif-dataset-metrics's People

Contributors

bartaelterman, niconoe, peterdesmet


gbif-dataset-metrics's Issues

Good fossil specimen datasets

Search

  1. Searched and downloaded all records with basisOfRecord=FOSSIL_SPECIMEN (see the API sketch after this list)
  2. Created a list of datasets for which some of the geologicalContext fields are populated
  3. Indicated for each field how many occurrences have it populated.
  4. This resulted in 96 datasets, only 18 of which have geologicalContext info.
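A minimal sketch of step 1, assuming the public GBIF occurrence search API and its facet parameters are used (the actual analysis was based on a full occurrence download):

import requests

API = "https://api.gbif.org/v1/occurrence/search"

def fossil_specimen_count():
    """Total number of records with basisOfRecord=FOSSIL_SPECIMEN."""
    response = requests.get(API, params={"basisOfRecord": "FOSSIL_SPECIMEN", "limit": 0})
    response.raise_for_status()
    return response.json()["count"]

def fossil_specimen_datasets(facet_limit=100):
    """Per-dataset record counts via the datasetKey facet (first facet_limit datasets)."""
    params = {
        "basisOfRecord": "FOSSIL_SPECIMEN",
        "limit": 0,
        "facet": "datasetKey",
        "facetLimit": facet_limit,
    }
    response = requests.get(API, params=params)
    response.raise_for_status()
    counts = response.json()["facets"][0]["counts"]
    return {c["name"]: c["count"] for c in counts}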

Result

datasetKey remark eon era period epoch age
58d0f326-2e85-4d0a-a744-571461220f00 Animals excl., Devonian to recent 120159 / 120159 214661 / 214661 208114 / 208114 345 / 345
854bc2fc-f762-11e1-a439-00145eb45e9a Mixed taxa, full spectrum, levels missing 291896 / 27417 / 71292 /
9654e4d4-f762-11e1-a439-00145eb45e9a Animals, describes spectrum 144074 / 32143 / 20139 /
5f1568db-5d5b-4597-9c66-dfbd0d7ddd7b Animals excl., Mesozoic, Cenozoic 26994 / 26994 26977 / 26977 26857 / 26857
fa87d03e-4959-451d-865f-ff03bb798339 Animals, Cenozoic 27403/ 27403 27403 / 27403 / 3 27381 / 180 19809 / 4049
9643f840-f762-11e1-a439-00145eb45e9a 66448 / 61360 / 21605 /
9642d0e6-f762-11e1-a439-00145eb45e9a 62178 / 41970 / 25615 /
bec73359-a65f-4402-9692-f498ef0e71cb 59175 / 54189 / 5770 /
7a28b1b1-9d4c-4aeb-b239-38865417b5ea 9137/ 9137 9137 / 9137 9137 / 9137 9136 / 2944 1151 / 14
7c67da7f-490e-4e8d-8848-7dc152dd4734 7865/ 7865 7865 / 7865 7817 / 7817 7865 / 236 4147 /
bea28c6b-4282-4e0e-894d-7c65d050ffa9 10001 / 10001 9921 / 9921 8899 / 8899
5fbfc32a-2eee-4bdd-8ca4-e6fca275a7a8 5450/ 944 5427 / 944 5299 / 939 5223 / 907 4935 / 902
bd2feca8-ec39-4480-9dad-e353ab6a506d 12555 / 12555
93082fdb-1a18-40de-a25c-5d1b594a370d 7348/
47881e45-febd-4622-b7a1-6efbce4fd7b3 1306 / 1306 1305 / 1305 866 / 866
d538ab33-a853-471a-8e1d-be808dd7b922 1387 / 1387
81880628-c616-417e-8d1a-d519577c2087
7bb2d451-5ffa-4d58-bc7f-19ea7aecb201 71/ 71 71 / 71 71 / 71 70 / 2 66 / 66

Get coordinate quality categories

Description

For a given dataset, I want to know how many records have coordinates. I also want to know how many of those are useful, have issues, and maybe what their precision is.

Outcome

dataset_key
coordinates_not_provided // Coordinates not provided
coordinates_major_issues // Coordinates with major issues
coordinates_minor_issues // Coordinates with minor issues
coordinates_valid  // Valid coordinates (all in WGS84)

Terms we need

decimalLatitude
decimalLongitude
issue

Questions

  • PRESUMED_SWAPPED_COORDINATE, PRESUMED_NEGATED_LATITUDE, and PRESUMED_NEGATED_LONGITUDE could be flagged to the provider as minor issues, but since GBIF corrects them, the resulting coordinates are quite valuable to the user. Where would you group them?

Process

IF issue CONTAINS (
        COORDINATE_INVALID /* Can appear for invalid verbatim => no decimal coordinates */
        COORDINATE_OUT_OF_RANGE /* Can appear for invalid verbatim => no decimal coordinates */
        ZERO_COORDINATE
        COUNTRY_COORDINATE_MISMATCH
        /* I consider COUNTRY_COORDINATE_MISMATCH as a major issue,
           since it looks like GBIF only applies this when there are no country issues, 
           such as COUNTRY_INVALID */
    )
    THEN category = "coordinates_major_issues"
ELSEIF issue CONTAINS (
        GEODETIC_DATUM_INVALID /* Always followed by GEODETIC_DATUM_ASSUMED_WGS84,
            but it does indicate that the provider wanted to indicate the datum. */
        COORDINATE_REPROJECTION_FAILED /* Then GBIF just uses the original ones */
        COORDINATE_REPROJECTION_SUSPICIOUS /* Indicates successful coordinate reprojection
            according to provided datum, but which results in a datum shift larger 
            than 0.1 decimal degrees.*/
    )
    THEN category = "coordinates_minor_issues"
ELSEIF decimalLatitude = "" OR decimalLongitude = ""
    /* Not sure if we need to test for isNumber(), I think GBIF transforms those already */
    /* Also, this ELSEIF could appear between major and minor issues, as minor issues will always 
        have coordinates. I placed it here to have all issue checking first. */
    THEN category = "coordinates_not_provided"
ELSE category = "coordinates_valid"
    /* This can include issues like:
         GEODETIC_DATUM_ASSUMED_WGS84
         COORDINATE_REPROJECTED
         COORDINATE_ROUNDED (to 5 decimals)
         PRESUMED_SWAPPED_COORDINATE
         PRESUMED_NEGATED_LATITUDE
         PRESUMED_NEGATED_LONGITUDE
       Although these are issues, they are all corrected by GBIF and result in valid WGS84 coordinates. */
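A hedged Python version of the pseudocode above. It assumes each record is a dict of occurrence download columns (decimalLatitude, decimalLongitude, issue) and that the issue column is a semicolon-separated list of issue names; adjust the parsing if the download format differs:

MAJOR_COORDINATE_ISSUES = {
    "COORDINATE_INVALID",
    "COORDINATE_OUT_OF_RANGE",
    "ZERO_COORDINATE",
    "COUNTRY_COORDINATE_MISMATCH",
}
MINOR_COORDINATE_ISSUES = {
    "GEODETIC_DATUM_INVALID",
    "COORDINATE_REPROJECTION_FAILED",
    "COORDINATE_REPROJECTION_SUSPICIOUS",
}

def coordinate_category(record):
    # issue is assumed to be a semicolon-separated string, e.g. "ZERO_COORDINATE;COUNTRY_INVALID"
    issues = set(filter(None, record.get("issue", "").split(";")))
    if issues & MAJOR_COORDINATE_ISSUES:
        return "coordinates_major_issues"
    if issues & MINOR_COORDINATE_ISSUES:
        return "coordinates_minor_issues"
    if not record.get("decimalLatitude") or not record.get("decimalLongitude"):
        return "coordinates_not_provided"
    return "coordinates_valid"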

Get media type categories

Description

For a given dataset, I want to know how many records have associated media. I also want to know the types and if there are any issues.

Output

datasetKey
media_not_provided
media_url_invalid
media_audio
media_video
media_image

Terms we need

mediaType
issues

Process

IF issues CONTAINS ( MULTIMEDIA_URI_INVALID )
    /* Most records with this issue have no mediaType, but 890 have mediaType=STILLIMAGE.
       We want usable media, so we check this issue first */
    THEN category = "media_url_invalid"
ELSEIF mediaType CONTAINS ( MOVINGIMAGE )
    /* The mediaType categories are not mutually exclusive: it seems 25 records have more than 1
        (possible via extension). To get mutually exclusive categories, we process them in order. */ 
    THEN category = "media_video"
ELSEIF mediaType CONTAINS ( AUDIO )
    THEN category = "media_audio"
ELSEIF mediaType CONTAINS ( STILLIMAGE )
    THEN category = "media_image"
ELSE category = "media_not_provided"

Time interpretation

Time provided in eventDate

  • Verbatim: 2014-01-08T13:17:36Z
  • Website: Jan 8, 2014 1:17:36 PM Correct
  • API: 2014-01-08T12:17:36.000+0000 Correct
  • Download: ? Haven't tested yet.

Time provided in eventTime

  • Verbatim: 2014-04-18 + 09:45
  • Website: Apr 18, 2014 12:00:00 AM Time not included.
  • API: 2014-04-17T22:00:00.000+0000 Why minus two hours from midnight? The location (Colombia) is 5 hours behind GMT, so that can't be it.
  • Download: 2014-04-18T00:00Z Time not included.

Do more calls for more download data

This download function currently makes only one call to the GBIF API, so if dayRange is high and pageLimit is low, it might not retrieve all downloads. It would be better if more calls could be triggered, as sketched below.
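A sketch of how the paging could work, assuming the downloads for a dataset are listed at the registry endpoint occurrence/download/dataset/{datasetKey} with GBIF's usual limit/offset paging and endOfRecords flag (the extension's actual function is JavaScript; this only illustrates the loop):

import requests

def get_all_downloads(dataset_key, page_limit=20, max_pages=50):
    """Collect download records for a dataset by paging until endOfRecords."""
    url = "https://api.gbif.org/v1/occurrence/download/dataset/%s" % dataset_key
    downloads, offset = [], 0
    for _ in range(max_pages):
        page = requests.get(url, params={"limit": page_limit, "offset": offset}).json()
        downloads.extend(page.get("results", []))
        if page.get("endOfRecords", True):
            break
        offset += page_limit
    return downloads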

Taxonomy visualization: alternatives

Different alternatives for showing the taxonomy. @niconoe, @bartaelterman, preferences?

Zoomable horizontal partition

http://bl.ocks.org/mbostock/1005873

  • + Shows all levels
  • + Compact
  • - No horizontal place for labels (hard to predict)
  • - Labels do not depend on zoom level, but we could make use of something like http://bl.ocks.org/jczaplew/7546689

Zoomable vertical partition

http://mbostock.github.io/d3/talk/20111018/partition.html

  • + Place for labels
  • + Label depends on zoom level
  • + Shows all levels
  • + Can show nodes with no children (e.g. up to genus)
  • - Will not use full width if not all levels are shown, but width could be dynamic

Zoomable circle packing

http://bl.ocks.org/mbostock/7607535

  • + Place for labels (but often a bit hard to see)
  • + Level depends on zoom level
  • + Shows most levels
  • - Requires space
  • - Difficult to assess distribution of size

Zoomable treemap

http://bost.ocks.org/mike/treemap/

  • + Place for labels
  • + Level depends on zoom level
  • - Only shows 2 levels
  • - Difficult to assess distribution of size

Show sample of images

Example dataset

I'm going to use Australia's Virtual Herbarium as an example dataset for testing. datasetKey = 4ce8e3f9-2546-4af1-b28d-e2eadf05dfd4

  • It has multiple issues
  • It has multiple basis of records
  • It has multiple kingdoms
  • It has over 5 million records, so it's a good test for performance

Show taxonomy

@niconoe, @bartaelterman: preference, feedback on colour?

1

Neutral grey, just like the kingdom table. Doesn't really invite clicking.

(screenshot: grey)

2

Text = link colour = interaction. Background = same colour as sometimes used on the home page. Grey border is not the best, as you can't quickly see the smaller groups (as you can in 3 and 4, which use white).

(screenshot: light blue)

3

Background colour = link colour = interaction. Children grey. Too bright?

(screenshot: blue)

4

Idem, but with grey unknowns.

(screenshot: blue with grey unknowns)

If you hover over the children, the cursor changes to a zoom-out symbol, which is what the viz does when you click on a child.

Approach, Architecture and technical proposals

High-level architecture

  1. Crawler: On a regular basis, retrieve data from GBIF, run analyses and fill a database with configurable metrics.
  2. Webservices: Expose the content of the database with a friendly API
  3. Client: consume the webservices and present results in a beautiful way
  4. Packaging: embed the client so GBIF pages are enriched in place (instead of a separate website)

Technology proposal

These are the choices I'd make if I had to implement all of this myself. They were selected for two main reasons: familiarity and fitness for use. I'm open to all criticism since 1) many other tools would work well too, and 2) familiarity is important and related to the person in charge of each module.

I'd also propose to adhere to the KISS principle and avoid jumping on every cool kid's tool before it's clear that their technical benefits outweigh the cost of use (learning curve / hidden complexity / maintenance cost).

As a first step, I think the best solution is to implement it transversally (a minimal working prototype of each component), then iterate on each in parallel. That gives maximum flexibility, giving plenty of opportunities to refine and fine-tune the technological choices and interfaces between modules.

Backend: Crawler + webservices

Django (exposing JSON) + PostgreSQL:

  • Very proven solution.
  • Using Django for the first two modules will provide good facilities and an integrated solution: for example using the ORM and helpers from both the crawler (django commands run by cron) and the webservices.
  • Plenty of available extensions for Django on every topic, including REST/webservices (django-tastypie, django-rest-framework, ...), although I'm not sure they will be needed at all.
  • Super easy to add an admin interface and additional web pages if necessary.

Alternative solution: lighter tools glued together (Flask + external crawler scripts + Postgres + ... )

TODO: design the basis of the data model.
TODO: major question: get data from API or Darwin Core Archive.

Note: if the crawler consumes GBIF web services, I recently developed a (currently tiny, quick and dirty) package to use them. It is currently embedded in another project. To avoid reinventing the wheel, I'd like to take time to extract it into a proper (documented, tested and PyPI-distributed) Python package. Opinions?

Note: if we consume DwCA: python-dwca-reader.

Frontend: Client

D3.js + jQuery + optional client MVC framework

Frontend: Packaging

Chrome extension? Greasemonkey? Both? Additional web pages?

Workload division

  • A critical and urgent question, IMHO.
  • May have an impact on the technological choices.
  • I think we can basically divide it into 4 work packages that follow the architecture/modules. We will need a good rough idea of the interfaces between these 4 modules. We may also have to add one or two "utility" work packages: sysadmin-deployment/project management/...
  • I (Nico) am primarily interested in module 1) Crawler and, if time allows, 2) Webservices
  • I (Nico) am willing to soon create a rough prototype of 1) and 2) that will allow us (after also creating a quick prototype of 3 and 4) to validate the whole dataflow/architecture/technology choices.

Hosting

  • Also a decision that could impact the technological choices.
  • Options: using a server we already have access to / VPS / Cloud-based solution
  • At first look, I prefer the VPS solution: it's cheap, we are totally independent and we have full flexibility (root access). See for example https://www.ovh.com/fr/vps/vps-classic.xml. I generally love working with Heroku but I've recently been surprised by all the hidden costs that appear once you need a few options (background processes, static file hosting, redis-queue, mail sending service, ...)

Next steps

  • Discuss all of the above
  • Agree on the workload division
  • Brainstorm on the basic interfaces (top importance: between webservices and client, but also the database that acts as an interface between crawler and webservices and the consistency of the whole dataflow.)
  • Code!

Get basis of record categories

Description

For a given dataset, I want to know how many records have a certain basis of record. I also want to know how many of those are invalid. I envision this as a bar chart, where the records are grouped in categories based on basis of record.

Outcome

dataset_key
bor_preserved_specimen  // Preserved specimens
bor_fossil_specimen     // Fossil specimens
bor_living_specimen     // Living specimens
bor_material_sample     // Material samples
bor_observation         // Observations
bor_human_observation   // Human observations
bor_machine_observation // Machine observations
bor_literature          // Literature occurrences
bor_unknown             // Unknown

Terms we need

basisOfRecord

Ideally, use null if count is 0.

Questions

  • BASIS_OF_RECORD_INVALID returns 0 results. Most likely covered by "Unknown evidence" and can thus be ignored.
  • The basisOfRecord categories that GBIF provides are mutually exclusive.

Process

/* Map basisOfRecord to categories */
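A minimal sketch of that mapping; the category keys are the ones listed under Outcome, and the upper-case basisOfRecord enum names are an assumption worth double-checking against the download:

BOR_CATEGORIES = {
    "PRESERVED_SPECIMEN": "bor_preserved_specimen",
    "FOSSIL_SPECIMEN": "bor_fossil_specimen",
    "LIVING_SPECIMEN": "bor_living_specimen",
    "MATERIAL_SAMPLE": "bor_material_sample",
    "OBSERVATION": "bor_observation",
    "HUMAN_OBSERVATION": "bor_human_observation",
    "MACHINE_OBSERVATION": "bor_machine_observation",
    "LITERATURE": "bor_literature",
    "UNKNOWN": "bor_unknown",
}

def bor_category(record):
    # Anything missing or unexpected falls back to the unknown bucket.
    return BOR_CATEGORIES.get(record.get("basisOfRecord", "UNKNOWN"), "bor_unknown")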

hasCoordinate is always false

In the download file, the field hasCoordinate is provided, which is defined as:

Boolean indicating that a valid latitude and longitude exists. Even if existing it might still have issues, see hasGeospatialIssues and issue.

In the downloads we tested, however, that field was always false, even when valid coordinates were available. In the API, it works as it should.

Get taxon match categories

Description

For a given dataset, I want to know how many records provide a taxon. I also want to know how many of those match the GBIF taxonomy and if there are any issues.

Outcome

dataset_key
taxon_not_provided
taxon_match_none
taxon_match_higherrank
taxon_match_fuzzy
taxon_match_complete

Terms we need

scientificName
genus
issues

Process

IF scientificName = "" OR genus = ""
    /* If scientificName is empty, GBIF builds a name with genus, specificEpithet, etc, see
       https://github.com/gbif/occurrence/blob/master/occurrence-processor/src/main/java/org/gbif/occurrence/processor/interpreting/TaxonomyInterpreter.java#L34
       If scientificName is empty, we can check for genus (no need to check other atomized fields)
       Note: TAXON_MATCH_NONE is applied for empty taxa (unless record was indexed before that
       issue was applied). */
    THEN category = "taxon_not_provided"
ELSEIF issues CONTAINS (TAXON_MATCH_NONE)
    THEN category ="taxon_match_none"
ELSEIF issues CONTAINS (TAXON_MATCH_HIGHERRANK)
    THEN category = "taxon_match_higherrank"
ELSEIF issues CONTAINS (TAXON_MATCH_FUZZY)
    THEN category = "taxon_match_fuzzy"
ELSE category = "taxon_match_complete"

Overview of metrics we could show

Stats

Occurrences completeness (using issues logged by GBIF)

  • Completeness percentage for each Darwin Core term
  • Completeness of scientific name + taxon issues
  • Completeness of coordinates + coordinate issues
  • Completeness of higher geography + country/continent issues
  • Completeness of recorded date + recorded date issues
  • Completeness of elevation + elevation issues
  • Completeness of depth + depth issues
  • Completeness of basis of record + basis of record issues
  • Completeness of multimedia (sound, images, videos)

Precision

  • Precision of scientific name
  • Precision of coordinates

Occurrence range

  • Taxonomic range (already done on stats page): krona, treemap, etc.
  • Location range (already done on map): map is best way
  • Decade range (already in analytics)
  • Day of year (already in analytics): results not that interesting
  • Elevation range
  • Depth range
  • Last modified range: better to aggregate this into one metric
  • Type status metrics (already done on stats page) + type status issues
  • Random 20 pictures to have something visual

Ranking (requires full GBIF data, not for POC)

  • Most downloads
  • Widest range (taxonomic, etc.)
  • Recentness

Dataset type

  • Detect dataset type
  • Citizen science datasets with images
  • Fossil datasets
  • Tracking datasets
  • Material samples
  • Vegetation plots
  • Other machine observations
  • Trawls
  • Type specimens

Metadata

Gauge meter for metadata, based on something we need to define.

  • Completeness of sections
  • Word count
  • ...

Fix header display issue

(screenshot: header display issue, 2015-02-05)

The extension causes a display issue on home and stats pages (not on activity). Need to figure out how to solve this.

eventDate can be set blank with no issue thrown

Provided eventDates such as:

Are set to blank (in download files and API), with no issue such as RECORDED_DATE_INVALID thrown. You would expect some issue like RECORDED_DATE_NOT_INTERPRETED to identify those.

List of all issues GBIF provides

  • BASIS_OF_RECORD_INVALID ignored in #28
  • CONTINENT_COUNTRY_MISMATCH
  • CONTINENT_DERIVED_FROM_COORDINATES
  • CONTINENT_INVALID
  • COORDINATE_INVALID major in #23
  • COORDINATE_OUT_OF_RANGE #23
  • COORDINATE_REPROJECTED ignored in #23
  • COORDINATE_REPROJECTION_FAILED minor in #23
  • COORDINATE_REPROJECTION_SUSPICIOUS minor in #23
  • COORDINATE_ROUNDED ignored in #23
  • COUNTRY_COORDINATE_MISMATCH major in #23
  • COUNTRY_DERIVED_FROM_COORDINATES
  • COUNTRY_INVALID
  • COUNTRY_MISMATCH
  • DEPTH_MIN_MAX_SWAPPED
  • DEPTH_NON_NUMERIC
  • DEPTH_NOT_METRIC
  • DEPTH_UNLIKELY
  • ELEVATION_MIN_MAX_SWAPPED
  • ELEVATION_NON_NUMERIC
  • ELEVATION_NOT_METRIC
  • ELEVATION_UNLIKELY
  • GEODETIC_DATUM_ASSUMED_WGS84 ignored in #23
  • GEODETIC_DATUM_INVALID minor in #23
  • IDENTIFIED_DATE_INVALID
  • IDENTIFIED_DATE_UNLIKELY
  • MODIFIED_DATE_INVALID
  • MODIFIED_DATE_UNLIKELY
  • MULTIMEDIA_DATE_INVALID ignored in #43
  • MULTIMEDIA_URI_INVALID used in #43
  • PRESUMED_NEGATED_LATITUDE ignored in #23
  • PRESUMED_NEGATED_LONGITUDE ignored in #23
  • PRESUMED_SWAPPED_COORDINATE ignored in #23
  • RECORDED_DATE_INVALID
  • RECORDED_DATE_MISMATCH
  • RECORDED_DATE_UNLIKELY
  • REFERENCES_URI_INVALID
  • TAXON_MATCH_FUZZY used in #39
  • TAXON_MATCH_HIGHERRANK used in #39
  • TAXON_MATCH_NONE used in #39
  • TYPE_STATUS_INVALID
  • ZERO_COORDINATE major in #23

Cleanup/harmonize directory layout

@peterdesmet and I used specific directories for our code (extension, data_extension_module, ...) while @bartaelterman created a standard python package layout (src, bin, ...) at the repository root. We should agree on one approach and harmonize this before it becomes a mess.

Create downloads chart with timeseries x-axis

The downloads x-axis is currently based on an array with the number of days to be shown (from 0 to e.g. 50). It would be more flexible (and nicer for labels) if this x-axis was a time series:

  • No need to take care of dates with no downloads
  • Hover would show the date instead of a nondescript number
  • One could more easily load more data in the chart (e.g. moving back and forth in time)


Quick code:

var chart2 = c3.generate({
    bindto: "#downloadsChart2",
    data: {
        x: "x",
        columns: [
            ["x","2014-12-22","2014-12-23","2014-12-25","2014-12-27","2014-12-28","2014-12-29"],
            ["downloads",1,2,5,1,1,3]
        ],
        type: "bar"
    },
    axis: {
        x: {
            type: "timeseries",
            tick: {
                format: "%Y-%m-%d"
            }
        }
    }
});

Get taxonomy

I currently get the taxonomy data as:

{
    "Plantae": {
        "Pteridophyta": {
            "Polypodiopsida": {
                "Cyatheales": {
                    "Lophosoriaceae": 2,
                    "Cyatheaceae": 130,
                    "Dicksoniaceae": 216
                }
            }
        }
    }
}

That is really concise. For the visualization, though, I need:

{
    "name": "Plantae",
    "children": [
    {
        "name": "Pteridophyta",
        "children": [
        {
            "name": "Polypodiopsida",
            "children": [
            {
                "name": "Cyatheales",
                "children": [
                    { "name": "Lophosoriaceae", "size": 2 },
                    { "name": "Cyatheaceae", "size": 130 },
                    { "name": "Dicksoniaceae", "size": 216 }
                ]
            }]
        }]
    }]
}

Any suggestions on how to transform this? One option is sketched below.
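One possible answer, as a small Python sketch (the same recursion would work just as well in JavaScript on the client): walk the nested dict and wrap each level in a name/children object, turning leaf counts into size.

def to_d3_tree(name, node):
    if isinstance(node, dict):
        return {"name": name, "children": [to_d3_tree(child, value) for child, value in node.items()]}
    return {"name": name, "size": node}

taxonomy = {
    "Plantae": {
        "Pteridophyta": {
            "Polypodiopsida": {
                "Cyatheales": {
                    "Lophosoriaceae": 2,
                    "Cyatheaceae": 130,
                    "Dicksoniaceae": 216,
                }
            }
        }
    }
}

# This example has a single kingdom at the root; with several kingdoms you'd wrap
# them in an artificial root node instead.
kingdom, subtree = next(iter(taxonomy.items()))
d3_data = to_d3_tree(kingdom, subtree)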

Show number of records with valid coordinates

This metric is a combination of:

  1. decimalLatitude populated
  2. decimalLongitude populated
  3. No serious coordinate issues with either
  4. Optional: coordinates are in WGS84

Available terms

API Download Remarks
decimalLatitude yes yes
decimalLongitude yes yes
issues yes yes Controlled vocabulary, not limited to coordinate issues
geodeticDatum yes no
hasCoordinate no yes Boolean indicating that a valid latitude and longitude exists. Even if existing it might still have issues, see hasGeospatialIssues and issue. source
hasGeospatialIssues no yes Boolean indicating that some spatial validation rule has not passed. Primarily used to indicate that the record should not be displayed on a map. source

Proposal

Using hasCoordinate = true and hasGeospatialIssues = false seems like the best solution, even though it is a bit unclear which issues are ignored. Unfortunately, for the download I tested (with plenty of valid coordinates), hasCoordinate was always false (reported as a bug in #21). So we'd have to test ourselves for decimalLatitude != empty AND decimalLongitude != empty.

hasGeospatialIssues can still be used: it uses true and false. I'll figure out which issues are matched.

Create date quality categories

Description

For a given dataset, I want to know how many records have dates. I also want to know how many of those are useful, have issues, and maybe what their precision is. I envision this as a bar chart, where the records are grouped in categories based on the quality of the dates.

Categories (in order of increasing data quality)

  • Date not provided
  • Date with major issues
  • Date with minor issues
  • Valuable date (all in ISO8601)

Questions

  • GBIF sets many dates to blank without throwing an issue (see #27).
  • GBIF makes no attempt at signaling dubious dates: in fact, it throws RECORDED_DATE_MISMATCH if day, year, month are correctly provided.
  • RECORDED_DATE_UNLIKELY also matches invalid dates: 99 XXX 9999
  • So the 3 relevant issues RECORDED_DATE_INVALID, RECORDED_DATE_MISMATCH and RECORDED_DATE_UNLIKELY are of limited use for indicating data quality. One approach to provide much more relevant date information is to use the Canadensys Narwhal Processor.
  • If no eventDate is provided, GBIF doesn't seem to look in verbatimEventDate or year, month, day. The literal values of those fields are shown on the website though.

Terms we need

eventDate
issue
eventDate from verbatim.txt
verbatimEventDate
year
month
day

Process

IF eventDate != "" AND issue DOES NOT CONTAIN (
        RECORDED_DATE_MISMATCH
    )
    THEN category = "Valuable date (all in ISO8601)" /* Well, MM-DD-YYYY are still in there */
ELSEIF issue CONTAINS (
        RECORDED_DATE_MISMATCH /* The only issue that keeps eventDate populated */
        )
        OR verbatim.txt.eventDate != "" /* Since GBIF empties eventDate (see #27) in occurrence.txt, 
            we'd have to look in verbatim.txt :( */
        OR verbatimEventDate != ""
        OR year != ""
        OR (year != "" AND month != "")
        OR (year != "" AND month != "" AND day !="")
    /* A date was provided */
    THEN category = "Date provided, but not interpreted by GBIF"
ELSE
    category = "Date not provided"
