
ediorg / ecocomDP


A dataset design pattern and R package for ecological community data.

Home Page: https://ediorg.github.io/ecocomDP/

License: Other

R 74.23% HTML 25.04% CSS 0.34% JavaScript 0.20% Python 0.18%

ecocomDP's People

Contributors

cgries, clnsmth, karinorman, kzollove, mobb, sarapaull, savannahrayegonzales, sokole, will-rosenthal, yvanlebras


ecocomDP's Issues

distribution element should not refer to 'offline'

When data are uploaded from the desktop to PASTA, the code currently inserts a statement in the distribution element that the data are 'offline'. That causes PASTA to display the data as 'offline' even though they are there and can be downloaded.
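A minimal sketch of a fix, assuming the xml2 package; the function name, XPath, and base URL are illustrative, not the package's actual API. It replaces each distribution/offline element with an online/url pair pointing at the hosted entity.

library(xml2)

fix_offline_distribution <- function(eml_path, base_url) {
  eml <- read_xml(eml_path)
  # every distribution element that wrongly declares the entity offline
  nodes <- xml_find_all(eml, "//distribution[offline]")
  for (node in nodes) {
    entity <- xml_text(xml_find_first(node, "./ancestor::physical/objectName"))
    xml_remove(xml_find_first(node, "./offline"))
    online <- xml_add_child(node, "online")
    xml_add_child(online, "url", paste0(base_url, entity))
  }
  write_xml(eml, eml_path)
}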

Is package_id required?

If there is no parent data package for an ecocomDP, then there won't be a package_id for the observation or dataset_summary tables.

add name of person performing the conversion to L1 metadata

Comment from the LTER ASM: people would like to know who performed the conversion.
TBD: where in the metadata to put this. Some candidate elements:
metadataProvider
maintenance
[custom field in additionalMetadata]

Should not be considered:
creator (creator is reserved for intellectual contributions, not processing)
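If metadataProvider wins, a minimal sketch with the rOpenSci EML package might look like this (the file name and person are hypothetical):

library(EML)

eml <- read_eml("edi.123.1.xml")
# record the person who performed the conversion
eml$dataset$metadataProvider <- list(
  individualName = list(givenName = "Jane", surName = "Doe")
)
write_eml(eml, "edi.123.1.xml")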

find package ids for popler imports

The popler folks did not get all their datasets from the repository. Many (most) were downloaded from sites' individual websites and did not include the repository package id. These are labeled "NA" in the popler_knbid.csv file.

However, for every dataset I've looked for (approx 10, manually), a packageId exists. These are already in the list called L0_metacommunities.

Possible solutions to filling in the "NA"s:
A. continue manually (eew).
B. scrape the URL and look for more info, e.g., a DOI or packageId that was missed.
C. query titles in PASTA.

Will start with option C - many sites now use the same title, even if they are not displaying a PASTA packageId.

Popler (Aldo) is aware of this shortcoming in their process and may come up with a way to gather DOIs instead of the URLs they currently use to link out to metadata (as those URLs are already breaking).
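A rough sketch of option C, assuming PASTA's Solr-backed search endpoint (the query parameters shown are illustrative, not a tested ecocomDP function):

library(httr)

find_package_by_title <- function(title) {
  r <- GET(
    "https://pasta.lternet.edu/package/search/eml",
    query = list(q = paste0('title:"', title, '"'), fl = "packageid,title")
  )
  stop_for_status(r)
  # returns an XML listing of matching packageids to inspect
  content(r, as = "text", encoding = "UTF-8")
}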

suggest a sketch of sampling strategy

We asked synthesis scientists what the easiest way was for them to understand a sampling strategy. They said a visual is much simpler than text; a JPG is a good start.

add link here to Stevan session on lessons learned from synthesis: _______

Develop a naming scheme, so that not all tables of a type have the same name.

Corinna's comment:

When I use the ecocom_dp for any new incoming community datasets, I can't call the files all the same name. I.e., I can't just use 'observation', 'event', etc. over and over again. I am already stumbling on this one dataset because they gave me raw observations (several per lake) and then summaries for each lake. I do want to archive both, as people probably want both. So, what I have done for the file names now is prefix them with the study and postfix them with raw or summary. I.e., NTL_RS_Macrophytes_observation_raw.csv and NTL_RS_Macrophytes_observation_summary.csv. Of course, they could go into one file, but I am sure that would make it very difficult to use.

Table event

Table event doesn't have a primary key because one event can have several variable and value pairs.
It needs a record_id or something along those lines as the primary key.
event_id needs an index so it can be used as a foreign key in observation.

integrate terms from Guralnick et al

Citation:
Guralnick, R., Walls, R., and Jetz, W. 2017. Humboldt Core – toward a standardized capture of biological inventories for biodiversity monitoring, modeling and assessment. Ecography 40: 001–012. doi:10.1111/ecog.02942

person ids cause problems

In some original EML files, all people have an id attribute. When those people are then re-used somewhere else (e.g., in provenance) in the ecocomDP EML, the ids clash, because ids have to be unique within one EML document. So, the best approach is to strip out all id attributes associated with people information.
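A minimal sketch of the stripping step, assuming xml2 and an EML file on disk (the function name is made up):

library(xml2)

strip_person_ids <- function(eml_path) {
  eml <- read_xml(eml_path)
  # all responsible-party elements that may carry an id attribute
  people <- xml_find_all(
    eml, "//creator | //contact | //metadataProvider | //associatedParty"
  )
  xml_set_attr(people, "id", NULL)  # setting NULL removes the attribute
  write_xml(eml, eml_path)
}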

some indexes should not be integers

taxon_id should not be an integer; taxon ids are very frequently strings (at least at NTL, and I made them strings when I converted the CDR dataset).
sampling_location_id should not be an integer but character (varchar in MySQL).

redo infographic

Problem: the xrefs are not depicted correctly.
Fix in the SQL: put in the xref tables that the db needs, then block them out.

table dataset_summary

Several fields in dataset_summary should not be required. I am attaching a MySQL schema for reference.

add the mysql implementation to repo

Text:

/*
Navicat MySQL Data Transfer

Source Server : localhost
Source Server Version : 50513
Source Host : localhost:3306
Source Database : ecocom_dp

Target Server Type : MYSQL
Target Server Version : 50513
File Encoding : 65001

Date: 2017-07-11 16:23:12
*/

SET FOREIGN_KEY_CHECKS=0;


-- Table structure for dataset_summary


DROP TABLE IF EXISTS dataset_summary;
CREATE TABLE dataset_summary (
  dataset_summary_id int(11) NOT NULL,
  original_dataset_id varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  length_of_survey_years int(11) NULL DEFAULT NULL,
  number_of_years_sampled int(11) NULL DEFAULT NULL,
  std_dev_interval_betwe_years float NULL DEFAULT NULL,
  max_num_taxa int(11) NULL DEFAULT NULL,
  geo_extent_bounding_box_m2 float NULL DEFAULT NULL,
  PRIMARY KEY (dataset_summary_id)
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
COMMENT='summary statistics to evaluate the usefulness of a dataset'
;


-- Table structure for event


DROP TABLE IF EXISTS event;
CREATE TABLE event (
  unique_id int(11) NOT NULL,
  event_id int(11) NOT NULL,
  variable_name varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  value varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  PRIMARY KEY (unique_id)
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
COMMENT='info about a sampling event, e.g., conditions, weather'
;


-- Table structure for observation


DROP TABLE IF EXISTS observation;
CREATE TABLE observation (
  observation_id int(11) NOT NULL,
  event_id int(11) NOT NULL,
  dataset_summary_id int(11) NOT NULL,
  sampling_location_id varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  observation_datetime datetime NOT NULL,
  taxon_id varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  variable_name varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  value float NOT NULL,
  unit varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  PRIMARY KEY (observation_id),
  FOREIGN KEY (event_id) REFERENCES event (event_id) ON DELETE RESTRICT ON UPDATE RESTRICT,
  FOREIGN KEY (sampling_location_id) REFERENCES sampling_location (sampling_location_id) ON DELETE RESTRICT ON UPDATE RESTRICT,
  FOREIGN KEY (dataset_summary_id) REFERENCES dataset_summary (dataset_summary_id) ON DELETE RESTRICT ON UPDATE RESTRICT,
  FOREIGN KEY (taxon_id) REFERENCES taxon (taxon_id) ON DELETE RESTRICT ON UPDATE RESTRICT
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
COMMENT='table holds all the primary obs, with links'
;


-- Table structure for sampling_location


DROP TABLE IF EXISTS sampling_location;
CREATE TABLE sampling_location (
  sampling_location_id varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  sampling_location_name varchar(500) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  latitude float NULL DEFAULT NULL,
  longitude float NULL DEFAULT NULL,
  -- self-reference, so it shares the type of sampling_location_id
  parent_sampling_location_id varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (sampling_location_id)
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
;


-- Table structure for sampling_location_ancillary


DROP TABLE IF EXISTS sampling_location_ancillary;
CREATE TABLE sampling_location_ancillary (
  sampling_location_ancillary_id int(11) NOT NULL,
  sampling_location_id varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  datetime datetime NOT NULL,
  variable_name varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  value varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  unit varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (sampling_location_ancillary_id),
  FOREIGN KEY (sampling_location_id) REFERENCES sampling_location (sampling_location_id) ON DELETE RESTRICT ON UPDATE RESTRICT
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
COMMENT='info at each location during sampling'
;


-- Table structure for taxon


DROP TABLE IF EXISTS taxon;
CREATE TABLE taxon (
  taxon_id varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  taxon_rank varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  taxon_name varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  authority_system varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  authority_taxon_id varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (taxon_id)
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
COMMENT='taxonomic data'
;


-- Indexes structure for table event


CREATE INDEX event_id_index ON event(event_id) USING BTREE ;


-- Indexes structure for table observation


CREATE INDEX location_fk ON observation(sampling_location_id) USING BTREE ;
CREATE INDEX summary_fk ON observation(dataset_summary_id) USING BTREE ;
CREATE INDEX taxon_fk ON observation(taxon_id) USING BTREE ;
CREATE INDEX event_fk ON observation(event_id) USING BTREE ;


-- Indexes structure for table sampling_location_ancillary


CREATE INDEX location_ancillary_fk ON sampling_location_ancillary(sampling_location_id) USING BTREE ;

Create documentation

The dir is /documentation/. Use a combination of the original Google spreadsheet, the SQL implementations, and our experience with the first few datasets.

Validation check: event_id

The event_id field should be required only if the observation_ancillary table is present. Implement this change in the validate_column_presence function.
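A sketch of the conditional logic (not the actual validate_column_presence body); `tables` is assumed to be a named list of the dataset's data frames:

validate_event_id <- function(tables) {
  if ("observation_ancillary" %in% names(tables) &&
      !"event_id" %in% colnames(tables$observation)) {
    stop("observation_ancillary is present, so observation requires event_id")
  }
  invisible(TRUE)
}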

Recommendations about what to put in sampling_location_ancillary, what to put in event

The next decision I am struggling with is how much to put into sampling_location_ancillary. There is a pretty wide table with lake conditions (~25 of them); should I turn them all into the long format, or just leave the table as-is and attach it? It'll be needed by pretty much all datasets from that survey, which include water quality data that would not go into the ecocom_dp.
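If the long format wins out, a sketch of the conversion, assuming tidyr and a hypothetical lake_conditions data frame with one column per condition:

library(tidyr)

long <- pivot_longer(
  lake_conditions,
  cols = -c(sampling_location_id, datetime),  # keep the key columns
  names_to = "variable_name",
  values_to = "value"
)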

Functions to convert NEON to ecocomDP

Develop functions to display available NEON community data products and convert selections to the ecocomDP format. Resultant ecocomDPs may be written to file, but not archived in the EDI data repository.
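Not the eventual API, just a sketch of the shape such a function might take, using neonUtilities to pull a product (DP1.20120.001 is NEON's macroinvertebrate collection product):

library(neonUtilities)

convert_neon_to_ecocomDP <- function(dpID = "DP1.20120.001", site = "ARIK") {
  raw <- loadByProduct(dpID = dpID, site = site, check.size = FALSE)
  # mapping of the raw tables to observation, taxon, sampling_location,
  # etc. would go here; returning the raw download for now
  raw
}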

unique constraints for ecocomDP tables

The script should check that a group of fields in each table is unique.
observation table:
observation_id, event_id, package_id, sampling_location_id, observation_datetime, taxon_id, variable_name

Others TBD. Expect the unique group to be all the ids (except the record_id) up through variable_name (not value or unit); in other words, each row is unique except for record_id, value, and unit.
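A minimal sketch of such a check in base R: flag rows whose key-field combination (everything except record_id, value, and unit) is duplicated.

check_unique_group <- function(df, exclude = c("record_id", "value", "unit")) {
  keys <- df[, setdiff(colnames(df), exclude), drop = FALSE]
  dups <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  if (any(dups)) {
    warning(sum(dups), " rows share a duplicated key combination")
  }
  which(dups)
}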

script for long-to-wide for primary tables

To help make datasets in this model easy to use, we should put the three primary tables together as a single wide dataset.
Details TBD.
It will need to include the ids so ancillary data can be added on (by the user, ad hoc).
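A sketch of the long-to-wide step, assuming tidyr and an observation data frame shaped like the table defined above; the ids are kept as id_cols so ancillary tables can still be joined on:

library(tidyr)

wide <- pivot_wider(
  observation,
  id_cols = c(event_id, sampling_location_id, observation_datetime, taxon_id),
  names_from = variable_name,
  values_from = value
)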

clarify definition of timestamp

'timestamp' is the Postgres type for a datetime/date/time. Make it clear that this is not a table- or db-creation time.

Table sampling_location_ancillary

sampling_location_id should be varchar, not integer.
Check to make sure that datetime is typed correctly in Postgres (e.g., should it be datetime rather than timestamp?).

reexamine popler

The popler model was presented last week at ESA. We should revisit it to see where it overlaps with ours. My impression is that ours is more general, but if popler has traction (e.g., via working groups), we should use it (or parts of it).

Note below from C, via Slack:
[8:31 AM]
hey all, sitting in a Popler talk, it sure sounds like we should have used their database. I do remember deciding that it was too complicated ...

in the presentation it sounds like they already cleaned up the taxonomy for most of the LTER datasets

add db-style constraints to ecocomDP EML

Per a comment today (during the webinar), ecocomDP datasets are an ideal candidate for constraints described in metadata. Logging this here for future reference; it's lower priority right now.

evaluate similarity to DwCA and OBOE

Interesting model, @mobb, and nice work.

Your model seems quite convergent with the Darwin Core Archive (DwCA) format, which allows one to represent species-based sampling data in a standardized set of tables, and is the main mechanism for publishing data to GBIF. Have you considered whether you could achieve some sort of semantic parity with the DwCA model, especially on concepts like Observation, Event, and Taxon, all of which have received widespread debate and definition in the DwC world?

Also, can you tie your variable_name and similar table attributes to OBOE:Characteristic types so that we would have more than English names to suss out what these variables are? I've been working on a semantics extension to EML to allow just that, but because the ecocom format puts the column definitions into the rows of these tables, it would require additional mechanisms in EML to associate the formal semantics of the variables. It would be so great to have more than English names for the variables if you are going to this level of trouble to standardize. I was hoping @mobb and @mpsaloha would both be reviewing the EML semantics model soon!

Just some thoughts upon seeing your new work, feel free to close this issue if there's nothing to be done.

table taxon

taxon_level should be taxon_rank - I think I called it "level", and that's not what's generally used.
taxon_id should be varchar, not integer.

add priority to processing queue

The processing queue is a great idea, thanks! Can we add a column called priority? Of course things can move around (change order), but it would be a field to order by.

I suggest 1 for most important. When a dataset is done, its priority can be set to some really big number. I think leaving the finished ones on the list is a good idea, in case they need to be revisited.
