
ediorg / ecocomDP


A dataset design pattern and R package for ecological community data.

Home Page: https://ediorg.github.io/ecocomDP/

License: Other

R 74.23% HTML 25.04% CSS 0.34% JavaScript 0.20% Python 0.18%

ecocomDP's People

Contributors

cgries, clnsmth, karinorman, kzollove, mobb, sarapaull, savannahrayegonzales, sokole, will-rosenthal, yvanlebras


ecocomDP's Issues

distribution element should not refer to 'offline'

When data are uploaded from the desktop to PASTA, the code currently inserts a statement in the distribution element that the data are 'offline'. That causes PASTA to display the data as 'offline' even though they are there and can be downloaded.
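A minimal sketch of a fix, assuming the xml2 package; the function name, XPath, and base URL are illustrative, not the package's actual API. It replaces each distribution/offline element with an online/url pair pointing at the hosted entity.

library(xml2)

fix_offline_distribution <- function(eml_path, base_url) {
  eml <- read_xml(eml_path)
  # every distribution element that wrongly declares the entity offline
  nodes <- xml_find_all(eml, "//distribution[offline]")
  for (node in nodes) {
    entity <- xml_text(xml_find_first(node, "./ancestor::physical/objectName"))
    xml_remove(xml_find_first(node, "./offline"))
    online <- xml_add_child(node, "online")
    xml_add_child(online, "url", paste0(base_url, entity))
  }
  write_xml(eml, eml_path)
}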

Is package_id required?

If there is no parent data package for an ecocomDP, then there won't be a package_id for the observation or dataset_summary tables.

add name of person performing the conversion to L1 metadata

Comment from the LTER ASM: people would like to know who performed the conversion.
TBD: where in the metadata to put this. Some candidate elements:
metadataProvider
maintenance
[custom field in additionalMetadata]

Should not be considered:
creator (creator is reserved for intellectual contributions, not processing)
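If metadataProvider wins, a minimal sketch with the rOpenSci EML package might look like this (the file name and person are hypothetical):

library(EML)

eml <- read_eml("edi.123.1.xml")
# record the person who performed the conversion
eml$dataset$metadataProvider <- list(
  individualName = list(givenName = "Jane", surName = "Doe")
)
write_eml(eml, "edi.123.1.xml")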

find package ids for popler imports

The popler folks did not get all their datasets from the repository. Many (most) were downloaded from sites' individual websites and did not include the repository package id. These are labeled "NA" in the popler_knbid.csv file.

However, for every dataset I've looked for (approx 10, manually), a packageId exists. These are already in the list called L0_metacommunities.

Possible solutions to filling in the "NA"s:
A. continue manually (eew).
B. scrape the URL and look for more info, e.g., a DOI or packageId that was missed.
C. query titles in PASTA.

Will start with option C - many sites now use the same title, even if they are not displaying a PASTA packageId.

Popler (Aldo) is aware of this shortcoming in their process and may come up with a way to gather DOIs instead of the URLs they currently use to link out to metadata (as those URLs are already breaking).
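A rough sketch of option C, assuming PASTA's Solr-backed search endpoint (the query parameters shown are illustrative, not a tested ecocomDP function):

library(httr)

find_package_by_title <- function(title) {
  r <- GET(
    "https://pasta.lternet.edu/package/search/eml",
    query = list(q = paste0('title:"', title, '"'), fl = "packageid,title")
  )
  stop_for_status(r)
  # returns an XML listing of matching packageids to inspect
  content(r, as = "text", encoding = "UTF-8")
}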

suggest a sketch of sampling strategy

We asked synthesis scientists what the easiest way was for them to understand a sampling strategy. They said a visual is much simpler than text; a JPG is a good start.

add link here to Stevan session on lessons learned from synthesis: _______

Develop a naming scheme, so that not all tables of a type have the same name.

Corinna's comment:

When I use the ecocom_dp for any new incoming community datasets, I can't call the files all the same name. I.e., I can't just use 'observation', 'event', etc. over and over again. I am already stumbling on this one dataset because they gave me raw observations (several per lake) and then summaries for each lake. I do want to archive both, as people probably want both. So, what I have done for the file names now is prefix them with the study and postfix them with raw or summary. I.e., NTL_RS_Macrophytes_observation_raw.csv and NTL_RS_Macrophytes_observation_summary.csv. Of course, they could go into one file, but I am sure that would make it very difficult to use.

Table event

Table event doesn't have a primary key because one event can have several variable and value pairs.
It needs a record_id or something along those lines as the primary key.
event_id needs an index so it can be used as a foreign key in observation.

integrate terms from Guralnick et al

Citation:
Guralnick, R., Walls, R., and Jetz, W. 2017. Humboldt Core – toward a standardized capture of biological inventories for biodiversity monitoring, modeling and assessment. Ecography 40: 001–012. doi:10.1111/ecog.02942

person ids cause problems

In some original EML files, all people have an id attribute. When those people are then re-used somewhere else (e.g., in provenance) in the ecocomDP EML, the ids clash, because ids have to be unique within one EML document. So, the best approach is to strip out all id attributes associated with people information.
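A minimal sketch of the stripping step, assuming xml2 and an EML file on disk (the function name is made up):

library(xml2)

strip_person_ids <- function(eml_path) {
  eml <- read_xml(eml_path)
  # all responsible-party elements that may carry an id attribute
  people <- xml_find_all(
    eml, "//creator | //contact | //metadataProvider | //associatedParty"
  )
  xml_set_attr(people, "id", NULL)  # setting NULL removes the attribute
  write_xml(eml, eml_path)
}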

some indexes should not be integers

taxon_id should not be an integer; taxon ids are very frequently strings (at least at NTL, and I made them strings when I converted the CDR dataset).
sampling_location_id should not be an integer but character (varchar in MySQL).

redo infographic

Problem: the xrefs are not depicted correctly.
Fix in the SQL: put in the xref tables that the db needs, then block them out.

table dataset_summary

Several fields in dataset_summary should not be required. I am attaching a MySQL schema for reference.

add the mysql implementation to repo

Text:

/*
Navicat MySQL Data Transfer

Source Server : localhost
Source Server Version : 50513
Source Host : localhost:3306
Source Database : ecocom_dp

Target Server Type : MYSQL
Target Server Version : 50513
File Encoding : 65001

Date: 2017-07-11 16:23:12
*/

SET FOREIGN_KEY_CHECKS=0;


-- Table structure for dataset_summary


DROP TABLE IF EXISTS dataset_summary;
CREATE TABLE dataset_summary (
  dataset_summary_id int(11) NOT NULL,
  original_dataset_id varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  length_of_survey_years int(11) NULL DEFAULT NULL,
  number_of_years_sampled int(11) NULL DEFAULT NULL,
  std_dev_interval_betwe_years float NULL DEFAULT NULL,
  max_num_taxa int(11) NULL DEFAULT NULL,
  geo_extent_bounding_box_m2 float NULL DEFAULT NULL,
  PRIMARY KEY (dataset_summary_id)
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
COMMENT='summary statistics to evaluate the usefulness of a dataset'
;


-- Table structure for event


DROP TABLE IF EXISTS event;
CREATE TABLE event (
  unique_id int(11) NOT NULL,
  event_id int(11) NOT NULL,
  variable_name varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  value varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  PRIMARY KEY (unique_id)
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
COMMENT='info about a sampling event, e.g., conditions, weather'
;


-- Table structure for observation


DROP TABLE IF EXISTS observation;
CREATE TABLE observation (
  observation_id int(11) NOT NULL,
  event_id int(11) NOT NULL,
  dataset_summary_id int(11) NOT NULL,
  sampling_location_id varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  observation_datetime datetime NOT NULL,
  taxon_id varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  variable_name varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  value float NOT NULL,
  unit varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  PRIMARY KEY (observation_id),
  FOREIGN KEY (event_id) REFERENCES event (event_id) ON DELETE RESTRICT ON UPDATE RESTRICT,
  FOREIGN KEY (sampling_location_id) REFERENCES sampling_location (sampling_location_id) ON DELETE RESTRICT ON UPDATE RESTRICT,
  FOREIGN KEY (dataset_summary_id) REFERENCES dataset_summary (dataset_summary_id) ON DELETE RESTRICT ON UPDATE RESTRICT,
  FOREIGN KEY (taxon_id) REFERENCES taxon (taxon_id) ON DELETE RESTRICT ON UPDATE RESTRICT
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
COMMENT='table holds all the primary obs, with links'
;


-- Table structure for sampling_location


DROP TABLE IF EXISTS sampling_location;
CREATE TABLE sampling_location (
  sampling_location_id varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  sampling_location_name varchar(500) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  latitude float NULL DEFAULT NULL,
  longitude float NULL DEFAULT NULL,
  -- self-reference, so it shares the type of sampling_location_id
  parent_sampling_location_id varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (sampling_location_id)
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
;


-- Table structure for sampling_location_ancillary


DROP TABLE IF EXISTS sampling_location_ancillary;
CREATE TABLE sampling_location_ancillary (
  sampling_location_ancillary_id int(11) NOT NULL,
  sampling_location_id varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  datetime datetime NOT NULL,
  variable_name varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  value varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  unit varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (sampling_location_ancillary_id),
  FOREIGN KEY (sampling_location_id) REFERENCES sampling_location (sampling_location_id) ON DELETE RESTRICT ON UPDATE RESTRICT
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
COMMENT='info at each location during sampling'
;


-- Table structure for taxon


DROP TABLE IF EXISTS taxon;
CREATE TABLE taxon (
  taxon_id varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  taxon_rank varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  taxon_name varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  authority_system varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  authority_taxon_id varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (taxon_id)
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
COMMENT='taxonomic data'
;


-- Indexes structure for table event


CREATE INDEX event_id_index ON event(event_id) USING BTREE ;


-- Indexes structure for table observation


CREATE INDEX location_fk ON observation(sampling_location_id) USING BTREE ;
CREATE INDEX summary_fk ON observation(dataset_summary_id) USING BTREE ;
CREATE INDEX taxon_fk ON observation(taxon_id) USING BTREE ;
CREATE INDEX event_fk ON observation(event_id) USING BTREE ;


-- Indexes structure for table sampling_location_ancillary


CREATE INDEX location_ancillary_fk ON sampling_location_ancillary(sampling_location_id) USING BTREE ;

Create documentation

The dir is /documentation/. Use a combination of the original Google spreadsheet, the SQL implementations, and our experience with the first few datasets.

Validation check: event_id

The event_id field should be required only if the observation_ancillary table is present. Implement this change in the validate_column_presence function.
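A sketch of the conditional logic (not the actual validate_column_presence body); `tables` is assumed to be a named list of the dataset's data frames:

validate_event_id <- function(tables) {
  if ("observation_ancillary" %in% names(tables) &&
      !"event_id" %in% colnames(tables$observation)) {
    stop("observation_ancillary is present, so observation requires event_id")
  }
  invisible(TRUE)
}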

Recommendations about what to put in sampling_location_ancillary, what to put in event

The next decision I am struggling with is how much to put into sampling_location_ancillary. There is a pretty wide table with lake conditions (~25 of them); should I turn them all into the long format, or just leave the table as-is and attach it? It'll be needed by pretty much all datasets from that survey, which include water quality data that would not go into the ecocom_dp.
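If the long format wins out, a sketch of the conversion, assuming tidyr and a hypothetical lake_conditions data frame with one column per condition:

library(tidyr)

long <- pivot_longer(
  lake_conditions,
  cols = -c(sampling_location_id, datetime),  # keep the key columns
  names_to = "variable_name",
  values_to = "value"
)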

Functions to convert NEON to ecocomDP

Develop functions to display available NEON community data products and convert selections to the ecocomDP format. Resultant ecocomDPs may be written to file, but not archived in the EDI data repository.
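Not the eventual API, just a sketch of the shape such a function might take, using neonUtilities to pull a product (DP1.20120.001 is NEON's macroinvertebrate collection product):

library(neonUtilities)

convert_neon_to_ecocomDP <- function(dpID = "DP1.20120.001", site = "ARIK") {
  raw <- loadByProduct(dpID = dpID, site = site, check.size = FALSE)
  # mapping of the raw tables to observation, taxon, sampling_location,
  # etc. would go here; returning the raw download for now
  raw
}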

unique constraints for ecocomDP tables

The script should check that a group of fields in each table is unique.
observation table:
observation_id, event_id, package_id, sampling_location_id, observation_datetime, taxon_id, variable_name

Others TBD. Expect the unique group to be all the ids (except the record_id) up through variable_name (not value or unit); in other words, each row is unique except for record_id, value, and unit.
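A minimal sketch of such a check in base R: flag rows whose key-field combination (everything except record_id, value, and unit) is duplicated.

check_unique_group <- function(df, exclude = c("record_id", "value", "unit")) {
  keys <- df[, setdiff(colnames(df), exclude), drop = FALSE]
  dups <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  if (any(dups)) {
    warning(sum(dups), " rows share a duplicated key combination")
  }
  which(dups)
}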

script for long-to-wide for primary tables

To help make datasets in this model easy to use, we should put the three primary tables together as a single wide dataset.
Details TBD.
It will need to include the ids so ancillary data can be added on (by the user, ad hoc).
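A sketch of the long-to-wide step, assuming tidyr and an observation data frame shaped like the table defined above; the ids are kept as id_cols so ancillary tables can still be joined on:

library(tidyr)

wide <- pivot_wider(
  observation,
  id_cols = c(event_id, sampling_location_id, observation_datetime, taxon_id),
  names_from = variable_name,
  values_from = value
)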

clarify definition of timestamp

'timestamp' is the Postgres type for a datetime/date/time. Make it clear that this is not a table- or db-creation time.

Table sampling_location_ancillary

sampling_location_id should be varchar, not integer.
Check to make sure that datetime is typed correctly in Postgres (e.g., should it be datetime rather than timestamp?).

reexamine popler

The popler model was presented last week at ESA. We should revisit it to see where it overlaps with ours. My impression is that ours is more general, but if popler has traction (e.g., via working groups), we should use it (or parts of it).

Note below from C, via Slack:
[8:31 AM]
hey all, sitting in a Popler talk, it sure sounds like we should have used their database. I do remember deciding that it was too complicated ...

in the presentation it sounds like they already cleaned up the taxonomy for most of the LTER datasets

add db-style constraints to ecocomDP EML

Per a comment today (during the webinar), ecocomDP datasets are an ideal candidate for constraints described in metadata. Logging this here for future reference; it's lower priority right now.

evaluate similarity to DwCA and OBOE

Interesting model, @mobb, and nice work.

Your model seems quite convergent with the Darwin Core Archive (DwCA) format, which allows one to represent species-based sampling data in a standardized set of tables, and is the main mechanism for publishing data to GBIF. Have you considered whether you could achieve some sort of semantic parity with the DwCA model, especially on concepts like Observation, Event, and Taxon, all of which have received widespread debate and definition in the DwC world?

Also, can you tie your variable_name and similar table attributes to OBOE:Characteristic types so that we would have more than English names to suss out what these variables are? I've been working on a semantics extension to EML to allow just that, but because the ecocom format puts the column definitions into the rows of these tables, it would require additional mechanisms in EML to associate the formal semantics of the variables. It would be so great to have more than English names for the variables if you are going to this level of trouble to standardize. I was hoping @mobb and @mpsaloha would both be reviewing the EML semantics model soon!

Just some thoughts upon seeing your new work, feel free to close this issue if there's nothing to be done.

table taxon

taxon_level should be taxon_rank - I think I called it "level", and that's not what's generally used.
taxon_id should be varchar, not integer.

add priority to processing queue

The processing queue is a great idea, thanks! Can we add a column called priority? Of course things can move around (change order), but it would be a field to order by.

I suggest 1 for most important. When a dataset is done, its priority can be set to some really big number. I think leaving the finished ones on the list is a good idea, in case they need to be revisited.
