
nceas / datateam-training


Training and reference materials for ADC and SASAP data team members

Home Page: https://nceas.github.io/datateam-training/training/

License: Apache License 2.0

CSS 23.70% R 76.30%

datateam-training's People

Contributors: cwbeltz, dmullen17, drkrynstrng, dvirlar2, emilyodean, isteves, jeanetteclark, jkibele, kellywang1126, laijasmine, maier-m, mayasamet, rachelsun97, robyngit, sharisochs, smfreund, stao1, veeveetran


datateam-training's Issues

R 4.0.0 future updates

Keeping a running list of things that need updating when we move to R 4.0.0.

I don't anticipate many, but this will keep me from forgetting.

  • need to make sure arcticdatautils is available for R 4.0.0
  • remove stringsAsFactors = FALSE from attribute data.table creation
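The second bullet follows from a change in R 4.0.0's defaults; a minimal sketch (the attribute table columns here are illustrative) showing that the argument becomes redundant:

```r
# As of R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE,
# so the explicit argument can simply be dropped when building
# attribute tables (column names below are illustrative):
attributes <- data.frame(
  attributeName = c("sample_date", "water_temp"),
  attributeDefinition = c("Date of sample collection",
                          "Water temperature in degrees Celsius")
)

# Character columns stay character without any extra argument
is.character(attributes$attributeName)
```

Under R < 4.0.0 the same call would have produced factor columns, which is why stringsAsFactors = FALSE appears throughout the current training materials.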

adding unitlist to additional metadata

I was trying to add custom units based on the Editing EML training.

I couldn't run this line (it appears to coerce the unitlist to additionalMetadata):
doc$additionalMetadata <- c(as(unitlist, "additionalMetadata"))

The error seems to come from as(unitlist, "additionalMetadata"):

Error in as(unitlist, "additionalMetadata") : 
  no method or default for coercing “list” to “additionalMetadata”

Is it OK to use doc$additionalMetadata <- unitlist instead?

4.1 clarify text on pids

rm_pid <- "your_resource_map_pid"

pkg <- get_package(adc_test,
                   rm_pid,
                   file_names = TRUE)
  • add more guiding text to help clarify how to get pids through R, and refer back to this function whenever the document asks for pids
  • refer users back to chapter 2 for a refresher

`formatId` should be `format_id` in section 3.4

Section 3.4 has the following example code:

pid <- publish_object(adc_test,
                      path = path,
                      formatId = formatId)

...but the correct keyword argument for publish_object is format_id, not formatId. The code chunk should be:

pid <- publish_object(adc_test,
                      path = path,
                      format_id = formatId)

4.3 remove print button directions ?

In my version of the shiny_attributes app there doesn't seem to be the Print button mentioned, only Download, Quit App, and Help buttons. Is the workflow different now?

Once you are done editing a table in the app, click the Print button to print the text of code that will build a data.frame in R. Copy that code and assign it to a variable in your script (e.g. attributes <- data.frame(...)).

[Screenshot: shiny_attributes app]

Training revisions

Revising sections

  • Solr - Irene (Emily can still revise/etc!)
  • Git/RStudio - Emily (Emily, I can add more about the RStudio part if you don't do pointy clicky)
  • Formatting in EML (currently on Enterprise) - Irene
  • Add section about EML references (currently on Enterprise)
  • Expand on the exploring EML schema section in References: Editing EML - Steph
  • Insert more links to other training sections within existing text - Steph

New sections

  • Troubleshooting (how to deal with R errors) - Irene will start this/contributions welcome
  • Downloading data directly (from ADC, RT) - Irene/Mitchell
  • update_package_object - Irene written, needs to be incorporated
  • qa_package - Emily
  • using tmux for parallel processing - Irene (someone else can grab this from me if they wish!)
  • add_creator_id - Irene mostly written, might (?) be deprecated when new editor is released
  • getting attributes from shapefiles - Irene mostly written
  • show_indexing_status
  • remove_public_read/set_public_read
  • janitor::excel_numeric_to_date(), get_dupes(), clean_names() (vignette)
  • eml_get/data exploration section - Irene
  • SASAP projects - Steph
  • add checklist: https://github.nceas.ucsb.edu/KNB/arctic-data/blob/master/datateam/How_To/Checklist.md

Useful workflows (intern contributions)

Sharis

  • Adding taxonomic coverage
  • Adding single data temporal coverage
  • Adding data tables for a whole folder of files with the same attributes
  • Adding a pre generated DOI to the eml
  • Obsolescence chain
  • Adding sampling info in methods
  • Set rights and access
  • Working with NetCDF’s

Vivian

  • Reading in data
    • Single file
    • Multiple files
  • Removing rows and columns with all blank cells
  • Indices of cells that contain a certain string (or part of a string)
  • Reformatting dates/times (YYYY-MM-DDThh:mm:ss)
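The cleaning steps in this list are all plain base-R operations; a hedged sketch (the data frame and date format are hypothetical) covering the last three bullets:

```r
# Toy data frame standing in for a messy spreadsheet read
df <- data.frame(
  site = c("A", "", "B"),
  date = c("01/15/2020", "", "03/02/2020"),
  stringsAsFactors = FALSE
)

# Remove rows where every cell is blank or NA
blank <- is.na(df) | df == ""
df <- df[rowSums(!blank) > 0, ]

# Indices of cells in a column containing a certain string
which(grepl("2020", df$date))

# Reformat dates to ISO 8601 (assuming mm/dd/yyyy input)
df$date <- format(as.Date(df$date, format = "%m/%d/%Y"), "%Y-%m-%d")
```

For reading in multiple files with the same structure, the usual base-R pattern is lapply(list.files(...), read.csv) followed by do.call(rbind, ...).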

Other To-do's: update training with datamgmt functions
(moved from google doc)

Text doesn't match what's happening

In 3.2, it says:

For example, let’s take a look at eml-party. To start off, notice that some elements are in solid boxes, whereas others are in dashed boxes.

Elements are not in solid/dashed boxes; in fact, the eml-party schema looks nothing like the screenshots on the training page.

Add eml_get() to training

Not sure yet about the limits of the function, but it seems like a useful way to get into the EML without having to go in super deep with @ and [[1]].

Example: eml_get(eml, "methodStep")

Add new arcticdatautils functions to Editing EML chapter

doc <- eml_add_publisher(doc)
doc <- eml_add_entity_system(doc)

The result won't show up on the webpage but it should add a publisher element to the dataset element and a system to all of the entities based on what the PID is. This will help make our metadata more FAIR (Findable, Accessible, Interoperable, Reusable). Let me know if you run into issues!

Add super/subscript to references

Add a section on adding superscript and subscript to the abstract / methods. The abstract is more straightforward since there's only one section.

RT settings

Change user preferences to "immediately" and set "Show oldest history first" to "No".

2 broken links in 1st paragraph of datateam-training/workflows/explore_eml/understand_eml_schema.Rmd

Both links in this portion of the paragraph are broken -

Additional information on the schema and how different elements (or "slots") are defined can be found [here](https://knb.ecoinformatics.org/#external//emlparser/docs/eml-2.1.1/index.html). Further explanations of the symbology can be found [here](https://manual.altova.com/xmlspy/spyenterprise/index.html?xseditingviews_schv_cmview_objects.htm).

@jeanetteclark I poked around a little but wasn't sure whether the webpages I found would be appropriate replacements. I think you might be a better judge of that, or be able to think of some off the top of your head.

EML 2.2.0 updates

Keeping track of what needs to be updated when we switch to EML 2.2.0:

  • update final_review_checklist
  • update publish_an_object
  • add section on citations

Remove rawToChar from eml call

Not sure how many times this appears, but

eml <- EML::read_eml(rawToChar(dataone::getObject(mnT, pkg$metadata)))

can be changed to

eml <- EML::read_eml(dataone::getObject(mnT, pkg$metadata))

add section describing DBO specific considerations

For all DBO datasets:

  • the group CN=DBO,DC=dataone,DC=org should have readPermission and writePermission
  • geographic coverage should be one coverage per DBO line, with the geographicDescription and bounding coordinates matching those in the attached file
  • the name of the ship the data were collected from should be listed somewhere in the metadata record

[Attachment: geo_locs.txt]

Add commonly used custom units

Add commonly used custom units - much like the example solr queries section.

Some off the top of my head:

  • partsPerMillion
  • partsPerThousand
  • wattsPerSquareMeter
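A possible starting point, sketched as a custom units data frame. The column names follow the fields a custom unit typically needs, and the unitType, parentSI, and multiplierToSI values below are illustrative guesses to verify against the EML unit dictionary before use:

```r
# Hypothetical custom_units data frame; unitType, parentSI, and
# multiplierToSI values are assumptions to check against the
# EML unit dictionary before adding to a metadata record.
custom_units <- data.frame(
  id = c("partsPerMillion", "partsPerThousand", "wattsPerSquareMeter"),
  unitType = c("dimensionless", "dimensionless", "irradiance"),
  parentSI = c("dimensionless", "dimensionless", "wattsPerSquareMeter"),
  multiplierToSI = c(1e-06, 1e-03, 1),
  description = c("ratio of two quantities as parts per million",
                  "ratio of two quantities as parts per thousand",
                  "watts of energy per square meter"),
  stringsAsFactors = FALSE
)
```

A data frame like this could then be converted to a unit list with the EML package's unit-list helper before being attached to the document's additionalMetadata.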

Data Team Training Issues

  • Broken link in 1.2 Effective Data Management to Matt Jones et al.’s paper on effective data management
  • 404 Webpage not found for eml-party (2x) and eml-attribute, and eml-physical under 3.2 "Understand the EML schema"
  • Link not found for "exploring EML (more on that here)" under 3.3 "Access specific elements"
  • 404 Webpage not found for "attributeList" under 4.3 "Edit attributeLists"
  • 5.7.1 - Blank link - Under provided dataset "Nothing was found"

Add recover_failed_submission to reference guide

We should add a short section here about using the recover_failed_submission function. It should include details about what happens when a submission fails: metacatUI catches the submission error and uploads the EML text as a data object instead. Next, we can use recover_failed_submission to try to remove the error text and recover a valid EML document (this may not always work, depending on the error). Finally, we upload the recovered document as a metadata document and set the rights and access to the correct submitter.

Add more robust guidelines for dealing with packages

Related to work @maier-m is already doing with Kathryn/arctic-outreach.

Some questions that need to be answered:

  • When is it appropriate to touch the PI's data?
    - Excel to csv/txt changes are ok, but we prefer if the PI does it themselves
    - Changing headers are ok if there are other changes to be made; needs to be documented in the description and (preferably) also an R script
    - Changed files should be linked to originals with prov (in the future, we may want to obsolete the old versions of files and link them via prov, but prov is not yet robust enough for this)
  • What constitutes a "good enough" methods/abstract?

add taxonomicCoverage to checklist

All biological datasets should include some taxonomic coverage. We need to add this to the EML editing section and the final checklist.

break up chapter 4 and exercise 3

Make exercise 3 into parts A, B, C, etc.

  • will help chapter feel less long
  • gives a little more guidance to the user

e.g. exercise 3a - create the attribute table, exercise 3b - set physical, exercise 3c - review using the checklist

Custom units example

I noticed that we don't have an example of creating a custom units data frame in section 4.3.2 of Editing EML. Additionally, some of the references to the datamgmt functions in that section are now deprecated.

  • Create an example custom_units data frame with 3ish custom units
  • Remove any deprecated datamgmt functions and replace those workflows with appropriate instructions

explain `cn` and `mn` differences more clearly

Section 3.4 I think could benefit from a more robust description of what the different nodes are. This is the question I got about this section:

On step 3.4, do we set the PROD nodes before setting the Staging nodes? The staging nodes use cn, which is in the PROD node. When I set the Staging node in the console I get an error saying object: cn is not found. (I don't want to set something up on accident and end up submitting the training set using the PROD node.)

and the answer:

PROD and STAGING are two different coordinating nodes (`cn`), and each coordinating node has many member nodes (`mn`), including KNB and Arctic, which have different names depending on which coordinating node you are working in.

Introduction to Solr - More Use Cases

@isteves Some more use cases for Solr query would be helpful. For instance, what would the workflow look like? When would it be helpful to query Solr in context of the Arctic Data Center/data processing?

Replace links with html

@jagoldstein's idea: links currently open in the same tab instead of a new one. We should replace all markdown training links with the following html format: <a href="http://example.com/" target="_blank">example</a>
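A hedged sketch of how this substitution could be done programmatically over the training's source text. The regex deliberately handles only simple [text](url) links (not images or nested brackets), so results should be reviewed before committing:

```r
# Convert simple markdown links to HTML links that open in a new tab
md <- "See [example](http://example.com/) for details."

html <- gsub("\\[([^]]+)\\]\\(([^)]+)\\)",
             "<a href=\"\\2\" target=\"_blank\">\\1</a>",
             md)

html
# -> "See <a href=\"http://example.com/\" target=\"_blank\">example</a> for details."
```

Running this over each .Rmd with readLines()/writeLines() would convert the links in place; image links (![alt](url)) would need to be excluded first.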

Solr training issues

(1) Add tip for using "obsoletedBy" field to find only the most recent versions of the packages you're searching for.
(2) Add an example of a query that looks for packages where fields are missing (e.g. -keywords:*)
(3) The link highlighted at the bottom of the page in the attached image is broken.
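For (1) and (2), hedged examples of what those query parameters might look like, assuming the standard DataONE Solr index fields:

```
# Only the newest version of each metadata record
# (nothing obsoletes it):
q=-obsoletedBy:* AND formatType:METADATA

# Records where a field is missing entirely (e.g. no keywords):
q=-keywords:* AND formatType:METADATA
```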

Clarify 2.6 to use test.arcticdata.io

I had a little difficulty figuring out why I couldn't publish_object until I realized I had gotten my token from the regular arcticdata.io rather than test.arcticdata.io, because I had both open from earlier while coding along with the document.

I would suggest clarifying 2.6 to get the token from test.arcticdata.io, to prevent the user from following the hyperlink in 2.3 to the regular site.

add to chapter 8 - merge tickets

add information on how to merge multiple tickets (PI submitting multiple related datasets) into one for consolidated response
