The reusabledata-staging from kltm

EPIC for main resource data curation

This top-level item is to coordinate the main curation effort. Add all sources still to do in alphabetical order below. Curation is requested, in double alphabetical order by name, from @jmcmurry @kltm @lrwyatt @lwinfree @mellybelly @rchampieux.

When an annotation on a resource is completed, please remember to switch its status to complete.

Remember, if you find any criteria violations, you'll likely need to create a new field license-issues; see the schema below.

The data resource repo is here: https://github.com/kltm/reusabledata-staging/tree/master/data-sources
I think we would prefer a PR workflow at this point. If you have questions about that, feel free to ask somebody who might know. The curation and editing should be able to be done directly through the GitHub interface, but feel free to use any method you're comfortable with.

All documents have been seeded with data from an earlier version of the Monarch data spreadsheet; they may not be completely up to date and may contain inaccuracies that I introduced during the port. Also, an older version of the criteria schema contained the fields license-downstream-positive and license-downstream-negative, which attempted to track license language that specifically noted downstream use. These are no longer separate fields, and items that used to be in them have all been moved to the license-commentary field, sometimes with a leading + or -. If you feel that the comments are no longer pertinent, are too verbose, or do not make sense, please feel free to remove them; if we want them back, they are in the document history.

To see all the available annotation slots and some comments about their values, please see the schema here: https://github.com/kltm/reusabledata-staging/blob/master/scripts/source.schema.yaml

Usable abbreviations for many known licenses can be found here: https://spdx.org/licenses/

Need definitions for license enums

In the schema we specify:

  ## The license that is used.
  ## Should try and use SPDX where we can: https://spdx.org/licenses/
  ## or: "unknown", "public domain", "all right reserved", or "custom".
  "license":
    type: str
    required: yes
  ## The type of license that is being used.
  ## If you do not know, enter "TODO".
  ## E.g. "unknown", "copyleft", "permissive", "copyright", "restrictive", etc.?
  "license-type":
    type: str
    required: yes

Let's define the use of these enums. For example, is CC-BY-ND permissive or restrictive?
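One way to make the question concrete is to check each record's license-type against an explicit set of allowed values. The sketch below is only an illustration: it assumes the candidate values listed in the schema comment (plus "TODO") become the enum, the names `LICENSE_TYPES` and `check_license_type` are hypothetical, and it deliberately does not decide where CC-BY-ND belongs.

```python
# Candidate values taken from the schema comment; the final enum is still
# undecided, so treat this set as an assumption.
LICENSE_TYPES = {"unknown", "copyleft", "permissive", "copyright", "restrictive"}
PLACEHOLDER = "TODO"  # the schema allows "TODO" when the type is not yet known


def check_license_type(record):
    """Return a list of problems found in a record's license fields."""
    problems = []
    # Both fields are marked "required: yes" and "type: str" in the schema.
    for field in ("license", "license-type"):
        value = record.get(field)
        if not isinstance(value, str) or not value:
            problems.append(f"{field}: required string is missing")
    license_type = record.get("license-type")
    if isinstance(license_type, str) and license_type:
        if license_type == PLACEHOLDER:
            problems.append("license-type: still marked TODO")
        elif license_type not in LICENSE_TYPES:
            problems.append(f"license-type: unrecognized value {license_type!r}")
    return problems
```

Once the enum is agreed, the set could be replaced by whatever list the schema ends up defining, and a check like this could flag records that still need a decision.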

Add acknowledgement footer

Getting some ideas down here:

ReusableData.org was developed as part of the Monarch Initiative and the NCATS Biomedical Data Translator, where the reuse and free redistribution of publicly available data for disease discovery was burdensome. ReusableData.org was created to help others navigate the legal redistribution of public data and to help data providers make it easier for others to reuse their data.

ReusableData.org is funded by the National Center for Advancing Translational Sciences (NCATS) under OT3 TR002019 as part of the Biomedical Data Translator project.

We are grateful to the many original sources of our data for allowing their integration.

Add something about licensing/attribution for images.

FlyBase Curation Notes

Clearly stated license for data use:

This is not clearly present, and what is presented is not expressed via a standard license. We might want to consider using this as a criterion for "clearly".
Another thought: if a license is clearly stated but is restrictive, would it get a "point"? Technically, via the current wording of this dimension, I think it would. Should we be more specific?

Allows use and reuse:

Use and reuse are allowed, but there are exceptions and limitations.
Question - I think we need to clarify for ourselves and others the relationship between the criteria dimensions. For example, is it one star per dimension? If so, equal star counts could mean very different things depending on which stars are actually awarded.
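To illustrate the concern, here is a minimal sketch that assumes one star per dimension, which is exactly the assumption in question. The dimension identifiers are hypothetical slugs based on the rubric headings in these notes, and the function `star_count` and both example resources are made up for illustration.

```python
# The five rubric dimensions, named after the headings in these notes.
DIMENSIONS = (
    "clearly-stated",
    "use-and-reuse",
    "non-discriminatory",
    "non-revocable",
    "freely-and-openly-available",
)


def star_count(awarded):
    """Count stars, assuming one star per satisfied dimension."""
    unknown = set(awarded) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return len(set(awarded))


# Two hypothetical resources with the same score but different strengths.
resource_a = {"clearly-stated", "non-revocable"}
resource_b = {"use-and-reuse", "freely-and-openly-available"}
same_score = star_count(resource_a) == star_count(resource_b)
```

Both hypothetical resources receive the same number of stars, yet they share no awarded dimension, so the count alone does not say what a resource actually does well.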

Non-discriminatory

Do we need to discriminate between instances when discriminatory reuse is unavoidable versus arbitrary?

Non-revocable

No issues with this dimension - however, we might want to clarify how we are regarding this in our criteria versus legal questions of revocable and irrevocable licensing.

Freely and openly available

I think it is confusing how we are applying this dimension of the rubric. Are free and open the core requirements here? This comment is more about the language we're using in the description of the dimension.

Add grant numbers to curated entities

It would be nice to have these, especially where NIH funded

EPIC for page content

some related work

Just parking this here.
This resource: https://neo4j.het.io/browser/
also had the same problem as us: http://www.nature.com/news/legal-confusion-threatens-to-slow-data-science-1.20359?WT.mc_id=TWT_NatureNews
They have a data license table here: https://github.com/dhimmel/integrate/blob/d482033bcaa913a976faf4a6ee08497281c739c3/licenses/README.md

I like how they annotate Nodes/edges:
Source indicates the date when and location where the license information was retrieved. Blank values indicate no licensing information was found. Institution indicates where the resource was created. Funder indicates who funded the project and links to the source of the funding information.

Bootstrap popovers do not function on paged items

In the DataTable, while the Bootstrap 4a popovers function as advertised on the first page of results, they stop functioning on "paged" results.

In all likelihood, Bootstrap does not get to run its init code on those rows before DataTable takes them away from the display. We either need to let Bootstrap go first, or get Bootstrap to re-init on each page change.

Add proper reading list for site

We should have a list of relevant reading materials. I will post these here. @kltm, let me know what format we want, and I can also make a file for them elsewhere.

differentiate "undetermined" from "custom"

"undetermined" should be used only when we cannot find license info, yes? And if so, do we still link to where it should be but isn't?

Also curate "recommendations"

Curate the single most important thing that a resource could do to improve; keep it constructive.

Request for help with Coriell Institute

I'm having trouble figuring out the Coriell Institute resource, for which I've done a disappointing first pass: https://github.com/kltm/reusabledata-staging/blob/master/data-sources/coriell-institute.yaml

I guess it's really two questions. The first is: is Coriell Institute actually a real and public upstream resource for data, or is it something else that just happens to be in Monarch for some quirky reason?

As this was in the initial spreadsheet and in dipper (https://github.com/monarch-initiative/dipper/blob/4cec8174bc713702cd0eecd06ce83847d23da164/dipper/sources/Coriell.py#L64), it seems like there should be data there, but I have been unable to find any working my way in from the top-level public website so far (see question 2 as well). As well, the dipper class has private keys and passwords to possibly internal SFTP servers. Is this actually some kind of private resource that Monarch has negotiated access to? If so, should we be evaluating it?

The second question is, assuming that this is a legit public resource that we want to grade: I can find no reason to not give it 0 stars as it makes no real mention of data or license besides some spreadsheets I stumbled across; does this feel right?
