metadatacenter-attic / phs-gdc

PHS-GDC Prototype

HTML 3.20% CSS 6.26% JavaScript 75.67% Python 12.98% R 1.88%

data-commons react data-science knowledge-graph phs metadata semantic-technologies knowledge-representation knowledge-base environmental-data

phs-gdc's Introduction

Introducing the PHS-Data Commons Wizard Prototype

This advanced prototype blends the talents of the Google Data Commons team, the Stanford Center for Population Health Sciences (PHS) team, and the Stanford Center for Biomedical Informatics Research (BMIR) team to create a low-friction tool to discover and use freely available research data.

See the Instructions section for information on using this tool.

Opportunity

The data

Google's Data Commons repository provides access to many public data sets for research in many domains. The Data Commons Graph aggregates data from many different data sources into a single graph database, so that it can be accessed in a consistent way. This data is browsable by place or entity, and publicly accessible via APIs, with numerous Python libraries and Python workbook examples.

The researchers

Researchers in the Stanford PHS team, like all biomedical data researchers, must find, obtain, and integrate useful data quickly. At PHS, the goal is to improve the health of populations by bringing together diverse disciplines and data to understand and address social, environmental, behavioral, and biological factors on both a domestic and global scale. PHS takes a data management approach that openly seeks more advanced ways to access and integrate related data sets, and with BMIR provides research-centric leadership of the PHS-Data Commons project.

The developers

The BMIR team brings its experience with research data and metadata, semantic technologies, and highly usable and scalable research software to make Data Commons resources readily accessible to PHS researchers and the larger research community. Through analysis of the data models and availability in the Data Commons, the BMIR team finds and implements highly efficient ways for researchers to find any relevant data and bring it into their own data sets.

The Data Commons Wizard

The Data Commons Wizard prototype provides a simple interface through which researchers can specify the locations and topics for which they want data. Researchers can select whether they want to receive the timestamp and/or other provenance for the data. The Wizard can download the latest data for those locations and topics, as they are available in the Data Commons, or provide R code that will query the Data Commons directly for the data. (Code is also available to integrate the resulting values into the researcher's data table, using the chosen location type (zip codes, city name, etc.) as the lookup index for the retrieved values.)
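
The integration step described above can be illustrated with a minimal sketch. The Wizard itself generates R for this step; Python is used here only for illustration, and the function name `merge_by_location`, the column names, and all values are hypothetical.

```python
# Hypothetical researcher table; column names and values are illustrative only.
researcher_rows = [
    {"zip": "94305", "participants": 12},
    {"zip": "94040", "participants": 7},
    {"zip": "99999", "participants": 3},  # no retrieved value for this location
]

# Values as they might arrive in a downloaded file, keyed by the same location type.
retrieved = {
    "94305": {"Count_Person": 100},
    "94040": {"Count_Person": 200},
}


def merge_by_location(rows, values, key, variables):
    """Left-join retrieved values onto rows, using `key` as the lookup index."""
    merged = []
    for row in rows:
        found = values.get(row[key], {})
        # Missing locations or variables become None rather than being dropped.
        extra = {var: found.get(var) for var in variables}
        merged.append({**row, **extra})
    return merged


merged_rows = merge_by_location(researcher_rows, retrieved, "zip", ["Count_Person"])
```

Keeping every researcher row, even when no value was retrieved, makes gaps in coverage visible instead of silently shrinking the table.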

The result is a simple web interface that can provide data for most Data Commons statistical variables to researchers at any level, including non-technical users.

Project Status

The PHS-Data Commons project prototype will be under development at least through Spring 2021 to increase its capabilities and its ease of use. Given additional funding, we will make it more powerful and even easier to use. We also want to begin designing similar technologies to retrieve data from the Biomedical Data Commons graph.

Instructions

In the first column, enter the location type you want to use as your index variable, and the specific locations for which you want to retrieve data. When entering specific locations, you can enter them as individual locations of that type (follow the format of the example under the entry window), or you can choose to get data for all the locations in one or more states. Autocompletion is provided when there are a small number of entities.

Be aware that a Statistical Variable may not have data in every location type, or in every individual location within a location type. To save time finding appropriate location types for viewing your Statistical Variable(s), you can view an availability table showing the available location types for all the Statistical Variables. The summary page provides additional information about the table and how to use it.

Once you have entered locations and Statistical Variables, you can request a data file containing the requested data ('Download Data'), and/or R code to let you work with the data ('Generate R Code'). R code is provided for two use cases: integrating data from the provided file with your data, or making requests for the data directly to the Data Commons REST APIs. In either case, you will need to initialize the R libraries before the first time you use the code.

Some options are provided to configure the returned data. Click on the gear menu in the third column to see the latest options.

Providing feedback

Three options are provided for your feedback. You can enter a ticket in the Wizard GitHub issues, use the Feedback link in the lower right of the Wizard, or send email to jgraybeal (at) stanford.edu. You can visit the Wizard's GitHub repository by clicking on the GitHub icon in the upper right of the third column.

phs-gdc's Issues

transform (crosswalk) data into common location type

Is your feature request related to a problem? If so please describe.
Data from the same area might be indexed only by different location types. Issue #28 talks about how such data could be found and returned, but the user might need the data organized according to a particular location type.

Describe the solution you'd like
Data that is organized by location types other than the primary one can be transformed (with suitable statistics if necessary) to fit into the primary location type items that are requested. Mechanisms must be in place to cope with partial overlaps.
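
One way to cope with partial overlaps is to apportion each source value across the target locations it touches. The sketch below is an assumption about how this could work, not a description of the Wizard: the crosswalk fractions, location names, and the function `crosswalk_counts` are all hypothetical, and the approach is only appropriate for count-type variables (rates or medians would need a weighted average instead).

```python
from collections import defaultdict

# Hypothetical crosswalk: (source zip, target county, fraction of the zip's
# population that falls in that county). Fractions for one zip sum to 1.
crosswalk = [
    ("Z1", "CountyA", 1.0),
    ("Z2", "CountyA", 0.25),
    ("Z2", "CountyB", 0.75),
]

# Hypothetical counts indexed by the source location type.
counts_by_zip = {"Z1": 100, "Z2": 40}


def crosswalk_counts(values, weights):
    """Apportion count-type values from source zones to target zones."""
    totals = defaultdict(float)
    for source, target, fraction in weights:
        if source in values:
            totals[target] += values[source] * fraction
    return dict(totals)


counts_by_county = crosswalk_counts(counts_by_zip, crosswalk)
```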

Describe alternatives you've considered
None, aside from the current model, in which the user does the transformation.

Additional context
This will be especially useful when the original location type is much larger than the location types for the other found data, but can also be appropriate if the measured variables tend to have limited variation within their locations.

Mathew: PHS researchers have to crosswalk between different things; e.g., OPTUM changes every year. He has to build his own crosswalk file (the PHS crosswalk from zip code to counties is out of date and missed 6000 zip codes).

[Dec 20 2020 interview with PHS researchers]

The DC stat/set endpoint is missing the provenanceDomain field

The DC team would need to add the missing field

semantic expansion or browsing

Is your feature request related to a problem? If so please describe.
Users don't know the 'right' words to find what they want.

Describe the solution you'd like
Some combination of:

Expand user-typed requests to other similar terms that are used in the DC.
Support broader/narrower choices.
Offer navigation through related semantic concepts
Structure the existing Statistical Variable semantics to be more navigable

Describe alternatives you've considered
A separate 'semantic assistant' that would let the user explore the semantic mappings outside of the Wizard.

Additional context
Some of this is a bit researchy, but leverages the insights we have from BioPortal and our other recent proposals. The more advanced capabilities could get the Visionary tag.

Complete phase 1 MVP release

Complete the features needed to declare phase 1 (MVP) done, and release them to public site.

make content availability feedback more advanced

Is your feature request related to a problem? If so please describe.
Issue #16 describes content availability but can only handle simple availability indications.

Describe the solution you'd like
Fluid guidance to users based on the information they've added so far.

If Variable is initial search constraint, focus feedback on location types available
Provide immediate searching insights or hints, if no data is found or related data can be identified (see also semantic expansion features)
Provide query modification tips based on data availability
Provide deeper (more detailed) feedback on the availability of content
Find similar content based on requested data topics

Describe alternatives you've considered
The nature of the advanced feedback will depend heavily on the nature of other characteristics of the Wizard.

Additional context
Add any other context or screenshots about the feature request here.

add location type census tract

Add census tract to the supported location types.

(AACES: M Bondy, E Peters, A Lawson)

add datasets for genetic data (?)

What kind of genetic data is available?

(AACES: J Schildkraut)

create "how to use this tool" text

Write up introduction to using the Wizard and figure out where to put it (front of the Readme?)

add datasets addressing social determinants of health

(AACES: E Peters)

Add datasets for Community Multiscale Air Quality (CMAQ) Ozone and PM 2.5

What type of need does this dataset address?
Local air quality data indexed by census tract (not zip code)

If there is a specific dataset that meets this need, please specify it.

From Andrew Lawson: The EPA stores CMAQ estimates for daily ozone and PM 2.5 at various scales. They have data files at https://www.epa.gov/hesc/rsig-related-downloadable-data-files#faqsd. Most of the files are for 12 km grids, but they also have (or had) census tract (CT) estimates.

Looks like the best site per first comment below will be https://www.caces.us/data.

Generally speaking EPA and other US government data is freely available.

Describe other options for this kind of data

If these are not now available then they can be reconstructed from gridded data. I am checking on the availability currently.

Additional context

Virtually all our spatial work is at CT or block group (BG) level. Zip codes can change quite a lot, so they are not used. In the 2010 census the zip codes were matched to the CTs and ZCTAs were produced. However, it is simpler to stick with CTs and BGs in general. FIPS codes can be used to locate the CTs or BGs.
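
To illustrate how FIPS codes locate a CT or BG: a census tract GEOID is 11 digits (2 for the state, 3 for the county, 6 for the tract), and a block group GEOID appends one more digit. The sketch below splits such a code into its parts. The function name `parse_geoid` is hypothetical, and the tract digits in the example are illustrative rather than a reference to a specific real tract.

```python
def parse_geoid(geoid):
    """Split a census tract (11-digit) or block group (12-digit) GEOID."""
    if not geoid.isdigit() or len(geoid) not in (11, 12):
        raise ValueError(f"expected an 11- or 12-digit GEOID, got {geoid!r}")
    parts = {
        "state": geoid[0:2],    # state FIPS code
        "county": geoid[2:5],   # county FIPS code within the state
        "tract": geoid[5:11],   # tract code within the county
    }
    if len(geoid) == 12:
        parts["block_group"] = geoid[11]  # single digit within the tract
    return parts


tract = parse_geoid("06085511100")
block_group = parse_geoid("060855111001")
```

Codes should be handled as strings, not integers, so that leading zeros (as in state code 06) are preserved.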

look up existing 'Google partnership'

Joellen Schildkraut suggests meeting with Google via Scarlett Gomez. (not sure the exact project involved)

create workflow for new dataset requests

Work with PHS and Google to create a workflow for new requests for datasets.

Find any existing data set needs

Reach out to centers or people PHS knows to make this list more complete.

A list may exist from previous exercises.

add datasets w/ environmental data by census tract or block group

Want environmental data aggregated by more local information, like census tract or block group.

(AACES: T Lawson)

add datasets for PM2.5

PM2.5 is particulate matter 2.5 microns or smaller. There is already one data set with this; is it enough?

(AACES: A Lawson)

Verify no data left behind

Is your feedback related to a problem? If so please describe.
How can we be sure all the DC data is made accessible?

Describe the idea you have or solution you'd like to see
Somehow we need to verify that all the data in the graph has been represented in the Wizard(s).

Describe alternatives you've considered

Really good triple queries that find everything.
Not worrying about it.

Additional context
This seems like a conceptually hard problem to do with certainty (especially in a repeatable way).

Add datasets for local water quality

What type of need does this dataset address?
Information about possible contaminants, localized to specific areas

If there is a specific dataset that meets this need, please specify it.
Describe the following dataset details as best you can.
Name: tbd
IRI: tbd
Distribution provider: USGS?
Publishing authority (if different than provider): USGS?
License (URL where declared, or explanation of why you think it is publicly available): public US gov. data

The ideal dataset would be localized to census tract or better. Andrew Lawson writes:

I haven’t looked into water-based pollutants yet, but we may be able to get radiation in well water (USGS). I'll check.

Describe other options for this kind of data
If you don't have a specific data set in mind, provide a clear and concise description of any alternative datasets or sources you've considered.

Additional context
Add any other context about the dataset request here. In particular, if you are an expert on this dataset, or can recommend someone who is, please make that clear.

add location type ZCTA

Make ZIP Code Tabulation Areas supported location types.

'ZIP Code Tabulation Areas (ZCTAs) are generalized areal representations of United States Postal Service (USPS) ZIP Code service areas'

(AACES A. Lawson)

R interface returning structured data

Is your feature request related to a problem? If so please describe.
Some data may be organized structurally, e.g., as an array of observations, or some requests may be fulfilled with a structured collection of data.

Describe the solution you'd like
Consider designing a rich R interface that returns highly structured data according to a well-defined standard.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context

Josef showed some early conceptual drawings of this idea. Not clear if it is immediately useful, but some of the ideas may be integrated in the developed R code.

temporal search constraints

Is your feature request related to a problem? If so please describe.
Some users only want data if it's in a specific time range.

Describe the solution you'd like
An interface to constrain the data discovery according to its creation time, e.g., "Only data after xxx / before yyy"
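
A time constraint of this kind amounts to a filter over dated observations. The following is a minimal sketch that assumes each observation carries a full ISO 8601 date string; the record layout, the values, and the function name `within_range` are hypothetical.

```python
from datetime import date

# Hypothetical observations; dates and values are illustrative only.
observations = [
    {"date": "2015-01-01", "value": 1.0},
    {"date": "2018-06-30", "value": 2.0},
    {"date": "2020-12-31", "value": 3.0},
]


def within_range(rows, after=None, before=None):
    """Keep rows dated on or after `after` and on or before `before`."""
    kept = []
    for row in rows:
        observed = date.fromisoformat(row["date"])
        if after is not None and observed < after:
            continue
        if before is not None and observed > before:
            continue
        kept.append(row)
    return kept


recent = within_range(observations, after=date(2018, 1, 1))
bounded = within_range(observations, after=date(2016, 1, 1), before=date(2019, 1, 1))
```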

Describe alternatives you've considered
Incorporating the request into the content availability feedback. (This would likely be quite hard.)

Additional context
Add any other context or screenshots about the feature request here.

automatically add new DC data

Is your feature request related to a problem? If so please describe.
If new Statistical Variables become available, we can't be changing the code constantly to include them.

Describe the solution you'd like
An offline process needs to scour the graph for new Statistical Variables and add them to the appropriate UIs.
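
At its core, such an offline process compares the latest list of Statistical Variables against the list the UI already knows. The sketch below shows only that comparison, under the assumption that both lists are available as plain names; how the latest listing would be obtained from the graph is not covered, and the function name `find_new_variables` is hypothetical.

```python
def find_new_variables(known, current):
    """Return variables present in the latest listing but not yet in the UI."""
    return sorted(set(current) - set(known))


# Illustrative lists; a real run would load these from the UI configuration
# and from a fresh listing of the graph.
known_variables = ["Count_Person", "Median_Age_Person"]
latest_listing = ["Count_Person", "Median_Age_Person", "Count_Household"]

new_variables = find_new_variables(known_variables, latest_listing)
```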

Describe alternatives you've considered

Or, we could use a foolproof notification process for new Statistical Variables. But this seems fraught.

Additional context
If the new Statistical Variable has a new location type, then we have a different level of problem.

add location type census block group

Add census block group to the supported location types. (Wanted to support neighborhood-centric queries.)

(AACES: E Peters)

highly-faceted searching

Is your feature request related to a problem? If so please describe.
Users want to be able to select datasets quickly based on various facets that describe them.

Describe the solution you'd like
Per Mathew: something similar to the IPUMS interface, showing where the data comes from and the years available.

It would be great to combine that with the census geospatial data.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

[Dec 10 2020 interview with PHS researchers]

add datasets for historical data (n.b. EPA)

EPA data is available at the census tract level and goes back 20 to 30 years.

Note that A Lawson's request for PM2.5 data (#12), which exists in Data Commons, may be related.

(AACES: J Schildkraut)

show content availability for current choices

Is your feature request related to a problem? If so please describe.
Can't tell whether any content is available for the chosen combination of variable, location type, and location items.

Describe the solution you'd like

Any time the user updates the location type, location list items, or selected variables, an indication of data availability should be present for all the statistical variables.
- The minimum is that each statistical variable shows whether data are available for that variable given the current settings; better is to show the percentage of the specific location items that have that variable.
- In any case, the answer must be available by location type, and could also be shown by location items (e.g., each zip code).
If variable(s) are entered before the location type, show the availability (percentage) across all the location types.
- This lets the user choose the best location types for their variables.
- This is also useful, as a secondary priority, if the location type has been specified.
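
The percentage described above can be computed directly once the retrieved values are keyed by location. This is a minimal sketch under that assumption; the data layout, the values, and the function name `availability_percent` are hypothetical.

```python
def availability_percent(locations, data, variables):
    """Return, per variable, the percentage of locations that have a value."""
    result = {}
    for var in variables:
        have = sum(1 for loc in locations if var in data.get(loc, {}))
        result[var] = 100.0 * have / len(locations) if locations else 0.0
    return result


# Illustrative request: four zip codes, two variables, partial coverage.
requested = ["94305", "94040", "94301", "94022"]
retrieved = {
    "94305": {"Count_Person": 1, "Median_Age_Person": 2},
    "94040": {"Count_Person": 3},
    "94301": {"Count_Person": 4},
}

availability = availability_percent(
    requested, retrieved, ["Count_Person", "Median_Age_Person"]
)
```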

Describe alternatives you've considered

Best way to present multivariate availability (in variable x location types and variable x location items cases) is unclear.
- There are fewer location types so showing those horizontally for each variable could work
- If the entry interface were a blank spreadsheet with locations entered in column 1 and variables in row 1, availability could be displayed on the fly. Ditto if location types were the top few items in column 1 of the blank spreadsheet.
Not displaying or graying out the variables that do not have data available for the location type. (This could be a configuration.)
How many rows of data are available for a given location type is a poor proxy for the percentage, but much better than just a check mark
Not updating the availability whenever location information changes (because that could be expensive).

Additional context

When time ranges are introduced, availability should reflect the given time range as well.
More complex/subtle availability controls will be needed eventually (e.g. so I can see if more data is available in another location type).

Add datasets for school district 'slices'

What type of need does this dataset address?
To perform analyses on specific aspects within a school district, e.g., data for particular school types (high school, middle, elementary), or school demographics.

If there is a specific dataset that meets this need, please specify it.
Describe the following dataset details as best you can.
Name:
IRI:
Distribution provider:
Publishing authority (if different than provider):
License (URL where declared, or explanation of why you think it is publicly available):

Describe other options for this kind of data
User Mathew suggests there are data sets that target locations narrower than school districts, e.g., using school demographics. Q: Is it provided that way, or do we need to infer that info somehow? K-12 level has in-district segregation, district info is already aggregated. Examples are report codes on college acceptance, and more sophisticated placement data.

Additional context
Add any other context about the dataset request here. In particular, if you are an expert on this dataset, or can recommend someone who is, please make that clear.

[Dec 10 2020 PHS researcher feedback—Mathew]

See Data Availability

Changes needed for users to see what data is available given their current settings.

enhance available metadata

Is your feature request related to a problem? If so please describe.
FAIR data requires that metadata be reasonably complete about the source processes and entities that produced the data, as well as other contextual information that may be essential to interpreting the data. Researchers need to be able to point to the source of the data and to verify the data coming from that source.

Describe the solution you'd like
Rich metadata about each variable's source should be captured by the Data Commons and made accessible via the wizard. Even if metadata changes from one value to the next, this should be reflected as well. Metadata should be provided as well-defined JSON-LD. Metadata sources should be identified with IRIs wherever possible, and controlled terms should be used whenever possible to describe the values.
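
As an illustration of what well-defined JSON-LD for a variable's source might look like, the sketch below builds a record using schema.org terms (`Dataset`, `name`, `url`, `license`, `publisher`, `variableMeasured`). The choice of schema.org as the vocabulary is an assumption, and every value is a placeholder rather than real metadata from the Data Commons.

```python
import json

# All values below are placeholders; only the schema.org terms are real.
variable_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example source dataset",
    "url": "https://example.org/datasets/example",  # IRI identifying the source
    "license": "https://example.org/license",
    "publisher": {"@type": "Organization", "name": "Example Agency"},
    "variableMeasured": "Count_Person",
}

serialized = json.dumps(variable_metadata, indent=2)
```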

Describe alternatives you've considered
If the data is to be used for research conclusions, providing only simple metadata is insufficient for publication. It can be acceptable for exploration, however.

Additional context
Dataset-level metadata should be consistent with DataCite expectations, and is sufficient if it describes all the variables within that dataset and the dataset is consonant with the Statistical Variable.

support multiple location types simultaneously

Is your feature request related to a problem? If so please describe.
Some users will want any data that is relevant, no matter the location type. Some will want data only for certain location types (but more than one).

Describe the solution you'd like
Allow users to specify areas using either location items of a particular type or a map selection, while finding and returning data in that area that is indexed by other location types (either all other types, or selected other types). This option would be a configuration setting, and would likely need sub-settings to control whether other data locations have to be wholly within the chosen area or can overlap it.

Describe alternatives you've considered

Merging data of different location types into a common location type is another (even harder) task (#29).
Use ACS Fact Finder style interface to let you navigate to the regions you want to specify [Mathew]

Additional context
This is a key contribution to making the whole experience frictionless.

Delete variables with no data?

Is your feedback related to a problem? If so please describe.
Some 13% of Statistical Variables have no data in the 4 location types we implemented.

Describe the idea you have or solution you'd like to see
Should we not even bother showing those Statistical Variables?

Describe alternatives you've considered
A clear and concise description of any alternative ideas or features you've considered.

Additional context
This will be fixed by virtue of #16, and will be kind of a pain to implement.

timeline capability

Is your feature request related to a problem? If so please describe.
Can't access all data over time (only most recent)

Describe the solution you'd like
An "all data over time" configuration selection could be effective

Describe alternatives you've considered

Select range of time for which data is desired (future)
Select desired frequency (further in future)

Additional context
Would probably want to know how much data is available within the time range for each variable

tweak UI variable language

In the UI, we have a little confusion stemming from word choices. This is fairly easily fixed, here is my suggestion:

Select Index Variable -> Select Locations
Select the variable that will link your data to Data Commons -> Select locations to link your data to Data Commons
Variable -> Location Type
Select a variable from the list -> Select a location type from the list
Variable values -> Entering location values
Enter values by hand -> Enter location values by hand
Use all values from selected locations -> Use all values from selected region (US states)
Create box label (for entering values by hand): 'Location list'
Change box label (for using all values from selected location): 'Location' -> 'Selected states'
Change default label inside drop-down menu: Location -> States
Select locations (US states) -> Select region (US states)

Arguably it will be clearer the first time if there is no default radio button selected, and the box at the bottom isn't there. But it's mildly tricky to make obvious that the next step is to choose one of the radio buttons. (Maybe if we want to remove the box, then make the header of the radio selector section "Entering location values (choose one)"?)

Also, in box 2 suggest changing the following:

Select DC Variables -> Select Data Variables
Select the Data Commons variables you want to retrieve values from -> Select Data Commons variables to retrieve values from
Search for variables by name or browse the hierarchy -> Select from the hierarchy or begin typing variable name
Delete 'Note that' , begin with 'Many variables…'

Map returned data

Is your feature request related to a problem? If so please describe.

Provide the ability to view returned data in a map.

Describe the solution you'd like

A button in the third window that says "Preview Map". Selecting the button produces a list of the statistical variables that will be returned, a selector for each to request a map for that variable, and another selector (defaults to false) for the whole window that says 'Availability Map' (if true, all maps just show where data is available). Values in each area are color coded along a spectrum that goes from minimum to maximum returned value. The map fills in contoured areas based on the location type. Multiple maps can be requested in one run. Map can be scaled a lot to increase size/focus on limited area; default scaling fits the scale of the returned data (smushing AK and HI into a corner somewhere). Mousing over the map shows the value under the mouse pointer. Supports numeric and CV term value types (includes boolean).
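
The color coding along a minimum-to-maximum spectrum can be sketched as a linear interpolation between two end colors. The end colors, the place names, and the function name `value_to_color` below are arbitrary choices made for illustration, not part of any existing design.

```python
def value_to_color(value, vmin, vmax, low=(255, 255, 204), high=(128, 0, 38)):
    """Linearly interpolate an RGB hex color for `value` between vmin and vmax."""
    if vmax == vmin:
        t = 0.0  # all values equal: use the low end of the scale
    else:
        t = (value - vmin) / (vmax - vmin)
        t = min(1.0, max(0.0, t))  # clamp out-of-range values
    rgb = tuple(round(lo + t * (hi - lo)) for lo, hi in zip(low, high))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Illustrative returned values for three areas.
values = {"CountyA": 10.0, "CountyB": 55.0, "CountyC": 100.0}
vmin, vmax = min(values.values()), max(values.values())
colors = {place: value_to_color(v, vmin, vmax) for place, v in values.items()}
```

Categorical (CV term) and boolean values would need a discrete palette instead of an interpolated one.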

Describe alternatives you've considered

Automatically provide the map in a separate window from the data.
Allow assignment of colors to range of values.
Date types? (can't think of a use case)

Additional context

This is primarily a data validation tool, to see if the data looks reasonable for the calculations. However, it can also be used to produce print-ready maps suitable for publication, e.g., to show data coverage or comparative data. In particular, it could be used to show coverage of a variable as it is chosen.

select data from preferred data sources

Data from Data Commons can come from multiple resources. Provide mechanisms to let users choose which of those resources are used for the returned data.

Selections might be made in advance, e.g., via prioritization of data sources to use for a particular request, or for all requests, from the list of all data sources (or at least the ones used to service the current request).

Or the user could be prompted once the data is retrieved, though this requires returning all the data from all sources.
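
The first option, prioritizing sources in advance, can be sketched as picking the observation whose source ranks highest in the user's preference list. The record layout, the source names, and the function name `pick_preferred` are hypothetical.

```python
def pick_preferred(observations, preferred_sources):
    """Return the observation from the highest-priority source, or None."""
    rank = {source: i for i, source in enumerate(preferred_sources)}
    candidates = [obs for obs in observations if obs["source"] in rank]
    if not candidates:
        return None
    return min(candidates, key=lambda obs: rank[obs["source"]])


# Illustrative observations of one variable at one place, from three sources.
observations = [
    {"source": "source-b.example", "value": 10},
    {"source": "source-a.example", "value": 12},
    {"source": "source-c.example", "value": 11},
]

chosen = pick_preferred(observations, ["source-a.example", "source-b.example"])
```

Returning `None` when no preferred source has data leaves it to the caller to decide whether to fall back to another source or report a gap.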

(AACES: M Bondy, A Lawson)

Create an 'available data' table

Is your feedback related to a problem? If so please describe.
We need a way for users to look up what data is available for what location types, and to link to it from the DCW.

Describe the idea you have or solution you'd like to see
Create a Google Sheet showing the status relationships between Statistical Variables and location types. As a first cut, just showing whether there are any variables accessed by a given location type is sufficient.

Describe alternatives you've considered
Expressing the relationship as a percentage (how many of the location instances have data values for a given Statistical Variable and location type, versus how many location instances there are) would be icing on the cake.

Additional context
Don't get distracted by the perfect or long-term solution; having the basic filled-out table is plenty good enough for now. Pass along to Marcos any lessons that could apply to the long-term in-app solution, though. (See issue #16)

access non-geospatial data (e.g., biomedical)

Is your feature request related to a problem? If so please describe.
There's data in the DC that isn't registered to a geospatial location—all the biomedical data are in this category.

Describe the solution you'd like
We need a way to see that non-geospatial data and bring it back.

Describe alternatives you've considered
This probably needs to be managed from a different UI; the current UI doesn't make sense for this data.

Additional context
The fact that much of this data is essentially triples, in many cases replicating BioPortal data (loosely speaking), means we have to think about what the use cases/problems are that we're trying to solve.

add datasets focused on neighborhood resources

Add more data sets that address local access (neighborhood level) to resources (healthy food, transportation, parks, etc.)

Suggests the need for census block group data type support (#6).

Mapping with ArcGIS

Add the ability to work with the data in ArcGIS.

This could mean:

converted to ArcGIS format (doesn't add much value, ArcGIS groks CSV)
retrieved data viewable in the Wizard as a map (doesn't have to be done with ArcGIS; see #22)
supporting mapping in the PHS environment (wouldn't affect the Wizard)

(Mentioned by M Bondy on several occasions. Also AACES-discussed.)

Census Locations

Adding key Census locations to the DCW location types
