
icees-api's Introduction

Test status via GitHub Actions

How to run

Run docker compose

Run tests

test/test.sh
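A minimal sketch of those two steps, assuming they are run from the repository root with Docker and docker-compose installed:

# start the services in the background
docker-compose up --build -d

# run the test suite
test/test.sh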

Deployment

Deployment of ICEES API services has been migrated to the Kubernetes infrastructure as part of the translator-devops repo. The Helm charts for deploying different instances of ICEES API services are detailed in that repo's README file. An updated Docker image is built for each new release and is pulled automatically when ICEES API services are deployed by the Helm charts as part of the Kubernetes infrastructure. The subsections below document how to update configurations and code to build a Docker image for automated Kubernetes deployment of the services.

Edit schema

ICEES API allows you to define a custom schema. The schema is stored at config/features.yml. Edit it to fit your dataset.

ICEES API has the following assumptions:

  • Each table named <table> should have a column named <Table>Id as the identifier, where <Table> is <table> capitalized. For example, for the table patient, the id column is PatientId.
  • Each table has a column named year.

These columns do not need to be specified in features.yml.
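As a purely hypothetical illustration of these conventions (this command is not part of the repo), a minimal patient table could be created like this; every column other than PatientId and year is just an example feature borrowed from later in this README:

sqlite3 example.db "CREATE TABLE patient (PatientId TEXT, year INTEGER, AgeStudyStart TEXT, ObesityBMI INTEGER);"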

Data for database

Data for the SQLite database file named example.db is created in a separate icees-db repo. The path to the created SQLite database is set by the DB_PATH environment variable in .env.

Start services

The .env file contains environment variables that control the services. Edit it to fit your application. A hypothetical example follows the list of variables below.

ICEES_PORT: the database port in the container

ICEES_HOST: the database host in the container

ICEES_API_LOG_PATH: the path where logs are stored on the host

ICEES_API_HOST_PORT: the port on which icees api listens on the host

OPENAPI_TITLE: the title for the OpenAPI schema (default "ICEES API")

OPENAPI_HOST: the host where icees api is deployed

OPENAPI_SCHEME: the protocol (e.g. http or https) over which icees api is served

OPENAPI_SERVER_MATURITY: the server maturity (i.e. 'development' or 'production')

DB_PATH: the path to the SQLite database file on the host

CONFIG_PATH: the directory where schema is stored

ICEES_API_INSTANCE_NAME: icees api instance name

ICEES_INFORES_CURIE: ICEES instance identifier (see https://docs.google.com/spreadsheets/d/1Ak1hRqlTLr1qa-7O0s5bqeTHukj9gSLQML1-lg6xIHM)
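A hypothetical .env sketch covering the variables above; every value is illustrative only, not a required or recommended setting:

ICEES_HOST=localhost
ICEES_PORT=5432
ICEES_API_LOG_PATH=./logs
ICEES_API_HOST_PORT=8080
OPENAPI_TITLE="ICEES API"
OPENAPI_HOST=localhost
OPENAPI_SCHEME=http
OPENAPI_SERVER_MATURITY=development
DB_PATH=./example.db
CONFIG_PATH=./config
ICEES_API_INSTANCE_NAME=icees-api-example
ICEES_INFORES_CURIE=infores:icees-example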

run

docker-compose up --build -d

Build Container

docker build . -t icees-api:0.4.0

REST API

features schema

A feature qualifier limits the values of a feature.

<operator> ::= <
             | >
             | <=
             | >=
             | =
             | <>

<feature_qualifier> ::= {"operator":<operator>, "value":<value>}
                      | {"operator":"in", "values":[<value>, ..., <value>]}
                      | {"operator":"between", "value_a":<value>, "value_b":<value>}

There are two ways to specify a feature or a set of features, using a list or a dict. We show the schema for the former first, then show the schema for the latter.

<feature> ::= {
    "feature_name": "<feature name>",
    "feature_qualifier": <feature_qualifier>
    [,"year": <year>]
  }

where

feature name: see config/features.yml

year is optional. When year is specified, features from that year are used; otherwise the year is taken from context.

Example:

{
  "feature_name": "AgeStudyStart",
  "feature_qualifier": {
    "operator": "=",
    "value": "0-2"
  }
}

<features> ::= [<feature>, ..., <feature>]

Example:

[{
  "feature_name": "AgeStudyStart",
  "feature_qualifier": {
    "operator": "=",
    "value": "0-2"
  }
}, {
  "feature_name": "ObesityBMI",
  "feature_qualifier": {
    "operator": "=",
    "value": 0
  }
}]

In the APIs that allow aggregation of bins, we can specify multiple feature qualifiers for each feature.

<feature2> ::= {
  "feature_name": "<feature name>",
  "feature_qualifiers": [<feature_qualifiere>, ..., <feature_qualifier>]
  [,"year": <year>]
}

Example:

{
  "feature_name": "AgeStudyStart",
  "feature_qualifiers": [
            {
                "operator":"=",
                "value":"0-2"
            }, {
                "operator":"between",
                "value_a":"3-17",
                "value_b":"18-34"
            }, {
                "operator":"in", 
                "values":["35-50","51-69"]
            }, {
                "operator":"=",
                "value":"70+"
            }
  ]
}

Similarly, for a set of features:

<features2> ::= [<feature2>, ..., <feature2>]

Example:

[{
  "feature_name": "AgeStudyStart",
  "feature_qualifiers": [
    {
      "operator":"=",
      "value":"0-2"
    }, {
      "operator":"between",
      "value_a":"3-17",
      "value_b":"18-34"
    }, {
      "operator":"in", 
      "values":["35-50","51-69"]
    },{
      "operator":"=",
      "value":"70+"
    }
  ]
}, {
  "feature_name": "EstResidentialDensity",
  "feature_qualifiers": [
    {
      "operator": "<",
      "value": 1
    }
  ]
}]

in and between are currently only supported in <feature2>.

Now we turn to defining a feature or a set of features using a dict.

<feature> ::= {"<feature name>": <feature_qualifier>} 
<features> ::= {"<feature name>": <feature_qualifier>, ..., "<feature name>": <feature_qualifier>}
<feature2> ::= {"<feature name>": [<feature_qualifier>, ..., <feature_qualifier>]} 
<features2> ::= {"<feature name>": [<feature_qualifier>, ..., <feature_qualifier>], ..., "<feature name>": [<feature_qualifier>, ..., <feature_qualifier>]}
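A hedged example of the dict form, posted as a cohort definition; the host, port, and route are carried over from the curl examples later in this README, and it assumes the dict form is accepted wherever <features> is expected:

curl -k -XPOST https://localhost:8080/patient/2010/cohort -H "Content-Type: application/json" -H "Accept: application/json" -d '{"AgeStudyStart": {"operator": "=", "value": "0-2"}}'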

create cohort

method

POST

route

/(patient|visit)/(2010|2011|2012|2013|2014|2015|2016)/cohort

schema

<features>

get cohort definition

method

GET

route

/(patient|visit)/(2010|2011|2012|2013|2014|2015|2016)/cohort/<cohort id>

get cohort features

method

GET

route

/(patient|visit)/(2010|2011|2012|2013|2014|2015|2016)/cohort/<cohort id>/features

get cohort dictionary

method

GET

route

/(patient|visit)/(2010|2011|2012|2013|2014|2015|2016)/cohort/dictionary

feature association between two features

method

POST

route

/(patient|visit)/(2010|2011|2012|2013|2014|2015|2016)/cohort/<cohort id>/feature_association

schema

{"feature_a":<feature>,"feauture_b":<feature>}

feature association between two features using combined bins

method

POST

route

/(patient|visit)/(2010|2011|2012|2013|2014|2015|2016)/cohort/<cohort id>/feature_association2

schema

{"feature_a":<feature2>,"feature_b":<feature2>[,"check_coverage_is_full":<boolean>]}

example

{
    "feature_a": {
      "feature_name": "AgeStudyStart",
      "feature_qualifiers": [
            {
                "operator":"=",
                "value":"0-2"
            }, {
                "operator":"between",
                "value_a":"3-17",
                "value_b":"18-34"
            }, {
                "operator":"in", 
                "values":["35-50","51-69"]
            },{
                "operator":"=",
                "value":"70+"
            }
      ]
    },
    "feature_b": {
      "feature_name": "ObesityBMI",
      "feature_qualifiers": [
            {
                "operator":"=",
                "value":0
            }, {
                "operator":"<>", 
                "value":0
            }
      ]
    }
}
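As a sketch, the body above can be posted with curl like the other examples in this README; the cohort id COHORT:10 and the localhost host/port are assumptions borrowed from the Examples section, and feature_association2.json is a hypothetical file containing the body above:

curl -k -XPOST https://localhost:8080/patient/2010/cohort/COHORT:10/feature_association2 -H "Content-Type: application/json" -H "Accept: application/json" -d @feature_association2.json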

associations of one feature to all features

method

POST

route

/(patient|visit)/(2010|2011|2012|2013|2014|2015|2016)/cohort/<cohort id>/associations_to_all_features

schema

{
  "feature": <feature>,
  "maximum_p_value": <maximum p value>,
  "correction": {
    "method": <correction method>
    [,"alpha": <correction alpha>]
  }
}

where correction and alpha are optional. method and alpha are specified here: https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html
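A hedged example request that includes a correction block; the host, port, and cohort id are assumptions borrowed from the Examples section, and "bonferroni" is one of the method names accepted by statsmodels' multipletests:

curl -k -XPOST https://localhost:8080/patient/2010/cohort/COHORT:10/associations_to_all_features -H "Content-Type: application/json" -H "Accept: application/json" -d '{
  "feature": {
    "feature_name": "AgeStudyStart",
    "feature_qualifier": {"operator": "=", "value": "0-2"}
  },
  "maximum_p_value": 0.1,
  "correction": {"method": "bonferroni"}
}'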

associations of one feature to all features using combined bins

method

POST

route

/(patient|visit)/(2010|2011|2012|2013|2014|2015|2016)/cohort/<cohort id>/associations_to_all_features2

schema

{
  "feature": <feature>,
  "maximum_p_value": <maximum p value> 
  [,"check_coverage_is_full": <boolean>],
  "correction": {
    "method": <correction method>
    [,"alpha": <correction alpha>]
  }
}

where correction and alpha are optional. method and alpha are specified here: https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html

example

{
    "feature":{
        "AgeStudyStart":[
            {
                "operator":"=",
                "value":"0-2"
            }, {
                "operator":"between",
                "value_a":"3-17",
                "value_b":"18-34"
            }, {
                "operator":"in", 
                "values":["35-50","51-69"]
            },{
                "operator":"=",
                "value":"70+"
            }
        ]
    },
    "maximum_p_value": 0.1
}
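As with the other endpoints, the example body above can be posted with curl; the cohort id and host below are assumptions carried over from the Examples section, and associations_to_all_features2.json is a hypothetical file containing the body above:

curl -k -XPOST https://localhost:8080/patient/2010/cohort/COHORT:10/associations_to_all_features2 -H "Content-Type: application/json" -H "Accept: application/json" -d @associations_to_all_features2.json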

knowledge graph

method

POST

route

/knowledge_graph?reasoner=&verbose=

input parameters:

  • query_options
    • table: ICEES table
    • year: ICEES year
    • cohort_features: features for defining the cohort
    • feature: a feature, operator, and value for splitting the cohort into two subcohorts
    • maximum_p_value: ICEES maximum p value. The p value is calculated for each ICEES feature in the table, using a 2 × n contingency table where the rows are the subcohorts and the columns are the individual values of that feature. Any feature with a p value greater than the maximum p value is filtered out.
    • regex: filter target node names by regex.

If reasoner is specified, a Reasoner API response is returned.

example

{
        "query_options": {
            "table": "patient", 
            "year": 2010, 
            "cohort_features": {
                "AgeStudyStart": {
                    "operator": "=",
                    "value": "0-2"
                }
            }, 
            "feature": {
                "EstResidentialDensity": {
                    "operator": "<",
                    "value": 1
                }
            }, 
            "maximum_p_value":1
        }, 
        "message": {
          "query_graph": {
            "nodes": {
              "n00": {
                "categories": ["biolink:PopulationOfIndividualOrganisms"]
              },
              "n01": {
                "categories": ["biolink:ChemicalSubstance"]
              }
            },
            "edges": {
              "e00": {
                "predicates": ["biolink:correlated_with"],
                "subject": "n00",
                "object": "n01"
              }
            }
          }
        }
}

knowledge graph overlay

method

POST

route

/knowledge_graph_overlay?reasoner=&verbose=

input parameters:

<query_options> ::= {
                      "table": <string>,
                      "year": <integer>,
                      "cohort_features": <features>
                    }
                  | {
                      "cohort_id": <string>
                    }
{
   "query_options": <query_options>,
   "message": {
      "knowledge_graph": <knowledge_graph>
   }
}
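No example is given for the overlay endpoint, so here is a minimal, hypothetical sketch; the node identifiers, categories, and query_options values are illustrative only, and the knowledge_graph structure follows the Reasoner API message format used elsewhere in this README:

curl -k -XPOST https://localhost:8080/knowledge_graph_overlay -H "Content-Type: application/json" -H "Accept: application/json" -d '{
  "query_options": {
    "table": "patient",
    "year": 2010,
    "cohort_features": {
      "AgeStudyStart": {"operator": "=", "value": "0-2"}
    }
  },
  "message": {
    "knowledge_graph": {
      "nodes": {
        "MONDO:0004979": {"categories": ["biolink:Disease"]},
        "PUBCHEM.COMPOUND:2083": {"categories": ["biolink:ChemicalSubstance"]}
      },
      "edges": {}
    }
  }
}'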

knowledge graph one hop

method

POST

route

/query?reasoner=&verbose=

If reasoner is specified, a Reasoner API response is returned.

input parameters:

{
   "query_options": <query_options>,
   "message": {
      "query_graph": <query_graph>
   }
}
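A hedged one-hop sketch; it reuses the query graph from the /knowledge_graph example in the Examples section, so the categories, predicate, and query_options values are illustrative rather than prescriptive:

curl -k -XPOST https://localhost:8080/query -H "Content-Type: application/json" -H "Accept: application/json" -d '{
  "query_options": {
    "table": "patient",
    "year": 2010,
    "cohort_features": {}
  },
  "message": {
    "query_graph": {
      "nodes": {
        "n00": {"categories": ["biolink:PopulationOfIndividualOrganisms"]},
        "n01": {"categories": ["biolink:ChemicalSubstance"]}
      },
      "edges": {
        "e00": {"predicates": ["biolink:correlated_with"], "subject": "n00", "object": "n01"}
      }
    }
  }
}'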

Examples

get cohort of all patients

curl -k -XPOST https://localhost:8080/patient/2010/cohort -H "Content-Type: application/json" -H "Accept: application/json" -d '{}'

get cohort of all patients active in a year

curl -k -XPOST https://localhost:8080/patient/2010/cohort -H "Content-Type: application/json" -H "Accept: application/json" -d '[{
  "feature_name": "Active_In_Year",
  "feature_qualifier": {
    "operator": "=",
    "value": 1
  }
}]'

get cohort of patients with AgeStudyStart = 0-2

curl -k -XPOST https://localhost:8080/patient/2010/cohort -H "Content-Type: application/json" -H "Accept: application/json" -d '[{
  "feature_name": "AgeStudyStart",
  "feature_qualifier": {
    "operator":"=",
    "value":"0-2"
  }
}]'

Assuming we have cohort id COHORT:10

get definition of cohort

curl -k -XGET https://localhost:8080/patient/2010/cohort/COHORT:10 -H "Accept: application/json"

get features of cohort

curl -k -XGET https://localhost:8080/patient/2010/cohort/COHORT:10/features -H "Accept: application/json"

get cohort dictionary

curl -k -XGET https://localhost:8080/patient/2010/cohort/dictionary -H "Accept: application/json"

get feature association

curl -k -XPOST https://localhost:8080/patient/2010/cohort/COHORT:10/feature_association -H "Content-Type: application/json" -d '{
  "feature_a": {
    "feature_name": "AgeStudyStart",
    "feature_qualifier: {"operator":"=", "value":"0-2"}
  },
  "feature_b": {
    "feature_name": "ObesityBMI",
    "feature_qualifier": {"operator":"=", "value":0}
  }
}'

get association to all features

curl -k -XPOST https://localhost:8080/patient/2010/cohort/COHORT:10/associations_to_all_features -H "Content-Type: application/json" -d '{
  "feature": {
    "feature_name": "AgeStudyStart",
    "feature_qualifier": {"operator":"=", "value":"0-2"}
  },
  "maximum_p_value":0.1
}' -H "Accept: application/json"

knowledge graph

curl -X POST -k "http://localhost:5000/knowledge_graph" -H  "accept: application/json" -H  "Content-Type: application/json" -d '
{
        "query_options": {
            "table": "patient", 
            "year": 2010, 
            "cohort_features": {
                "AgeStudyStart": {
                    "operator": "=",
                    "value": "0-2"
                }
            }, 
            "feature": {
                "EstResidentialDensity": {
                    "operator": "<",
                    "value": 1
                }
            }, 
            "maximum_p_value":1
        }, 
        "message": {
            "query_graph": {
                "nodes": {
                    "n00": {
                        "categories": ["biolink:PopulationOfIndividualOrganisms"]
                    },
                    "n01": {
                        "categories": ["biolink:ChemicalSubstance"]
                    }
                },
                "edges": {
                    "e00": {
                        "predicates": ["biolink:correlated_with"],
                        "subject": "n00",
                        "object": "n01"
                    }
                }
            }
        }
}
'

knowledge graph schema

curl -X GET -k "http://localhost:5000/knowledge_graph/schema" -H  "accept: application/json"

How to run qc tool

The QC tool is under the qctool directory. The following commands are run from the qctool directory.

installation

pip install -r requirements.txt

running

Example:

python src/qc.py \
    --a_type features \
    --a ../config/all_features.yaml \
    --b_type mapping \
    --b ../config/FHIR_mappings.yml \
    --update_a ../config/all_features_update.yaml \
    --update_b ../config/FHIR_mappings.yml \
    --number_entries 10 \
    --similarity_threshold 0.5 \
    --table patient visit \
    --ignore_suffix Table _flag_first _flag_last 

Usage:

python src/qc.py --help

icees-api's People

Contributors

colinkcurtis, davefol, dependabot[bot], hyi, karafecho, kennethmorton, maximusunc, patrickkwang, sarah-seitanakis, sharma05, xu-hao


icees-api's Issues

"Cancel" feature

This issue relates to the API "cancel" feature, which does not appear to be working.

One year vs multi-year patient data pulls

We currently capture data on a one-year basis. However, multi-year patient data pulls may avoid issues related to conflicting birth dates, etc.

Consider options and decide on the most efficient one.

Errors in categories for KG node

Posting this:

{"message": {"query_graph": {"nodes": {"n1": {"category": "biolink:ChemicalSubstance", "is_set": false}, "n0": {"id": "MONDO:0004979", "category": "biolink:Disease", "is_set": false}}, "edges": {"e01": {"subject": "n0", "object": "n1", "predicate": "biolink:correlated_with"}}}}}'

to /knowledge_graph_one_hop

gives this node in the knowledge_graph:

"MONDO:0004979": {
            "category": [
              "biolink:D",
              "biolink:I",
              "biolink:S",
              "biolink:E",
              "biolink:A",
              "biolink:S",
              "biolink:E"
            ]
          },

Which should of course be:

"MONDO:0004979": {"category": ["biolink:Disease"]},

Provide ReasonerStd overlay/support API

We would like to use ICEES as a support edge generator using co-occurrences. The interface would be the Reasoner API. The input query graph would not be required, but there would be an input knowledge_graph, and often a list of results.

A cohort could be defined in options to the query, but will default to the universal cohort.

The service will then return a knowledge graph that is the old knowledge graph, plus new edges that represent co-occurrence between the concepts. A simple version could add such an edge between every pair of nodes in the graph; a more refined approach would only connect nodes that appear in a result together. Some of this logic may be appropriated from the robokop messenger support endpoint (paging @patrickkwang ).

I could also imagine options for things like significance cutoffs.

Transfer ICEES+ COVID-19 tables from Rockfish to ebcr0.renci.org

This issue is related to the transfer of ICEES+ COVID-19 integrated feature tables from Rockfish to ebcr0.renci.org. James will accomplish this task after his permissions have been updated.

Location: /opt/RENCI/output/icees/COVID/v1

Files with no extensions are CSV files.
The .json files contain bin values for feature variables.

Exposures bins - patient table, 2016

The purpose of this issue is to obtain bin values for the following variables (patient table, 2016): EstResidentialDensity, EstHouseholdIncome, AvgDailyPM2.5Exposure_2, AvgDailyPM2.5Exposure_qcut, MaxDailyOzoneExposure_2, MaxDailyOzoneExposure_2_qcut. This is for a JACI manuscript I am developing with Dave Peden. Target submission date is December 4, 2020.

Update to terms and conditions of use

Please replace the current text with the following (not sure if the syntax will carry over properly):

Terms and Conditions of Service": "The Translator Integrated Clinical and Environmental Exposures Service (ICEES) is providing you with Data that have been de-identified in accordance with 45 C.F.R. §§ 164.514(a) and (b) of HIPAA and that UNC Health is permitted to provide under 45 C.F.R. § 164.502(d)(2). Recipient agrees to notify UNC Health via the Renaissance Computing Institute at UNC Chapel Hill in the event that Recipient receives any identifiable data in error and to take such measures to return the identifiable data and/or destroy it at the direction of UNC Health.\n\nRestrictions on Recipient’s Use of Data. Recipient further agrees to use the Data exclusively for the purposes and functionalities provided by the ICEES service: cohort discovery; feature-rich cohort discovery; hypothesis-driven queries; and exploratory queries. Recipient agrees to use appropriate safeguards to protect the Data from misuse and unauthorized access or disclosure. Recipient will report to UNC Health any unauthorized access, use, or disclosure of the Data not provided for by the Service of which Recipient becomes aware. Recipient will not attempt to identify the individuals whose information is contained in any Data transferred pursuant to this Service Agreement or attempt to contact those individuals. Recipient agrees not to sell the Data to any third party for any purpose. Recipient agrees not to disclose or publish the Data in any manner that would identify the Data as originating from UNC Health. Finally, Recipient agrees to reasonably limit the number of queries to the Service per IP address within a given time interval, in order to prevent rapid ‘attacks’ on the Service.\n\nWe kindly request that users of this service provide proper attribution for any secondary products (e.g., manuscripts, podium presentations, software) derived from the use of ICEES. Attribution should include acknowledgement of support from the National Center for Advancing Translational Sciences, National Institutes of Health [OT3TR002020, OT2TR003430, UL1TR002489] and the Clinical Research Branch, Intramural Research Program of the National Institute of Environmental Health Sciences, National Institutes of Health [ZID ES103354-01]. Finally, please acknowledge or, if appropriate, include as co-author(s) any individual person(s) who contributed significantly to secondary products resulting from use of ICEES.\n\nFor additional information or to report issues, please contact Karamarie Fecho ([email protected]).\n

Overlay endpoint

This issue is to correct errors with the KG overlay endpoint.

Active_In_Year

We may not be calculating Active_In_Year correctly or in the most appropriate manner.

Consider alternative options and select most appropriate one.

Multivariate tables

Here are my thoughts:

  1. Without selecting Active_In_Year, one is examining all of the available data. When asking questions about a specific year, one is effectively examining only those patients who are active in that year, so assumptions about the proportion of patients who, e.g., have a diagnosis of obesity in year XXXX cannot be made on the basis of the entire cohort. Creating multivariate tables from all of the available data will maximize the sample size in the final table (well, assuming relatively complete features are included), but it is perhaps a bit non-intuitive.

  2. When Active_In_Year is selected, one is examining all patients who were active in year XXXX, so assumptions about the proportion of patients who, e.g., have a diagnosis of obesity in year XXXX are valid. Creating multivariate tables from all patients who were active in a given year will result in a smaller sample size and will not leverage, e.g., history of exposures, but it is perhaps a bit more intuitive.

I think both applications, meaning with/without Active_In_Year, have value, but the choice depends on one's goals. For the ICEES+ KP Analytics WG, I'm inclined to have them move forward with (1), but with no exclusions related to, e.g., TotalEDInpatientVisits = 0 or ObesityDx = 1. For the project I am working on with Dave, I'm inclined to focus on one year only by selecting Active_In_Year = 1.

Resolving Issues #20 and #21 and Summarizing Next Steps

Thanks for meeting yesterday, @xu-hao @cbizon @stevencox @patrickkwang. I've attempted to summarize the agreed-upon plan and next steps below. Please feel free to edit.

  1. ARAGORN or ARA calls to ICEES+ will point to the full ICEES+ dataset or a union of each available year for all available patients (years 2010-2016, with years 2017-2019 soon to be available). For feature variables that are not available for a given year, ICEES+ will treat the variable as missing for that year.
  2. ICEES+ will return all permutations for an entity - entity association defined in a call from ARAGORN or another ARA. For example, in response to a request for PM2.5 - asthma associations, ICEES+ will return AvgDailyPM2.5Exposure_cut x AsthmaDx, MaxDailyPM2.5Exposure_cut x AsthmaDx, AvgDailyPM2.5Exposure_qcut x AsthmaDx, etc.
  3. ICEES+ will return Chi Square statistics and uncorrected P values. ICEES+ also will return counts and bin values, as well as the binning function that was used for a given feature variable (i.e., pandas.cut, pandas.qcut, or custom [SME or literature-based] binning).
  4. For unbinned/unbounded feature variables such as TotalEDInpatientVisits, we will categorize. In this example, we will categorize TotalEDInpatientVisits as 0...10+. As other such variables arrive in the future, we will follow the same general approach.
  5. ARAGORN will overlay ICEES+ data as supporting evidence in its KG. Other teams may adopt a different approach.

Trouble running queries on v2.0.0

This query works from the command line but not from the Swagger interface ({}, 2.0.0, patient, 2016):

curl -X POST "https://icees.renci.org:16339/patient/2016/cohort" -H "accept: application/json" -H "Content-Type: application/json" -d "{}"

Same thing with this query (for COHORT:19):

curl -X GET "https://icees.renci.org:16339/patient/2016/cohort/COHORT%3A19/features" -H "accept: text/tabular”

This query simply does not run for me:

curl -X POST "https://icees.renci.org:16339/patient/2016/cohort" -H "accept: text/tabular" -H "Content-Type: application/json" -d "{"TotalEDInpatientVisits":{"operator":"=","value":0}}"

400 Error

Neither does this one:

curl -k -XPOST https://icees.renci.org:16339/patient/2016/cohort/COHORT:19/feature_association -H "Content-Type: application/json" -d “{"feature_a":{"Race":{"operator":"=", "value":”African American”}},"feature_b":{"AvgDailyPM2.5Exposure_StudyAvg_qcut":{"operator":"<", "value":3}}}”
'The system cannot find the file specified.' (edited)

I think there's an issue with v2.0.0 tables? Possibly also some syntax errors?

I shouldn't be able to create a cohort from 2016 (new tables), if they aren't loaded into v2.0.0, right?

Also, in my CURLS, I probably should need to specify the version, correct?

Sorry, but I'm a bit confused, and if I'm confused, then others will be, too.

I suspect that the confusion relates to our rush to put out a prototype, but we probably should address these issues sooner rather than later.

UNC Health - EPR Hash-Match

The PIDs for the COVID-19 dataset are not the same as those for the asthma dataset (i.e., the PIDs are no longer numeric, but rather some sort of hash function), so we need a new cross-walk file. Notified Emily and James on 11/05/20.

EPR YAML Configuration File

AFTER the prototype demo, perhaps we can make the following changes to the EPR data in the YAML config file:

The following feature variables should be mapped to 'gene' and 'phenotypic feature' or just 'gene'.
SNP1:
  type: string
  enum: ['A', 'C', 'B']
  biolinkType: PhenotypicFeature
SNP2:
  type: string
  enum: ['Z', 'X', 'Y']
  biolinkType: PhenotypicFeature
SNP3:
  type: string
  enum: ['E', 'F', 'D']
  biolinkType: PhenotypicFeature
SNP4:
  type: string
  enum: ['N', 'L', 'M']
  biolinkType: PhenotypicFeature

The following feature variables should be mapped to 'chemical substance':
O3_ANNUAL_AVERAGE_cut:
  type: integer
  minimum: 1
  maximum: 5
  biolinkType: Environment
O3_ANNUAL_AVERAGE_qcut:
  type: integer
  minimum: 1
  maximum: 5
  biolinkType: Environment
O3_N_OBS:
  type: integer
  biolinkType: Environment
PM25_ANNUAL_AVERAGE_cut:
  type: integer
  minimum: 1
  maximum: 5
  biolinkType: Environment
PM25_ANNUAL_AVERAGE_qcut:
  type: integer
  minimum: 1
  maximum: 5
  biolinkType: Environment
PM25_N_OBS:
  type: integer
  biolinkType: Environment

Provide ReasonerStd query API

There is an API that implements the reasoner API, but it is not a generic query api. The current API (as I understand it) takes a cohort as an option, then performs some kind of disproportionality analysis between that cohort and the set difference of the universal cohort and the input cohort, and returns edges from the cohort to the entity that is disproportionately represented.

I suggest 2 changes:

  1. The above workflow/query is quite different from the rest of the ICEES approach. In the rest of icees, you define a cohort, and then look for associations between entities within that defined cohort. I think that the query API should do the same.

  2. The machine question should be fully generalized to allow for any query, not just population to feature.

That may be a lot of work, but here are a series of iterations that would each be useful, in my opinion.

First, a query that can answer any one hop within the universal cohort
Second, extend that to taking the cohort as an option in the query
Third, extend to answering arbitrary-shaped queries
Fourth, providing partial matches to arbitrary shaped queries.

Cohort Selection

At present, ICEES contains data on asthma-like patients at UNC Health and asthma-like participants at the NIEHS EPR. We are developing, or soon will develop, additional cohorts on COVID-19, DILI, and PCD. Each cohort will be created under separate IRB-approved protocols and CDWH Oversight Committee data requests, with different data fields captured for each cohort. We'd like users of the UI to be able to select the cohort of interest. I don't think we want separate ICEES endpoints, but rather one endpoint containing tables for each cohort. I'm open to suggestions, however.

Update Terms and Conditions of Service

Please replace the existing text with the text below (ignore syntax):

"Terms and Conditions of Service": "The Translator Integrated Clinical and Environmental Exposures Service (ICEES) is providing you with Data that have been de-identified in accordance with 45 C.F.R. §§ 164.514(a) and (b) of HIPAA and that UNC Health is permitted to provide under 45 C.F.R. § 164.502(d)(2). Recipient agrees to notify UNC Health via RENCI in the event that Recipient receives any identifiable data in error and to take such measures to return the identifiable data and/or destroy it at the direction of UNC Health.\n\nRestrictions on Recipient’s Use of Data. Recipient further agrees to use the Data exclusively for the purposes and functionalities provided by the ICEES service: cohort discovery; feature-rich cohort discovery; hypothesis-driven queries; and exploratory queries. Recipient agrees to use appropriate safeguards to protect the Data from misuse and unauthorized access or disclosure. Recipient will report to UNC Health any unauthorized access, use, or disclosure of the Data not provided for by the Service of which Recipient becomes aware. Recipient will not attempt to identify the individuals whose information is contained in any Data transferred pursuant to this Service Agreement or attempt to contact those individuals. Recipient agrees not to sell the Data to any third party for any purpose. Recipient agrees not to disclose or publish the Data in any manner that would identify the Data as originating from UNC Health. Finally, Recipient agrees to reasonably limit the number of queries to the Service per IP address within a given time interval, in order to prevent rapid ‘attacks’ on the Service.\n\nWe kindly request that users of this service provide proper attribution for any secondary products (e.g., manuscripts, podium presentations, software) derived from the use of ICEES. Attribution should include acknowledgement of support from the National Center for Advancing Translational Sciences, National Institutes of Health [OT3TR002020, OT2TR003430, UL1TR002489] and the Intramural Research Program of the National Institute of Environmental Health Sciences, National Institutes of Health. Finally, please acknowledge or, if appropriate, include as co-author(s) any individual person(s) who contributed significantly to secondary products resulting from use of ICEES.\n\nFor additional information or to report issues, please contact Karamarie Fecho ([email protected]).\n"

EstResidentialDensity and ur, v2.0.0, year 2010

Something seems off with the binning for EstResidentialDensity and ur, v2.0.0, year 2010. I think they should be the same, since EstResidentialDensity = 3 (urbanized area) does not exist within this cohort? In other words, (EstResidentialDensity = 1) = (ur = R)?

(two screenshots of the bin values were attached to the original issue)

Validation error

This issue is intended to address or correct apparent 422 validation errors in the response body of ICEES API requests. For instance, I'm receiving the apparent error below, even when running a successful query. This may be nothing, but I haven't encountered the error previously.

(screenshot of the 422 validation error attached to the original issue)

Directionality

We need to find a way to return information on 'directionality' to users, in addition to general effect (i.e., Chi Square statistic, P value). In other words, not just a P value that shows that two groups differ in frequency, but also which group is, e.g., more frequently exposed versus less frequently exposed.

Formalize cohort data sources

This can be more of a discussion.

Right now, it looks like the cohort data source comes back as features inside that cohort. This means that the user will have to request all cohorts for a given table and year, and then filter from there based on what data source they want. Is this how we want it to work?

Do we want to move the data source to be a parameter in the initial call to get cohorts?

Should the data sources be on different ports?

Table versioning

I think we need to provide an option to define table versions, i.e., 1.0.0, 2.0.0, 3.0.0.

2015 and 2016 visit-level tables

AFTER prototype demo, please check the visit-level tables for calendar years 2015 and 2016. The total rows are identical (N= 1048575), which seems highly unlikely.

Fatal Error

FYI...I suspect that this relates to a TranQL thing, but I'm not sure, so I'm posting a ticket as an FYI.

(screenshot of the error attached to the original issue)

API endpoints for bin values

Email exchange on November 17, 2020:

The bin values for the other variables can be found in the YAML file that Priya created. (Note that the binning for CAFO_Distance, LandfillDistance, and PublicSchoolDistance may need modification or are TBD.)

And also here, here, here, and here.

I apologize for the disjointed email. I have been attending the AMIA 2020 Virtual Annual Symposium this week and am a bit distracted.

If a meeting would be helpful to discuss the plan and next steps, please let me know.

Thanks again,

Kara

On Tuesday, November 17, 2020, 9:48:42 PM EST, Karamarie Fecho, PhD [email protected] wrote:

Patrick, Max,

Thanks for offering to create API endpoints to expose the binning values for ICEES+ feature variables.

The relevant exposures files can be accessed here:

(1) airborne pollutant data (eight exposures) from most recent pull, 2002-2016
this can be accessed from most RENCI machines (for example vm): /projects/datatrans/new_cmaq_data/merged_cmaq_.csv*
(2) ACS socioeconomic data for both sampling periods, 2007-2011, 2012-2016
this is on rockfish: /var/fhir/other/acs/ACS_NC_2016_with_column_headers.csv has all variables except ur, which is in /var/fhir/other/acs/Appold_trans_geo_cross_02.10.10 - trans_geo_cross.csv

The bins are based on pandas.cut and pandas.qcut for (1) and pandas.qcut for (2). The bin values for the other variables can be found in the YAML file that Priya created. (Note that the binning for CAFO_Distance, LandfillDistance, and PublicSchoolDistance may need modification or are TBD.)

Priya,

Please confirm that you've updated your pull request for the YAML file. The last update that I saw was from 8 days ago.

Lisa,

I believe you have access to the files for (1) and (2). Assuming that's correct, please transfer them to Patrick and Max.

Text Acknowledgements

Suggest replacing the current acknowledgements text with the following text:

"We kindly request that users of this service provide proper attribution for any products (e.g., manuscripts, podium presentations, software) derived from the use of ICEES. Attribution should include acknowledgement of the funder, the National Center for Advancing Translational Sciences, Biomedical Data Translator Program (awards OT3TR002020 and OT2TR002514) and Center for Translational Science Award Program (UL1TR002489). Please also acknowledge the Renaissance Computing Institute, UNC Health Care System, UNC Hospitals, and any affiliated team members who contributed to the work.”

Please keep the DUA-like terms and only replace the acknowledgements text, separated by a paragraph break.

Example ICEES+ KG Queries

ICEES+ KG Query 1: What chemical substances are African Americans with asthma exacerbations exposed to greater than chance?

curl -X POST -k "https://icees.renci.org:16339/knowledge_graph" -H "accept: application/json" -H "Content-Type: application/json" -d '{
  "query_options": {
    "table": "patient",
    "year": 2016,
    "cohort_features": {
      "TotalEDInpatientVisits": {
        "operator": ">",
        "value": 1
      }
    },
    "feature": {
      "Race": {
        "operator": "=",
        "value": "African American"
      }
    },
    "maximum_p_value": 0.5
  },
  "machine_question": {
    "nodes": [
      {
        "id": "n00",
        "type": "population_of_individual_organisms"
      },
      {
        "id": "n01",
        "type": "chemical_substance"
      }
    ],
    "edges": [
      {
        "id": "e00",
        "type": "association",
        "source_id": "n00",
        "target_id": "n01"
      }
    ]
  }
}' -o output.txt

Bin Values

A somewhat pressing issue is to develop an approach for returning bin values (e.g., bin cut-off points for PM2.5) to users or to otherwise provide that documentation. Otherwise, users will need to contact you/me to obtain this info.

Use nodenormalization if/where possible

https://nodenormalization-sri.renci.org/apidocs/

This is in some places inconsistent with the lists of synonyms in config/identifiers.yml. This is true even when the different prefixes are accounted for (e.g., PUBCHEM vs. PUBCHEM.COMPOUND). In at least one case, the ICEES synonym list is actually better (it correctly collapses terms that the node normalizer erroneously separates), but we should seek to use the same synonymization process everywhere. If there are issues with the node normalizer (there are), those should be reported in the appropriate repo.

Mappings

See this file, also summarized below.

  1. Initial goal is to use the information provided here (i.e., N3C elements and additional UNC Health elements) to modify the ICEES+ API YAML file such that it supports both the existing data elements and the new ones, with variables and levels defined and with a mapping to the appropriate Biolink model entity type.

Start with N3C data elements and UNC Health elements below

For labs, we may need to use the first and last flags; we are NOT using absolute values with reference ranges, at least not initially

Note that the flags are not standardized across labs, so we may need to find a clever way or tool (e.g., LOINC2HPO?) to handle this.

Then add EPR survey data elements

Then coordinate with Chris to add EPR SNP data elements (his team is already working on this portion of the work)

ICEES+ API CONFIG FILES:

https://github.com/NCATS-Tangerine/icees-api/tree/master/config

SCHEMA EXAMPLE:

Sex:
  type: string
  enum:
    - Male
    - Female
    - Unknown
    - Other
  biolinkType: PhenotypicFeature

Biolink model:

https://biolink.github.io/biolink-model/

  2. Second goal is to create a YAML mapping file to map the ICEES+ API data elements to FHIR elements, using the schema and FHIR identifier systems provided below.

SCHEMA FOR YAML MAPPING FILE:

feature_variable_name:
  fhir_resource_name:
    system: system
    code: code1
    system: system2
    code: code2

EXAMPLE:

DiabetesDx:
  Condition:
    system: http://hl7.org/fhir/sid/icd-10-cm
    code: O24.42

IDENTIFIER SYSTEMS:

http://hl7.org/fhir/sid/icd-10-cm

http://www.nlm.nih.gov/research/umls/rxnorm

http://loinc.org

http://terminology.hl7.org/ValueSet/v3-Race

http://terminology.hl7.org/ValueSet/v3-Ethnicity

http://hl7.org/fhir/2018Sep/valueset-birth-sex.html

CODE LOOK-UP SERVICE:

https://athena.ohdsi.org/search-terms/start

  3. Third goal is to attach Translator preferred identifiers to each of the elements in the ICEES+ API YAML file, using the identifier systems provided below.

TRANSLATOR PREFERRED IDENTIFIERS:

ChEBI
ChEMBL
MONDO
LOINC
RxNORM

ICEES+ Longitudinal Queries Option

Given the clinical use case questions that we are developing here and the focus on space, time, etc., we would like to include an option for longitudinal queries. We'd have to think this through a bit, but perhaps as a first-pass effort, we could simply add a parameter that allows users to select all years or (perhaps preferred) select a year for each query parameter?

New endpoint for modified 1 x N feature associations

This issue is to create a new endpoint to support modified 1 x N feature associations that allow users to select any number of bins for the variable of interest, similar to the modified 2 x 2 feature association function.

QC check on FHIR mapping file

I did a QC check on the FHIR mapping YAML file. The results can be found here.

Briefly, I checked the mappings for 21 data elements, randomly chosen from each of the 4 major categories (lab measurements, procedures, medications, conditions). I found issues with the mappings for 3 data elements. Of those, 1 issue (blood type - measurement) was trivial, 1 issue (NIPPV - procedure) was somewhat significant, and 1 issue (supplemental oxygen - procedure) was significant.

Given that the significant issues were all procedures, and I checked all of the data elements classified as procedure, then I think we're probably in good shape. That said, we may wish to do additional QC, at least for the measurements.
