An R package for evaluating phenotype algorithms.

Home Page: https://ohdsi.github.io/PheValuator/


PheValuator


PheValuator is part of HADES.

Introduction

The goal of PheValuator is to produce a large cohort of subjects each with a predicted probability for a specified health outcome of interest (HOI). This is achieved by developing a diagnostic predictive model for the HOI using the PatientLevelPrediction (PLP) R package and applying the model to a large, randomly selected population. These subjects can be used to test one or more phenotype algorithms.

Process Steps

The first step in the process is developing the evaluation cohort.

The model is created using a cohort of subjects with a very high likelihood of having the HOI. These "noisy" positives ("noisy" in that they are very likely positive for the HOI but are not a true gold standard) are called the "xSpec" (extremely specific) cohort. This cohort is the Outcome (O) cohort in the PLP model. There are several ways to create this cohort, but the simplest is to select subjects who have multiple condition codes for the HOI in their record: a typical choice is 5 or more condition codes for acute HOIs, say myocardial infarction, or 10 or more condition codes for chronic HOIs, say diabetes mellitus. We also define a noisy negative cohort. This cohort is created by taking a random sample of the subjects in the database who have no evidence of the HOI; these are determined by creating a very sensitive cohort, in most cases subjects with 1 or more condition codes for the HOI, and excluding those subjects from the noisy negative cohort. The xSpec cohort and the noisy negative cohort are combined to form the Target (T) cohort for the PLP model. We then create a diagnostic predictive model with LASSO regularized regression using all the data in each subject's record. The covariates for this model are created using the FeatureExtraction package and include conditions, drug exposures, procedures, and measurements. The developed model has a set of features with beta coefficients that can be used to discriminate between those with and without the HOI.
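PLP handles the covariate construction and model fitting internally; purely as an illustration of the modeling idea (not PheValuator's actual implementation), a LASSO logistic regression over a subject-by-covariate matrix can be sketched with glmnet, where X and y stand in for the extracted covariates and the xSpec/noisy-negative labels:

    # Illustrative stand-in: X is a subject-by-covariate matrix (conditions,
    # drugs, procedures, measurements), y is 1 for xSpec ("noisy positive")
    # subjects and 0 for noisy negatives.
    library(glmnet)

    set.seed(42)
    X <- matrix(rbinom(1000 * 20, 1, 0.2), nrow = 1000)
    y <- rbinom(1000, 1, plogis(-2 + 2 * X[, 1] + 1.5 * X[, 2]))

    # alpha = 1 is the LASSO penalty; cross-validation selects lambda
    fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)

    # The non-zero beta coefficients are the discriminating features
    coef(fit, s = "lambda.min")

    # Applying the model yields a predicted probability of the HOI per subject
    probs <- predict(fit, newx = X, s = "lambda.min", type = "response")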

We next create an "evaluation" cohort: a large group of randomly selected subjects used to evaluate the phenotype algorithms (PAs). The subjects are selected by pulling up to 1,000,000 subjects from the dataset. We extract the same covariates as we extracted from the T cohort in the model-creation phase. We use the PLP function applyModel to apply the model to this large cohort, producing a probability of the HOI for each subject in the evaluation cohort. The subjects in this cohort, with their associated probabilities of the HOI, are used as a probabilistic "gold standard" for the HOI. We save this output for use in the next step of the process.

The second step in the process is evaluating the PAs.

The next step tests the PA(s). Phenotype algorithms are created based upon the needs of the research to be performed; every subject in the evaluation cohort should be eligible for inclusion in the cohort developed from the algorithm. The predicted probabilities of the subjects either included in or excluded from the phenotype algorithm cohort are used to evaluate the PA. To fully evaluate a PA, you need to estimate its sensitivity, specificity, and positive and negative predictive values. These are calculated from the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts, which are in turn generated from the predicted probabilities, as sketched below.
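A minimal sketch of those expected-value calculations, assuming a vector p of predicted probabilities and a logical flag inPA marking the subjects the phenotype algorithm includes (both names are illustrative):

    # Expected-value counts: predicted probabilities are summed rather than
    # thresholded. 'p' is P(HOI) per evaluation-cohort subject; 'inPA' marks
    # subjects included by the phenotype algorithm.
    evaluatePA <- function(p, inPA) {
      TP <- sum(p[inPA])        # included, weighted by P(HOI)
      FP <- sum(1 - p[inPA])    # included, weighted by P(no HOI)
      FN <- sum(p[!inPA])       # excluded but likely to have the HOI
      TN <- sum(1 - p[!inPA])   # excluded and likely without the HOI
      data.frame(sensitivity = TP / (TP + FN),
                 specificity = TN / (TN + FP),
                 ppv         = TP / (TP + FP),
                 npv         = TN / (TN + FN))
    }

    # Example: six subjects, the PA includes the first three
    evaluatePA(p    = c(0.95, 0.80, 0.40, 0.10, 0.05, 0.60),
               inPA = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))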

As an example, consider the evaluation results for opioid abuse.

The complete performance evaluation covers 5 PAs using the Expected Value approach described above, where the predicted probabilities are summed to produce the TP, FP, TN, and FN values. The full table created by the function also includes the performance characteristics at the prediction thresholds specified when running the function.

Technology

PheValuator is an R package.

System Requirements

Requires R (version 3.3.0 or higher). Installation on Windows requires RTools. Some of the packages used by PheValuator require Java.

Installation

  1. See the instructions here for configuring your R environment, including Java.

  2. In R, use the following commands to download and install PheValuator:

    install.packages("remotes")
    remotes::install_github("ohdsi/PheValuator")

User Documentation

Documentation can be found on the package website.

PDF versions of the documentation are also available:

Support

Contributing

Read here how you can contribute to this package.

License

PheValuator is licensed under the Apache License 2.0.

Development

PheValuator is being developed in RStudio.

Development status

Beta

Acknowledgements

  • The package is maintained by Joel Swerdel and has been developed with major contributions from Jenna Reps, Peter Rijnbeek, Martijn Schuemie, Patrick Ryan, and Marc Suchard.


PheValuator's Issues

SQL dialect issue - Postgresql not supported?

Hi,

I am using the PheValuator package and investigating the errors I see in errorReport.txt.

I found some dialect issues when I tried to execute the error-report query in PostgreSQL. Does PheValuator not support PostgreSQL?

For example, in the getPopnPrev.sql file I see the line below:

and ((startYear between year('@startDate') and year('@endDate')) or (endYear between year( '@startDate') and year('@endDate'))))

But for the query to work in PostgreSQL, I had to fix it as shown below (note that I added the DATE keyword):

and ((startYear between year(DATE '@startDate') and year(DATE '@endDate')) or (endYear between year(DATE '@startDate') and year(DATE '@endDate'))))

The good thing is that the error report includes the query that causes the error, which helps us locate and fix it.
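For context: OHDSI packages write their queries in SQL Server-flavored OHDSI SQL and translate them per dialect with SqlRender, so gaps like this are usually fixed in SqlRender's translation rules rather than in each .sql file. A quick way to check what PostgreSQL receives for a snippet like the one above:

    # Render the parameters, then translate the OHDSI SQL to the target dialect
    library(SqlRender)

    sql <- "SELECT YEAR(CAST('@startDate' AS DATE));"
    sql <- render(sql, startDate = "2010-01-01")
    translate(sql, targetDialect = "postgresql")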

Defaulting arguments to "" is not helpful

For example, here: if an argument is required, give it a sensible default value if one exists, or else do not give it a default value at all (e.g. xSpecCohort).

If it is not required, default it to NULL.

Currently the RStudio code issue highlighter doesn't indicate a problem if a user leaves out a required argument.
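A sketch of the convention being requested (the function and argument names come from the package; the body is illustrative):

    # Required arguments get no default, so R and RStudio's diagnostics can
    # flag a missing value; optional arguments default to NULL.
    createPhenoModel <- function(connectionDetails,
                                 xSpecCohort,                 # required: no default
                                 excludedConcepts = NULL) {   # optional: NULL default
      if (missing(xSpecCohort)) {
        stop("xSpecCohort must be specified")
      }
      # ...
    }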

Error connecting

When I try to use the EvaluatingPhenotypeAlgorithms example, I get the error below. I can't tell if this is on my end or your end.

connectionDetails <- createConnectionDetails(dbms = "postgresql",
                                             server = "localhost/ohdsi",
                                             user = "joe",
                                             password = "supersecret")

phenoTest <- PheValuator::createPhenoModel(connectionDetails = connectionDetails,
                                           xSpecCohort = 1769699,
                                           cdmDatabaseSchema = "my_cdm_data",
                                           cohortDatabaseSchema = "my_results",
                                           cohortDatabaseTable = "cohort",
                                           outDatabaseSchema = "scratch.dbo", # a database schema with write access
                                           trainOutFile = "PheVal_10X_DM_train",
                                           exclCohort = 1770120, # the xSens cohort
                                           prevCohort = 1770119, # the cohort for prevalence determination
                                           estPPV = 0.75,
                                           modelAnalysisId = "20181206V1",
                                           excludedConcepts = c(201820),
                                           cdmShortName = "myCDM",
                                           mainPopnCohort = 0, # use the entire subject population
                                           lowerAgeLimit = 18,
                                           upperAgeLimit = 90,
                                           startDate = "20100101",
                                           endDate = "20171231")

xSpecCohort 1769699
cdmDatabaseSchema my_cdm_data
cohortDatabaseSchema my_results
cohortDatabaseTable cohort
outDatabaseSchema scratch.dbo
trainOutFile PheVal_10X_DM_train
exclCohort 1770120
prevCohort 1770119
estPPV 0.75
modelAnalysisId 20181206V1
excludedConcepts 201820
addDescendantsToExclude FALSE
cdmShortName myCDM
mainPopnCohort 0
lowerAgeLimit 18
upperAgeLimit 90
gender 8507
gender 8532
startDate 20100101
endDate 20171231
Connecting using PostgreSQL driver
Error in rJava::.jcall(jdbcDriver, "Ljava/sql/Connection;", "connect", :
org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

excludedCovariateConceptIds

Hi, after reading the PheValuator 2.0 paper, it seems that no concepts need to be purposefully excluded in a PheValuator analysis now: "It also allows for use of all possible predictors in the models, particularly the health condition diagnosis codes". Based on this, I set excludedCovariateConceptIds in createDefaultCovariateSettings to NULL. However, when I look at the definition of this param in the code, I see the following description:

#' @param excludedCovariateConceptIds A list of conceptIds to exclude from featureExtraction. These
#'                                    should include all concept_ids that were used to define the
#'                                    xSpec model (default=NULL)

Was this param description not updated for PheValuator 2.0, or should we still include xSpec concept IDs in excludedCovariateConceptIds? Thanks so much!

(new feature request) PPV vs prevalence curve

PheValuator provides phenotype definition performance statistics such as sensitivity, specificity, and PPV. PPV depends on the prevalence of the phenotype in the data source. Prevalence in turn depends on the population in the data source and the observation period per person. PPV may thus vary by data source.

It would be nice to have an output on the relationship between prevalence and PPV, similar to what is described here: https://newonlinecourses.science.psu.edu/stat507/node/71/.
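For reference, the requested curve follows directly from Bayes' rule, PPV = sens * prev / (sens * prev + (1 - spec) * (1 - prev)); a minimal sketch (the sensitivity and specificity values are illustrative):

    # PPV as a function of prevalence for a fixed sensitivity and specificity
    ppvByPrevalence <- function(prevalence, sensitivity, specificity) {
      (sensitivity * prevalence) /
        (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    }

    prev <- seq(0.001, 0.5, by = 0.001)
    plot(prev, ppvByPrevalence(prev, sensitivity = 0.80, specificity = 0.95),
         type = "l", xlab = "Prevalence", ylab = "PPV")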

FAQs - PheValuator

Hello,

Thanks for creating this package which I am sure will be helpful for many users.

Though I am still learning to use this, I have a few questions, which are listed here. I have listed them like FAQs, which can definitely help others like me who are getting started.

Can you help us with this to get started?

https://forums.ohdsi.org/t/phevaluator-faqs/9706

Allow user to specify work folder

Currently the various functions write to whatever is the current working folder, which is bad form.

Add a workFolder argument to the various functions (that may well default to getwd() if you like), but at least the user can override it.

This would make it so the user doesn't have to call setwd before calling the functions (as is now prescribed in the vignette).
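A sketch of the requested pattern (runAnalysis and its body are illustrative, not the package's actual API):

    # workFolder defaults to the current directory but can be overridden,
    # so callers no longer need to setwd() before running.
    runAnalysis <- function(workFolder = getwd()) {
      resultsFile <- file.path(workFolder, "results.rds")
      saveRDS(list(), resultsFile)   # write all outputs under workFolder
      invisible(resultsFile)
    }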

Add unit tests

There are currently no unit tests in this package, possibly leaving bugs undetected. It should be possible to create unit tests using Eunomia.
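A minimal sketch of what such a test could look like, assuming Eunomia's bundled example database and its getEunomiaConnectionDetails() helper (the test body is illustrative):

    library(testthat)
    library(DatabaseConnector)

    test_that("the Eunomia example CDM is reachable", {
      connectionDetails <- Eunomia::getEunomiaConnectionDetails()
      connection <- connect(connectionDetails)
      on.exit(disconnect(connection))
      persons <- querySql(connection, "SELECT COUNT(*) AS n FROM person;")
      expect_gt(persons$N, 0)
    })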

Issue of Circularity

Hi,

I was going through all the issues in github to recap and also read your paper. Quick questions on "Circularity" and how to avoid it.

Let's say I have used a PheKB algorithm to build my xSpec cohort. Now, as we use the "excludedConcepts" parameter in the createPhenoModel function, I understand it ignores all the concepts used in the xSpec cohort definition.

a) May I kindly ask why we have to exclude the concepts used to build the xSpec and xSens cohorts? I am aware from experience that if we don't exclude them, it results in an error/warning message. But why must they be excluded? Usually xSpec cohorts involve a well-thought-out cohort definition (including the most relevant single/multiple data-domain elements) to capture highly likely positives. If we aren't going to consider these codes/data elements, won't it impact our prediction/model performance when generating labels? After all, they were chosen for the xSpec in the first place because they can clearly say whether a person is positive or negative. So why do we exclude them?

b) Let's say our clinicians came up with a new phenotype algorithm called "Magic" to identify T2DM patients.

Am I right to understand that the xSpec cohort definition always must be different from the phenotype algorithm (Magic) that we intend to assess?

c) To evaluate our phenotype algorithm ("Magic"), am I right that it's essential to have an xSpec cohort definition that performs better than "Magic"? That is, the xSpec cohort definitions are used to generate the probabilistic gold standard, so they ought to be really top-class and, in a sense, better than the phenotype algorithm (Magic) itself.

Only then can it ensure a fair assessment of our in-house phenotype algorithm ("Magic"). Right?

R code Execution error and NULL file as output instead of csv

Hi,

I had this error while executing the testPhenotypeAlgorithm function. This is what my code looks like for this function:

phenoResult <- PheValuator::testPhenotypeAlgorithm(connectionDetails = connectionDetails,
                                                   cutPoints = c("EV"),
                                                   evaluationOutputFileName = "C:/Users/test1/Desktop/DQD/Phevaluator/eval_output.rds",
                                                   phenotypeCohortId = 74,
                                                   phenotypeText = "T2DM",
                                                   modelText = ">= condes",
                                                   xSpecCohort = 104,
                                                   xSensCohort = 104,
                                                   prevalenceCohort = 104,
                                                   cdmShortName = "cdm",  # i have just given the cdm schema name here
                                                   cohortDatabaseSchema = "results",
                                                   cohortTable = "cohort",
                                                   washoutPeriod = 0)

The error message is shown in a screenshot (not reproduced here); the run produces a "NULL" file but no CSV.

Is it due to the dataset again? It can't be, because I had around 985 cases across xSpec and the noisy negatives, so the model was built and everything. There was no warning message of the kind that appears when we have a low number of cases, so I guess the count shouldn't be the problem.

Can you help me understand what's causing this issue?

No Outcomes - PLP Error

Hi,

I encountered this error with the latest package. Is it due to our data? But should that result in an error like the one below?

Error in PatientLevelPrediction::getPlpData(connectionDetails, cdmDatabaseSchema = paste(cdmDatabaseSchema, : No Outcomes

I did come across this issue in the OHDSI forum (https://forums.ohdsi.org/t/phevaluator-faqs/9706/8), but I guess your response to issue no. 16 can help me understand why this is happening.

Feature not extracted from all domains

Hello,

Earlier when I ran PheValuator, I saw that only demographic features were pulled. Based on a discussion with @jswerdel, I recently populated my era tables.

But I still don't see any features from other domains like measurements, conditions, drugs, etc.

May I kindly ask why this happens? When I inspected the code of createChronicDefaultCovariates, I saw that TRUE is set for features from the different domains. I have played around with different parameters in the function but still couldn't get the covariates from other domains. Am I missing something here? I feel this issue could be due to parameter values, because there shouldn't be any other reason why other domain features are skipped. Am I making any mistakes in configuring parameter values? Do I have to input any special parameter values based on my data characteristics listed below?

Data characteristics

a) Our dates are de-identified, meaning the chronological order is maintained but shifted into the future, like 2200, 2400, 2600, etc.
b) Not all of our 5.2k patients have visit data; only 4.7k do. Similarly for other domains: not all patients have data for all domains, but more than 80-85% of our population have data for domains like conditions, drugs, measurements, and visits. We don't have procedure data at all.
c) The observation period starts at 1900-01-01 and ends at 3900-12-30 for all patients.
d) I don't think the positive/negative case distribution matters here, as I feel the issue is not due to that.

I kindly request you to let me know if you require any more information.

Sql error - PheValuator .createEvaluationCohort Creating evaluation cohort on server from sql

Invalid operation: table name "obs2" specified more than once;

CREATE TABLE scratch.test_eval_6_d6cy5vqr
DISTKEY(SUBJECT_ID)
AS
SELECT
CAST(0 AS BIGINT) as COHORT_DEFINITION_ID, person_id as SUBJECT_ID ,
DATEADD(day,CAST(0 as int),visit_start_date) COHORT_START_DATE,
DATEADD(day,CAST(1 as int),visit_start_date) COHORT_END_DATE
FROM
(select
v.person_id, FIRST_VALUE(visit_start_date) OVER (PARTITION BY v.person_id ORDER BY MD5(RANDOM()::TEXT || GETDATE()::TEXT) ROWS UNBOUNDED PRECEDING) visit_start_date,
ROW_NUMBER() OVER (ORDER BY MD5(RANDOM()::TEXT || GETDATE()::TEXT) ) rn
from cdm.visit_occurrence v
JOIN cdm.observation_period obs
on v.person_id = obs.person_id
AND v.visit_start_date >= DATEADD(d,CAST(365 as int),obs.observation_period_start_date)
AND v.visit_start_date <= DATEADD(d,CAST(-30 as int),obs.observation_period_end_date)
join (
select person_id,
datediff(day, min(observation_period_start_date), min(observation_period_end_date)) lenPd,
min(observation_period_start_date) observation_period_start_date,
min(observation_period_end_date) observation_period_end_date,
count(observation_period_id) cntPd
from cdm.observation_period
group by person_id) obs2
on v.person_id = obs2.person_id
and v.visit_start_date >= obs2.observation_period_start_date
and v.visit_start_date <= obs2.observation_period_end_date
and lenPd >= 730
and cntPd = 1
join cdm.person p
on v.person_id = p.person_id
and EXTRACT(YEAR FROM visit_start_date) - year_of_birth >= 0
and EXTRACT(YEAR FROM visit_start_date) - year_of_birth <= 120
and gender_concept_id in (8507,8532)
join (
select person_id,
datediff(day, min(observation_period_start_date), min(observation_period_end_date)) lenPd,
min(observation_period_start_date) observation_period_start_date,
min(observation_period_end_date) observation_period_end_date,
count(observation_period_id) cntPd
from cdm.observation_period
group by person_id) obs2
on v.person_id = obs2.person_id
and v.visit_start_date >= obs2.observation_period_start_date
and v.visit_start_date <= obs2.observation_period_end_date
and lenPd >= 730
and cntPd = 1
where visit_start_date >= cast('19001010' AS DATE)
and visit_start_date <= cast('21000101' AS DATE)
and v.visit_concept_id in (9201,9202,9203,581477,262)
and datediff(day, visit_start_date, visit_end_date) >= 0
and 11*(9*(v.visit_occurrence_id/9)/11) = v.visit_occurrence_id
) negs
where rn <= cast('2000000' as bigint)
union
select 0 as COHORT_DEFINITION_ID, SUBJECT_ID, cp.COHORT_START_DATE COHORT_START_DATE,
DATEADD(day,CAST(1 as int),cp.COHORT_START_DATE) COHORT_END_DATE
from #cohort_person cp
join cdm.observation_period o
on cp.SUBJECT_ID = o.person_id
and cp.COHORT_START_DATE >= o.observation_period_start_date
and cp.COHORT_START_DATE <= o.observation_period_end_date
where rn <= 100
union
select 6 as COHORT_DEFINITION_ID, SUBJECT_ID, cp.COHORT_START_DATE COHORT_START_DATE,
DATEADD(day,CAST(1 as int),cp.COHORT_START_DATE) COHORT_END_DATE
from #cohort_person cp
join cdm.observation_period o
on cp.SUBJECT_ID = o.person_id
and cp.COHORT_START_DATE >= o.observation_period_start_date
and cp.COHORT_START_DATE <= o.observation_period_end_date
where rn <= 100

relation eligibles does not exist & invalid input syntax for integer: "2e+06" - CreateEvaluationCohort

Hi,

I encountered another error while trying to execute the createEvaluationCohort function.

Please find the error report below:

DBMS:
postgresql

Error:
org.postgresql.util.PSQLException: ERROR: invalid input syntax for integer: "2e+06"
Position: 993

SQL:

insert into temp.test_cohort0448834563139826 (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE)
 (select 0 as COHORT_DEFINITION_ID, person_id as SUBJECT_ID, (visit_start_date + 0*INTERVAL'1 day') COHORT_START_DATE,
            (visit_start_date + 1*INTERVAL'1 day') COHORT_END_DATE
      from (select  co.subject_id as person_id, co.COHORT_START_DATE as visit_start_date,
						row_number() over (order by MD5(RANDOM()::TEXT || CLOCK_TIMESTAMP()::TEXT)) rn
					from results.cohort co
					join cdm.person p
					  on co.subject_id = p.person_id
						and  EXTRACT(YEAR FROM co.COHORT_START_DATE) - year_of_birth >= -500
						and EXTRACT(YEAR FROM co.COHORT_START_DATE) - year_of_birth <= 1000
						and gender_concept_id in (8507,8532)
	#error is here			join eligibles v5 --include only subjects with a visit in their record and within date range
						on co.subject_id = v5.person_id
					where co.cohort_definition_id = 105
						
						) negs
     #error is here where rn <= cast('2e+06' as bigint)

    union
      select 0 as COHORT_DEFINITION_ID, SUBJECT_ID, o.observation_period_start_date COHORT_START_DATE,
        (o.observation_period_start_date + 1*INTERVAL'1 day') COHORT_END_DATE
      from cohort_person cp
      join cdm.observation_period o
        on cp.SUBJECT_ID = o.person_id
          and cp.COHORT_START_DATE >= o.observation_period_start_date
          and cp.COHORT_START_DATE <= o.observation_period_end_date
      where rn <= 100
      union
      select 103 as COHORT_DEFINITION_ID, SUBJECT_ID, o.observation_period_start_date COHORT_START_DATE,
        (o.observation_period_start_date + 1*INTERVAL'1 day') COHORT_END_DATE
      from cohort_person cp
      join cdm.observation_period o
        on cp.SUBJECT_ID = o.person_id
          and cp.COHORT_START_DATE >= o.observation_period_start_date
          and cp.COHORT_START_DATE <= o.observation_period_end_date
      where rn <= 100
      )

Two issues here:

  1. Why is the row number limit (rn) in exponential form? E.g. rn <= cast('2e+06' as bigint), as you can see in the query.

  2. Where is this "eligibles" relation stored? Under which schema? I don't see it under the temp schema, so when I manually tried to execute the above query, I encountered the error:

              `relation "eligibles" does not exist`

Can you help us with this?
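A likely cause of the first point (an assumption, not confirmed against the PheValuator source): R renders large numerics such as 2000000 in scientific notation when coercing them to character, so pasting the value into SQL yields "2e+06". A minimal illustration:

    # R's default coercion of large numerics uses scientific notation:
    n <- 2e6                       # the same value as 2000000
    paste0("where rn <= cast('", n, "' as bigint)")
    # [1] "where rn <= cast('2e+06' as bigint)"

    # Formatting explicitly keeps the literal in fixed notation:
    format(n, scientific = FALSE)  # "2000000"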

Question - Best approach to generate XSpec, XSens cohorts?

Hi @jswerdel,

Just to provide context, and to be useful for other users, I will start with an example as usual to avoid any confusion. Let's say I have a cohort of 5200 patients, of which 90% have T2DM; the remaining 10% are T1DM or unspecified. Please note that I don't have explicit labels as such, but we know that our cohort comprises mostly T2DM patients (more than 90%).

Dataset info

xSpec (>= 5 condition codes for T2DM): 57 subjects
xSens (>= 1 condition code for T2DM): 985 subjects
Noisy negatives: 4237 subjects (this violates the cohort characteristics described above, since we know 90% have T2DM, but let's discuss below)

Now my questions are:

a) As you can see, the above cohort definition for xSpec gives an inaccurate estimate of the T2DM patient count (positive cases) in our dataset. Who decides whether the xSpec cohort definition should be >= 5 condition codes or >= 10 condition codes? If it's going to be a clinician, they might provide me a list of rules for defining the xSpec cohort. So reaching out to a clinician is like asking them to do something like a manual chart review and arrive at rules to get a better xSpec estimate. Am I right?

Because if we wish to get an accurate estimate of T2DM (positive cases), I might again have to go for a detailed phenotype algorithm that considers all domains to identify whether a patient has T2DM. More rules means a better, more accurate estimate of the T2DM population in our DB, right? Is there any better way to do this?

b) So for the xSpec cohort, should I use a rule-based phenotype algorithm (e.g. PheKB) to identify highly likely positive cases? I would identify positive and negative cases using PheKB, meaning I would then have a label (better estimates for the xSpec, because we know PheKB is validated across sites). Any suggestions here?

c) On what basis, and how, do you usually define xSpec and xSens cohorts at your end? Do you seek input from clinicians, or use a phenotype algorithm to define these cohorts? Have you found that sticking to condition-code counts (>= 10 or >= 5) gives good estimates of the positive cases in your population?

d) Now let's say I have the results of a locally developed custom phenotype algorithm, and it has returned, for example, 4900 patients as T2DM. To verify/assess the performance of this new local algorithm, should I compare them with the PheKB labels and compute sensitivity, specificity, etc.?

e) As we don't have ground truth (provided by clinicians), we rely on PheValuator to provide probabilistic values of whether subjects belong to the positive or negative class. Am I right?

f) In a case like (e) above, how can the labels generated using the PheValuator model be called a probabilistic gold standard? Can they only be called a gold standard once clinicians review them? I know you use the term "probabilistic", but is it a gold standard? Why is the term "gold standard" used at all?

g) Target cohort = xSpec (57 subjects) + noisy negatives (4237 subjects); outcome cohort = label 1 (57 subjects), label 0 (4237 subjects).

But in reality, I know there are a lot of T2DM cases (positive, label 1) among these 4237 subjects (who are noisy negatives based on my xSpec and xSens cohort definitions).

h) I understand the number of subjects in each cohort is purely based on the cohort definitions we have for xSpec, xSens, and the noisy negatives. Right?

i) I might be wrong, and I kindly request you to correct me here: the only way to get accurate estimates for the three cohorts may be through a phenotype algorithm (because only they give better estimates than rules like >= 5 codes or >= 10 codes).

j) So it's like using a phenotype algorithm like PheKB to build cohorts and then assessing the performance of a new phenotype algorithm ("local algo"). Are we trying to do something like this?

k) To me this approach seems problematic, because there might be scenarios where we are interested in assessing the performance of the PheKB algorithm itself. In such a case, I cannot create an xSpec cohort based on PheKB and then assess the performance of PheKB against it; that's not useful. Kindly correct me here; I feel like I'm stuck in an infinite loop. Haha

Code errors identified by R check

In the R check results, under "checking R code for possible problems", there are some issues I think are very serious. For example:

createAcutePhenotypeModel: no visible global function definition for
  ‘errorCheck’

The createAcutePhenotypeModel() function calls errorCheck(), but that function is not defined in the package. Instead, it is in the extras folder and would not be available to the user at runtime.

So right now if the user calls createAcutePhenotypeModel() this will generate an error.

Cannot change working directory

Hello Everyone,

I tried the PheValuator package and, with the help of @jswerdel, was able to overcome the issues during execution. I am now encountering another error after feature extraction. The below is from errorReport.txt:

DBMS:
postgresql

Error:
cannot change working directory

SQL:
SELECT *
FROM (
SELECT row_id, covariate_id, covariate_value FROM cov_1 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_2 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_3 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_4 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_5 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_6 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_7 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_8 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_9 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_10 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_11 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_12 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_13 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_14 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_15 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_16 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_17 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_18 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_19 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_20 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_21 UNION ALL
SELECT row_id, covariate_id, covariate_value FROM cov_22
) all_covariates;

My temp tables are generated as well.

Can you help me understand the issue? Where is this table located, and how can I resolve this error? Which file should I look into to fix it?

CreateEvaluationCohort - xSensCohort parameter is missing

Hi,

I was trying to execute createEvaluationCohort and found that xSensCohort is a mandatory argument for this function, but it is missing from the docs and tutorials.

Below is the error message that I got:

Error in PheValuator::createEvaluationCohort(connectionDetails = connectionDetails, : argument "xSensCohort" is missing, with no default

However, after including this parameter, the error no longer occurs.

Population Size from temp tables?

Hello @jswerdel,

While trying to inspect the other errors, I came across another scenario: a screenshot (not reproduced here) shows a reported "Population Size". May I know where PheValuator gets this population size from?

I understand that the "cases" count comes from the xSpec cohort, but where does the "Population Size" count come from?

It is not from the xSens cohort or the total number of records in our DB, for sure, because the count from xSens is ~4100 and the DB count is ~5200.

When I looked at the temp table counts, none of the tables have this count of 3849.

Can I kindly request you to help us understand this? Where does it get the population size from?

One more question: where can I find info on the minimum number of subjects required to build a model? Is it defined anywhere in the code? I understand that a model built using 50 cases may not be useful; is there anywhere in the code where you have defined this condition?

Function and argument naming

Not all function and argument names are in line with the OHDSI code style. For example, estPPV should be estPpv.

Also, I highly recommend avoiding abbreviations when possible to make code easier to read. For example:

  • estPpv could be estimatedPpv
  • createPhenoModel could be createPhenotypeModel
  • exclCohort could be xSensCohort
  • mainPopnCohort could be ? (not sure what this argument does)

etc.

randomSplitter error

Hello,

I encountered the below error while trying to execute the createEvaluationCohort function in PheValuator:

Warning: This function is deprecated. Use 'randomSplitter' instead.
Error in randomSplitter(population = population, test = test, train = train,  : 
  Outcome only occurs in fewer than 10 people or only one class

Though this error didn't occur when I ran the same code a couple of days back with the same cohort definitions, I am just trying to understand why it happens now. My outcome (xSpec) cohort has 2324 patients, so the outcome occurs in more than 2000 cases. Why does the error message say it occurs in fewer than 10 people? Moreover, why doesn't it pull the negative cases from the population? Earlier, during model creation, we saw that the code would pull the corresponding negative cases to match the prevalence. May I know why that doesn't happen now?

I see that the population size is set to the case count. Can I kindly check with you on this, please? I might be wrong here, but won't whoever runs the code encounter the same issue? If our population size is the case count (one class only), the negative cases aren't considered at all. Your input to help me understand this would be very much appreciated.

Warning - No Non-Null Coefficients

Hello @jswerdel ,

While I tried to execute the CreatePhenotypeModel, I had the below warning messages.

  1. Can you help me understand what they mean and why they occur?

One confusing part is that I didn't make any changes to my cohorts (xSpec, xSens) for this function, and I was able to see the AUC etc. when I ran it 2-3 times. But when I restarted R and ran it again, I got the below warning messages and there was no output. Can you help?

In other words, this happens intermittently, not always.

Warning: No non-zero coefficients
Getting predictions on train set
Warning: Model had no non-zero coefficients so predicted same for all population...
Prediction took 0.006 secs
Warning: Evaluation not possible as prediciton NULL or all the same values

  2. Can xSpec and xSens be identical? Meaning, for example, that the 30 subjects present in xSpec are the only subjects present in xSens (i.e., xSens minus xSpec = 0 subjects)?

I know this seems practically possible, but does the tool allow it?

Require clarification on Terminologies - PheValuator

Hi @jswerdel ,

I know we had a brief discussion on this topic. Anyway, I was also referring to your other post here, which got me a bit confused again because of the varied terminology used in the forum, docs, YouTube, etc. (it could be that I misunderstood as well). I have tried to read up online, break things down step by step, and cover my questions in detail. So I kindly request you to read the full post once (only because I am not sure whether the questions are ordered in the proper sequence context-wise), so that it is easy for you to understand and respond accordingly. I am picking the example from the forum user:

Population size: 10K
Lung cancer: 9900 (at least 1 code)
No lung cancer: 100

a) xSpec: this will consist only of subjects who are highly likely to have the HOI (lung cancer); it is also called the noisy positives cohort.

We expect subjects to have at least 10 codes (depending on the condition we study) to be considered highly likely to have the HOI. Let's say we have 4500 people with at least 10 codes.

Q1) But why isn't it a gold standard? When a subject has more than 10 condition codes for lung cancer in their timeline, we know for sure, and can be confident, that they experienced the HOI. Can there be any other interpretation? Sorry, I am not from a healthcare background, so your input to help me understand why it isn't a gold standard would be much appreciated.

b) xSens: this cohort is created by considering subjects who have 1 or more condition codes.
Let's say we have 9900 people with at least 1 code (4500 with >= 10 codes + 5400 with < 10 codes).

Q1) There will always be overlap between the xSpec and xSens cohorts, provided the xSpec cohort fetches records based on our condition. In our example, the 4500 people with >= 10 codes are also present in xSens (along with the remaining 5400 people). Am I right?

Q2) xSens is different from the noisy negatives cohort; they are two different cohorts. Am I right?

Q3) But there will never be any overlap between the xSens and noisy negatives cohorts. Am I right?

Q4) In the YouTube videos I see that the xSens cohort is called the "Probably No" cohort, yet its subjects do have evidence of the HOI. Why is it then called the "Probably No" cohort?

c) Noisy negatives: this cohort consists of subjects who are not present in the xSens cohort. Am I right?

Q1) In our lung cancer example, the noisy negatives cohort will consist of 100 subjects. Am I right?

d) Prevalence cohort

Q1) Here we use the xSens cohort because that's the cohort which gives a proper estimate of the prevalence of lung cancer in our data, which is 9900. I guess there will almost always be no reason to use any other cohort (like xSpec) as the prevalence cohort, because they give inaccurate estimates. Am I right?

e) Target cohort

Q1) I see in the doc that the target cohort is built as below:

    Target cohort = xSpec + noisy negatives (lung cancer example: 4500 + 100 = 4600)

But may I know why this combination for the target cohort? Is it because ensuring a proper mix of noisy positives (highly likely to have the HOI) and noisy negatives (highly likely not to have the HOI) helps us study the characteristics of both classes better?

f) Outcome cohort

Q1) What do we mean by outcome here? What is the outcome that we are looking for? E.g., in the forum example, if it's lung cancer, are we looking for outcomes of lung cancer in the target cohort?

Q2) So how is this outcome cohort created? I couldn't find this anywhere in the doc.

g) PLP model

Q1) PheValuator uses a LASSO regression model, which is a supervised learning method requiring labels. Am I right?

Q2) It trains the model based on the target cohort of 4600 subjects (lung cancer) and their variables. But how are the labels generated?

Q3) Then again, in the YouTube video I see that the model is built based on xSpec and xSens. Is that how it works? But then why is the target cohort built using xSpec and the noisy negatives? Confusing.

Q4) The trained model is then evaluated on the evaluation cohort (which was already used during the training phase; see below)?

h) Evaluation cohort

Q1) How is this evaluation cohort created? I understand there is a function called createEvaluationCohort in the package, and it uses a function parameter called the xSpec cohort; there is no other cohort involved. Should I infer that the evaluation cohort will also have 4500 subjects (the same as xSpec)?

But in the doc, I see that the "evaluation cohort [is] a large group of randomly selected subjects to be used to evaluate the phenotype algorithms (PA)".

May I kindly request you to help me understand how this is random? Keeping the imbalanced-dataset issue and PheValuator's limitations aside, can you help me understand how this cohort is built in our lung cancer example?

Q2) So are we evaluating our PLP model on the subjects in this evaluation cohort to produce the probability outputs? Am I right? But aren't these subjects already used during PLP model creation?

i) Main Population cohort

Q1) May I know what the use of this cohort is? Is it just about defining a cohort that contains all the subjects in the database? In the lung cancer example, that is 10K. Am I right? Should I just create a cohort in ATLAS that contains all the subjects in our database?

Unable to run Chronic Phenotype Model

Hello @jswerdel ,

I see that you have modified the package and its functions. Thanks for your effort.

But while trying to run createChronicPhenotypeModel, I got the below error:

2020-03-30 17:04:02 Coercing LHS to a list
2020-03-30 17:04:02 Coercing LHS to a list
Connecting using PostgreSQL driver

Constructing the at risk cohort
|==========================================================================================================================| 100%
Executing SQL took 0.073 secs
Fetching cohorts from server
Loading cohorts took 0.316 secs
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
org.json.JSONException: JSONObject["temporal"] is not a Boolean.

ML models in Phevaluator

Hello,

Does PheValuator currently only support the settings below?

a) LASSO model
b) 75:25 train/test split
c) Default hyperparameters for the LASSO

Will other models be supported in upcoming versions, or can we change models in the current version as well? I don't see any function parameter for that, though.

Time taken to create Phenotype model features

Hello @jswerdel,

May I check with you on how long createPhenoModel can take? I understand it depends on dataset size, but our dataset has only 5200 subjects.

When I execute my xSpec cohort in ATLAS, it takes 20 minutes to produce its output (4786), and the xSens cohort also takes 20 minutes to produce its output (3540). They have many rules; that might be why they take so long.

But will the ATLAS cohort-creation time affect the createPhenoModel execution as well?

Because for ~4200-4700 cases, it has been running for more than an hour and still isn't done; it has been stuck at 24% for a long time.

I tried on both a laptop and a desktop; it's the same.

Or is it because I have around 15-20K concepts to exclude? (I have only parent concepts.)

Is that why it's taking so long? I tried with both concept_identifier_list and included_concept_identifier_list.

Though I have addDescendantsToExclude = TRUE, the model throws a NULL warning when I use concept_identifier_list.

Whereas when I use included_concept_identifier_list, it works, but it has still been executing for a long time.

Is this expected?

Can you help me understand this?

Why cohort ends after 1 day of start date?

Hello Everyone,

In our dataset, the observation period start date is 1900-01-01 and the observation period end date is 3900-01-01 for all patients.

They do have diagnoses/labs/drugs etc. recorded in the 21xx years.

I understand the way we have defined the observation period will create issues for age (which we will handle internally), but can you help me with the questions below?

For example, cohort ID 103 was designed as shown in a screenshot (not reproduced here). When this is the case, why do I see a temp table (in the scratch schema) being generated as shown in a second screenshot (also not reproduced here)?

  1. Why does the cohort end just 1 day after the start date?

  2. Why do I see subjects under cohort_id = 0? (My 3 cohorts only have IDs 103, 104, and 105.)

  3. Why do I get a different number of subjects in this temp table every time I run the script?

In the first run it was 5382, and in the second run it was 5376.

ERROR: relation "ohdsi.cohort_person" does not exist

I am running PheValuator for the first time.

CovSettings <- createDefaultCovariateSettings(startDayWindow1 = 0,
                                              endDayWindow1 = 10,
                                              startDayWindow2 = 11,
                                              endDayWindow2 = 20,
                                              startDayWindow3 = 21,
                                              endDayWindow3 = 30)

CohortArgs <- createCreateEvaluationCohortArgs(xSpecCohortId = 47,
                                               xSensCohortId = 12,
                                               prevalenceCohortId = 87,
                                               evaluationPopulationCohortId = 73,
                                               covariateSettings = CovSettings)

conditionAlg1TestArgs <- createTestPhenotypeAlgorithmArgs(phenotypeCohortId = 1778259)

analysis1 <- createPheValuatorAnalysis(analysisId = 1,
                                       description = "PD(prevalent)",
                                       createEvaluationCohortArgs = CohortArgs,
                                       testPhenotypeAlgorithmArgs = conditionAlg1TestArgs)

pheValuatorAnalysisList <- list(analysis1)

I got the following:

Error in .createErrorReport():
! Error executing SQL:
org.postgresql.util.PSQLException: ERROR: relation "ohdsi.cohort_person" does not exist.


Does anyone experience this issue? I would appreciate guidance on how to fix it.
Thank you.
A.

Vignette issues

There are some issues with the vignette:

  1. The ATLAS links don't work
  2. Results are not included. Specifically, the sentence "The results from above will look like:" is followed by an empty line.

Also, if I may make a recommendation: do not describe all the arguments of the functions in the vignette; that is what the function reference is for. Only describe the arguments the user really has to change (like the cohort IDs).

Allow user to specify outTable

Currently PheValuator generates the table name itself. Why not let the user pick it (may default to current value)?

Also: is a temp table not an option?

Question - Expected Value - Cut points

Hi,

I have been reading about the Phevaluator package and it's amazing.

But I have a quick question: what does "EV" (Expected Value) mean under the "cutPoints" parameter?

I understand we usually choose a threshold of 0.5 to discriminate between the classes (positive or negative). Here, cut points are nothing but thresholds, am I right?

>= 0.5: positive
< 0.5: negative

But may I know what "EV" means? What value is chosen to discriminate between the classes?

How different is it from the other threshold values like 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, etc.?

Can you help with this?
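For reference, the README's Expected Value description implies the difference: at a numeric cut point each subject is counted whole, while under EV the probabilities themselves are summed, so no single discriminating value is chosen. A sketch consistent with that description (the package's exact cut-point handling may differ):

    # Predicted probabilities for the subjects the PA includes
    p <- c(0.9, 0.7, 0.3)

    # Cut point 0.5: whole counts above/below the threshold
    tpThreshold <- sum(p >= 0.5)   # 2
    fpThreshold <- sum(p < 0.5)    # 1

    # Expected Value: no threshold; each subject contributes fractionally
    tpEV <- sum(p)                 # 1.9
    fpEV <- sum(1 - p)             # 1.1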

Row_number logic in cohort table query

Hi,

May I kindly request you to help me understand the logic of generating the row number, and what it does? I understand that you concatenate a random number with the current timestamp, but why can't a normal row number be used?

Why are CHECKSUM and MD5 used? I understand these functions/utilities are normally used during file transfers to verify the integrity of files.

select co.*, p.*,
      row_number() over (order by ABS(CHECKSUM(MD5(RANDOM()::TEXT || CLOCK_TIMESTAMP()::TEXT))) % 123456789) rn
    from s1.depat co
    join s2.person p
      on co.subject_id = p.person_id

I understand that this is not just about generating sequential row numbers, but unfortunately PostgreSQL doesn't have functions like CHECKSUM, so it all fails.

Is there a generalized way to write this row number that works in PostgreSQL as well? Will it create a problem if I generate random row numbers without the CHECKSUM or ABS functions, just as shown below?

MD5(RANDOM()::TEXT || CLOCK_TIMESTAMP()::TEXT) #this works in postgresql

(RANDOM()::TEXT || CLOCK_TIMESTAMP()::TEXT) # this works as well

ABS cannot be used because CHECKSUM fails in PostgreSQL, which in turn means the division by 123456789 cannot be done either.

But I see that this rn (row number) is used in several WHERE clauses in the query file CreateCohortsV6.sql. So can you help me understand this?

Do you have any suggestions here? Or am I making any mistakes?

Future dates not supported

Hello,

We shifted date records into the future (like 2500-01-01, etc.) in our data source (retaining the chronological order) to maintain privacy.

When I executed the createPhenotypeModel function, I saw the below error message:

Error: Error executing SQL:
org.postgresql.util.PSQLException: ERROR: function pg_catalog.date_part(unknown, unknown) is not unique
  Hint: Could not choose a best candidate function. You might need to add explicit type casts.
  Position: 817

Upon investigating the error report, I saw that future dates are set to 1900, which ends up returning zero records, and the date condition applied to these zero records results in the above error. I manually executed the SQL code provided in the attached error report.

errorReport.txt

Can you help us with this?
