
ctsit / nacculator


Converts a CSV data file exported from REDCap into the NACC's UDS3 fixed-width format.

License: BSD 2-Clause "Simplified" License

Python 99.90% Makefile 0.01% JavaScript 0.03% Shell 0.07%
python csv adrc redcap

nacculator's Introduction

NACCulator


NACCulator is a Python 3-based data converter that changes REDCap .csv exported data to NACC’s fixed-width .txt format. It is configured for UDS3 forms, including FTLD and LBD (versions 3.0 and 3.1). It will perform basic data integrity checks during a run: verifying that each field is the correct type and length, verifying that there are no illegal characters in the Char fields, verifying that Num fields are within the acceptable range as defined in NACC's Data Element Dictionary for each form, and checking that no blanking rules have been violated. NACCulator outputs a .txt file that is immediately ready to submit to NACC's database.
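To illustrate what "fixed-width" means here, the following minimal sketch writes values into 1-indexed column spans. The spans are the header positions that appear in the sample generator output quoted in the issues further down this page. The Field tuple and render_line function are hypothetical names used only for illustration, not NACCulator's actual API, and the left-justified padding is an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for a form field: name plus a 1-indexed, inclusive column span.
Field = namedtuple("Field", ["name", "start", "end"])

HEADER = [
    Field("PACKET", 1, 2),
    Field("FORMID", 4, 6),
    Field("FORMVER", 8, 10),
    Field("ADCID", 12, 13),
    Field("PTID", 15, 24),
]


def render_line(fields, values):
    """Place each value at its column span, leaving unused columns as spaces."""
    width = max(f.end for f in fields)
    line = [" "] * width
    for f in fields:
        length = f.end - f.start + 1
        text = str(values.get(f.name, ""))
        if len(text) > length:
            raise ValueError(f"{f.name} is longer than {length} characters")
        # Left-justify within the span (an assumption; real padding rules may differ).
        line[f.start - 1:f.end] = text.ljust(length)
    return "".join(line)


row = {"PACKET": "I", "FORMID": "A1", "FORMVER": "3", "ADCID": "41", "PTID": "110001"}
line = render_line(HEADER, row)
print(repr(line))
```

Every output line has the same length, so a reader can find a field by column number alone.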

Note: NACCulator requires Python 3.

HOW TO Convert from REDCap to NACC

To install NACCulator, run:

$ pip3 install git+https://github.com/ctsit/nacculator.git

Once the project data is exported from REDCap to the CSV file data.csv, run:

$ redcap2nacc <data.csv >data.txt

This command works only in the simplest case: UDS3 IVP data only. NACCulator automatically skips PTIDs with errors, so the output data.txt file will be ready to submit to NACC. To filter the data in the csv properly, NACCulator expects the REDCap visit names (the redcap_event_name column) to contain certain keywords:

  • "initial" for all initial visit packets (including telephone and optional modules such as lbd)

  • "follow" for all followups (including version 3.1 telephone and optional modules)

  • "milestone" for milestone packets

  • "neuropath" for neuropathology packets

  • "tele" for old (version 3.0) telephone followups

  • "covid" for covid-related survey packets
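The keyword matching described above can be sketched as follows. This is only an illustration: the classify_event function and the packet labels are hypothetical, and the case-insensitive substring match and first-match-wins order are assumptions. How NACCulator itself resolves an event name that contains more than one keyword is not covered here.

```python
# Keywords come from the list above; the labels on the right are
# illustrative names, not NACCulator's internal identifiers.
EVENT_KEYWORDS = [
    ("initial", "initial visit"),
    ("follow", "follow-up visit"),
    ("milestone", "milestone"),
    ("neuropath", "neuropathology"),
    ("tele", "telephone follow-up (v3.0)"),
    ("covid", "covid survey"),
]


def classify_event(redcap_event_name):
    """Return the label of the first keyword found in the event name, else None."""
    name = redcap_event_name.lower()
    for keyword, label in EVENT_KEYWORDS:
        if keyword in name:
            return label
    return None


print(classify_event("initial_visit_year_arm_1"))
print(classify_event("screening_arm_1"))
```

An event name that matches none of the keywords returns None, which is a quick way to spot REDCap events that would need renaming.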

NACCulator collects data from the Z1X form first and uses that to determine the presence of other forms in the packet. The Z1X form for that record must be marked "Unverified" or "Complete" for NACCulator to recognize the record, and each optional form must be marked as submitted within the Z1X for NACCulator to find those forms.

Note: output is written to STDOUT and errors are written to STDERR. Input is read from STDIN (for example, redirected from a file with <) unless a file is specified using the -file flag.

Usage

$ redcap2nacc -h
usage: redcap2nacc [-h]
                   [-fvp | -ivp | -tip | -tfp | -tfp3 | -np | -np10 | -m | -cv | -csf | -f {cleanPtid,replaceDrugId,fixHeaders,fillDefault,updateField,removePtid,removeDateRecord,getPtid}]
                   [-lbd | -lbdsv | -ftld] [-file FILE] [-meta FILTER_META] [-ptid PTID]
                   [-vnum VNUM] [-vtype VTYPE]

Process redcap export data through nacculator.

optional arguments:
  -h, --help            show this help message and exit
  -fvp                  Set this flag to process as FVP data
  -ivp                  Set this flag to process as IVP data
  -tfp                  Set this flag to process as Telephone Followup Packet v3.2 data
  -tip                  Set this flag to process as Telephone Initial Packet data
  -tfp3                 Set this flag to process as TFP v3.0 (pre-2020) data
  -np                   Set this flag to process as Neuropathology version 11 data
  -np10                 Set this flag to process as Neuropathology version 10 data
  -m                    Set this flag to process as Milestone data
  -cv                   Set this flag to process as COVID data
  -csf                  Set this flag to process as NACC BIDSS CSF data

  -f {cleanPtid,replaceDrugId,fixHeaders,fillDefault,updateField,removePtid,removeDateRecord,getPtid}, --filter {cleanPtid,replaceDrugId,fixHeaders,fillDefault,updateField,removePtid,removeDateRecord,getPtid}
                          Set this flag to run the data through the chosen filter
  -lbd                  Set this flag to process as Lewy Body Dementia data (FORMVER = 3)
  -lbdsv                Set this flag to process as Lewy Body Dementia short version data (FORMVER = 3.1)
  -ftld                 Set this flag to process as Frontotemporal Lobar Degeneration data

  -file FILE            Path of the csv file to be processed
  -meta FILTER_META     Input file for the filter metadata (in case -filter is used)
  -ptid PTID            Ptid for which you need the records
  -vnum VNUM            Visit number for which you need the records
  -vtype VTYPE          Visit type for which you need the records

Example - Process a Neuropathology form:

$ redcap2nacc -np -file data.csv >data.txt

Example - Processing LBD Follow-up visit packets:

$ redcap2nacc -lbd -fvp -file data.csv >data.txt

Both LBD / LBDSV and FTLD forms can have IVP or FVP arguments.

Example - Run data through the cleanPtid filter:

$ redcap2nacc -f cleanPtid -meta nacculator_cfg.ini <data.csv >filtered_data.csv

HOW TO Filter Data Using NACCulator

If your data is not clean enough to be processed by NACCulator, there are some built-in functions to clean (read: transform) the data.

To use the filters, first check that nacculator_cfg.ini has the proper settings for each filter you want to run. To create this file, find the nacculator_cfg.ini.example file, remove the .example suffix, and fill in your center's information. The config file contains one section per filter, named after the in-code filter function. Each section holds the elements that filter needs to run. The filter descriptions below state what is required, if anything.

The filters can be run all at once with your REDCap API token using:

$ nacculator_filters nacculator_cfg.ini

You can find more details on nacculator_filters under the section: HOW TO Acquire current-db-subjects.csv for the filters

RUNNING ALL FILTERS ON A LOCAL FILE

REDCap has an export size limit that can be exceeded with a large project like the ADRC. When the size of the project surpasses the REDCap limit, the nacculator_filters command will no longer work. The data must be manually exported from the project in chunks (whether by event or by ptid). However you choose to export the data, keep in mind that all of the fields in a packet need to be present in the input csv you use. So, for example, the A1 and A2 forms in the IVP cannot be exported and run separately through NACCulator.

You can still run all the filters using your config file on a REDCap-exported csv, even when not using nacculator_filters. The command to use this filter locally is:

$ python3 nacc/local_filters.py nacculator_cfg.ini redcap_input.csv

where redcap_input.csv is the location of the file you want to filter. The filter will then run as normal, creating a run_CURRENT-DATE folder and depositing each stage of the filter process in this folder. The final output of the filter process is a csv file called final_Update.csv which can then be run through NACCulator.

RUNNING INDIVIDUAL FILTERS

The filters can also be run one at a time on a .csv file with the -f and -meta flags.

For example, to run the fixHeaders filter:

$ redcap2nacc -f fixHeaders -meta nacculator_cfg.ini <data_input.csv >filtered_output.csv

If the filter requires the config, it must be passed with the -meta flag like the example above shows.

  • cleanPtid

    This filter requires a section in the config called filter_clean_ptid. This section contains a single key, filepath, which points to a csv file (usually called current-db-subjects.csv) of ptids to be removed. Every record whose ptid, packet type, and visit number match an entry in the meta file is discarded from the output file. This filter also removes events that lack a visit number in REDCap.

    Example meta file:

    Patient ID,Packet type,Visit Num,Status
    110001,I,1,Current
    110001,M,M1,Current
    110003,I,001,Current
    110003,F,002,Current
    
  • replaceDrugId

    This filter replaces the first character of each non-empty field in columns drugid_1 to drugid_30 with the character "d".

  • fixHeaders

    This filter requires a section in the config called filter_fix_headers with as many keys as needed to replace the necessary columns. See example below. This filter fixes the column names of any column found in the filter mapping. This filter does not check for any data. It only replaces the column names if found.

    For example, the configuration would look like this:

    [filter_fix_headers]
    c1s_2a_npsylan: c1s_2_npsycloc
    c1s_2a_npsylanx: c1s_2a_npsylan
    b6s_2a1_npsylanx: c1s_2a1_npsylanx
    fu_otherneur: fu_othneur
    fu_otherneurx: fu_othneurxs
    fu_strokedec: fu_strokdec
    fukid9agd: fu_kid9agd
    fusib17pdx: fu_sib17pdx
    
  • fillDefault

    This filter sets some predefined fields to their corresponding predefined values. The current defaults are:

    nogds    -> 0
    formver  -> 3
    

    If a field is blank, it is updated to its default value.

  • updateField

    This filter is used to update fields that already had a value in the REDCap export. Currently, only adcid is updated.

  • fixVisitNum

    This filter is used to ensure that the visitnum field is always an integer. It is currently only accessible from the config file when running all filters.

  • removePtid

    This filter requires a section in the config called filter_remove_ptid with a single key called ptid_format. The value for that key is a regex string to match ptids that are to be kept. 11\d.* keeps all PTIDs that fit the format 11xxxx, such as 110001.

    This filter is used to remove ptids that may have a different set of ids for a different study, or help limit which ids show up in the final result.

    config:
    ptid_format: 11\d.*
    
  • removeDateRecord

    This filter removes records that are missing visit dates. It searches for rows missing the visit day, month, or year, and removes any row where one of those fields is missing.

  • getPtid

    This filter retrieves the records for a single patient ID and has no section in the config file. Use the -ptid flag to specify the patient ID. Add -vnum to restrict the results to a particular visit number, or -vtype to restrict them to a particular visit type.

      $ redcap2nacc -f getPtid -ptid $SOME_PATIENT_ID -vnum $SOME_VISIT_NUM -vtype $SOME_VISIT_TYPE <data.csv >data.txt
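
As a rough illustration of what three of the filters above do to the data (fixHeaders, removePtid, and replaceDrugId), here is a minimal sketch. The function names, the "ptid" column name, and the use of re.match (which anchors the pattern at the start of the ptid) are assumptions made for the example. This is not NACCulator's implementation.

```python
import re


def fix_headers(header, mapping):
    """Rename any column found in the mapping; leave all other columns untouched."""
    return [mapping.get(col, col) for col in header]


def keep_ptid(row, ptid_format):
    """Return True if the row's ptid matches the regex given as ptid_format."""
    return re.match(ptid_format, row["ptid"]) is not None


def replace_drug_id(row):
    """Replace the first character of non-empty drugid_1..drugid_30 with 'd'."""
    for i in range(1, 31):
        col = f"drugid_{i}"
        if row.get(col):
            row[col] = "d" + row[col][1:]
    return row


print(fix_headers(["ptid", "fu_otherneur"], {"fu_otherneur": "fu_othneur"}))
print(keep_ptid({"ptid": "110001"}, r"11\d.*"))
print(replace_drug_id({"drugid_1": "a12345", "drugid_2": ""}))
```

Each function works on one header list or one row at a time, so the same functions could be applied while streaming through a large export.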
    

HOW TO Acquire current-db-subjects.csv for the filters

This file is a csv that determines which of your center's PTIDs are already present in NACC's current database using the patient's PTID, the packet type (ivp or fvp, etc), the visit number, and the status (working or current). In order to get it, you need to use the contents of tools/preprocess/get_subject_list.js. The script is meant to be run on the "Finalize Data" page of the NACC UDS3 upload system.

Navigate to "Finalize Data" and right-click anywhere on the page. Select "Inspect" or "Inspect element" to open the browser's Inspect panel. Click on the "Console" tab and copy/paste the contents of get_subject_list.js into the console. Then, press the "Enter" or "Return" key on your keyboard. This will collect all of the PTIDs in your center's Working and Current databases into a csv called current-db-subjects.csv in your Downloads folder. You may then move it to whatever location you specified in your nacculator_cfg.ini file.

The csv is used by the filter_clean_ptid filter to identify and cull all packets already in NACC's Current database from your input csv. It is used to make NACCulator run faster for very large databases.
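
The culling step can be pictured with the sketch below, which builds a lookup set from the subjects csv and tests a packet against it. The function names are hypothetical. Comparing visit numbers with leading zeros stripped and keeping only rows whose Status is "Current" are assumptions, not confirmed behavior of the cleanPtid filter.

```python
import csv
import io


def load_current_keys(meta_file):
    """Collect (ptid, packet type, visit number) keys for rows marked Current."""
    keys = set()
    for rec in csv.DictReader(meta_file):
        if rec["Status"] == "Current":
            # Assumption: "001" and "1" refer to the same visit.
            keys.add((rec["Patient ID"], rec["Packet type"], rec["Visit Num"].lstrip("0")))
    return keys


def already_in_current(ptid, packet, visitnum, keys):
    """Return True if this packet is already listed in the subjects csv."""
    return (ptid, packet, str(visitnum).lstrip("0")) in keys


# Sample rows in the same layout as the example meta file shown earlier.
meta = io.StringIO(
    "Patient ID,Packet type,Visit Num,Status\n"
    "110001,I,1,Current\n"
    "110003,F,002,Current\n"
    "110004,I,001,Working\n"
)
keys = load_current_keys(meta)
print(already_in_current("110003", "F", "2", keys))
print(already_in_current("110004", "I", "1", keys))
```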

Example Workflow

Once you have edited the nacculator_cfg.ini file with your API token and desired filters, you can get a filtered CSV file of the raw REDCap data with:

$ nacculator_filters nacculator_cfg.ini

This will create a run folder labeled with the current date ($run_CURRENT-DATE) (for example, run_01-01-2000) that contains the raw csv and the output of each filter stage, ending with final_Update.csv.

Note: The files created by redcap2nacc will not be in the run folder created by run_filters.py. They will be in the base directory. The filepaths in the following commands are modified so that the output is deposited in your $run_CURRENT-DATE folder.

Next, run the redcap2nacc program itself to produce the fixed-width text file for NACC. Only one packet-type flag can be used at a time, so the program must be run once for each type of packet.

$ redcap2nacc -ivp < $run_CURRENT-DATE/final_Update.csv > $run_CURRENT-DATE/iv_nacc_complete.txt 2> $run_CURRENT-DATE/ivp_errors.txt
$ redcap2nacc -fvp < $run_CURRENT-DATE/final_Update.csv > $run_CURRENT-DATE/fv_nacc_complete.txt 2> $run_CURRENT-DATE/fvp_errors.txt

This will place the text files (iv_nacc_complete.txt and fv_nacc_complete.txt) in the run folder created earlier, along with a log of each run that contains any errors found (ivp_errors.txt and fvp_errors.txt).
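
If you prefer to script this step, the two shell commands above could be wrapped roughly as follows. This is a sketch: run_packet and output_paths are hypothetical helpers, the file naming copies the example commands, and the only redcap2nacc options used are the packet flags already shown.

```python
import subprocess
from pathlib import Path


def output_paths(run_dir, flag, prefix):
    """Return the (output, error log) paths used in the example commands."""
    run_dir = Path(run_dir)
    return (
        run_dir / f"{prefix}_nacc_complete.txt",
        run_dir / f"{flag.lstrip('-')}_errors.txt",
    )


def run_packet(flag, run_dir, prefix):
    """Run redcap2nacc for one packet flag with stdin, stdout, and stderr redirected."""
    out_path, err_path = output_paths(run_dir, flag, prefix)
    with open(Path(run_dir) / "final_Update.csv") as src, \
            open(out_path, "w") as out, \
            open(err_path, "w") as err:
        return subprocess.run(["redcap2nacc", flag], stdin=src, stdout=out, stderr=err)


# Example call (requires redcap2nacc to be installed and the run folder to exist):
#   for flag, prefix in [("-ivp", "iv"), ("-fvp", "fv")]:
#       run_packet(flag, "run_01-01-2000", prefix)
```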

Development

Quickstart

$ git clone https://github.com/ctsit/nacculator.git nacculator
$ cd nacculator
$ python3 -mvenv venv
$ source venv/bin/activate
$ pip install -e .

Files

This is not exhaustive, but here is an explanation of some important files.

  • nacc/: top-level Python package for all things NACC.

  • nacc/redcap2nacc.py: converts a CSV data file exported from REDCap into NACC's UDS3 fixed-width format.

  • nacc/uds3/blanks.py: specialized library for "Blanking Rules".

  • nacc/uds3/ivp/forms.py: UDS3 IVP forms represented as Python classes.

  • tools/generator.py: generates Python objects based on NACC Data Element Dictionaries in CSV. Used by developers to update the existing forms.py files as necessary.

  • nacculator_cfg.ini: configuration file for the filters, built from nacculator_cfg.ini.example in the root nacculator/ directory.

  • nacc/run_filters.py and tools/preprocess/run_filters.sh: pulls data from REDCap based on the settings found in nacculator_cfg.ini (for .py) and filters_config.cfg (for .sh).

Testing

To run all the tests:

$ python3 -m unittest

To run only the tests in a specific file:

$ python3 tests/test_$SPECIFIC_FILE.py

Generating Forms

Warning: the generator is currently broken due to changes in the CSV format.

You only need to generate forms when there are new DEDs from NACC. The NACCulator install includes the current forms automatically.

Before running the generator, read the warnings in ./nacc/uds3/ivp/forms.py first.

$ python3 tools/generator.py tools/uds3/ded/csv/ >nacc/uds3/ivp/forms.py
$ edit nacc/uds3/ivp/forms.py

Note: execute generator.py from the folder that contains the corrected folder, which should hold any "corrected" DEDs.

Resources

nacculator's People

Contributors

ajantharamineni, alyxia913, cooperm125, devmattm, emilyolsen246, kevinhanson, mbentz-uf, melimore86, naomidb, rtdtwo, s-emerson, sinarest1608, takirala, ufl-taeber


nacculator's Issues

Update venv instructions in README

venv does not follow standard naming conventions for python virtual environments. Under Quickstart the instructions should be updated to change python3 -mvenv venv to be python3 -m venv .venv.

$ git clone https://github.com/ctsit/nacculator.git nacculator
$ cd nacculator
$ python3 -mvenv venv
$ source venv/bin/activate
$ pip install -e .

Nacculator erroring and stopping

Nacculator needs to handle errors better.

Currently it raises an exception and then stops processing. It should instead process everything it can and generate two files: one with the output it could produce, and an error log that shows what it could not process and why.
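
One way to picture the requested behavior is the loop below, which keeps going after a failure and returns the successes and the error messages separately. The convert_all function and its convert_one argument are hypothetical. This sketches the idea rather than describing how NACCulator handles errors.

```python
def convert_all(records, convert_one):
    """Convert every record, collecting failures instead of stopping at the first."""
    converted, errors = [], []
    for index, record in enumerate(records, start=1):
        try:
            converted.append(convert_one(record))
        except Exception as exc:  # broad on purpose: log the failure and continue
            errors.append(f"record {index}: {type(exc).__name__}: {exc}")
    return converted, errors


# int() stands in for a real conversion function here.
out, errs = convert_all(["1", "x", "3"], int)
print(out)
print(errs)
```

The two returned lists map directly onto the two files described in the issue: one for output, one for the error log.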

When the row is blank in csv file, throw an exception

NACCulator throws an exception when a row in the csv file is blank. This can happen because Excel tends to insert blank rows when a csv file is opened with it. I was unable to replicate this, but it has happened several times. A simple fix is to check whether the row is empty and, if so, discard it.
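
The suggested fix might look like the following sketch, where non_blank_rows is a hypothetical helper. It relies on the standard csv module, which already skips completely empty lines, so the extra check targets rows that contain only separators (for example ",").

```python
import csv
import io


def non_blank_rows(csv_file):
    """Yield rows from a csv file, skipping rows whose cells are all empty."""
    for row in csv.DictReader(csv_file):
        if any(isinstance(value, str) and value.strip() for value in row.values()):
            yield row


sample = io.StringIO("ptid,visitnum\n110001,1\n,\n110002,2\n")
rows = list(non_blank_rows(sample))
print([row["ptid"] for row in rows])
```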

`nacc/uds3/fvp/forms.py` fields `MCIN1LAN`, `MCIN1ATT`, `MCIN1EX`, `MCIN1VIS` checking `MCINON2` instead of `MCINON1`

I believe lines 738-741 of nacc/uds3/fvp/forms.py are checking the value of the MCINON2 field instead of MCINON1.

In the blanks list, the last item 'Blank if Question 5b MCINON2 ne 1' should be 'Blank if Question 5c MCINON1 ne 1'.

Bugs:

self.fields['MCIN1LAN'] = nacc.uds3.Field(name='MCIN1LAN', typename='Num', position=(79, 79), length=1, inclusive_range=(0, 1), allowable_values=['1', '0'], blanks=['Blank if Question 2 NORMCOG = 1 (Yes)', 'Blank if Question 3 DEMENTED = 1 (Yes)', 'Blank if Question 5b MCINON2 ne 1'])
self.fields['MCIN1ATT'] = nacc.uds3.Field(name='MCIN1ATT', typename='Num', position=(81, 81), length=1, inclusive_range=(0, 1), allowable_values=['1', '0'], blanks=['Blank if Question 2 NORMCOG = 1 (Yes)', 'Blank if Question 3 DEMENTED = 1 (Yes)', 'Blank if Question 5b MCINON2 ne 1'])
self.fields['MCIN1EX'] = nacc.uds3.Field(name='MCIN1EX', typename='Num', position=(83, 83), length=1, inclusive_range=(0, 1), allowable_values=['1', '0'], blanks=['Blank if Question 2 NORMCOG = 1 (Yes)', 'Blank if Question 3 DEMENTED = 1 (Yes)', 'Blank if Question 5b MCINON2 ne 1'])
self.fields['MCIN1VIS'] = nacc.uds3.Field(name='MCIN1VIS', typename='Num', position=(85, 85), length=1, inclusive_range=(0, 1), allowable_values=['1', '0'], blanks=['Blank if Question 2 NORMCOG = 1 (Yes)', 'Blank if Question 3 DEMENTED = 1 (Yes)', 'Blank if Question 5b MCINON2 ne 1'])

Fix:

self.fields['MCIN1LAN'] = nacc.uds3.Field(name='MCIN1LAN', typename='Num', position=(79, 79), length=1, inclusive_range=(0, 1), allowable_values=['1', '0'], blanks=['Blank if Question 2 NORMCOG = 1 (Yes)', 'Blank if Question 3 DEMENTED = 1 (Yes)', 'Blank if Question 5c MCINON1 ne 1'])
self.fields['MCIN1ATT'] = nacc.uds3.Field(name='MCIN1ATT', typename='Num', position=(81, 81), length=1, inclusive_range=(0, 1), allowable_values=['1', '0'], blanks=['Blank if Question 2 NORMCOG = 1 (Yes)', 'Blank if Question 3 DEMENTED = 1 (Yes)', 'Blank if Question 5c MCINON1 ne 1'])
self.fields['MCIN1EX'] = nacc.uds3.Field(name='MCIN1EX', typename='Num', position=(83, 83), length=1, inclusive_range=(0, 1), allowable_values=['1', '0'], blanks=['Blank if Question 2 NORMCOG = 1 (Yes)', 'Blank if Question 3 DEMENTED = 1 (Yes)', 'Blank if Question 5c MCINON1 ne 1'])
self.fields['MCIN1VIS'] = nacc.uds3.Field(name='MCIN1VIS', typename='Num', position=(85, 85), length=1, inclusive_range=(0, 1), allowable_values=['1', '0'], blanks=['Blank if Question 2 NORMCOG = 1 (Yes)', 'Blank if Question 3 DEMENTED = 1 (Yes)', 'Blank if Question 5c MCINON1 ne 1'])

Also check lines 743-746. The last item in the corresponding blanks lists read 'Blank if Question 5b MCINON2 ne 1', but I believe should be 'Blank if Question 5d MCINON2 ne 1'

Thanks.

Invalid literal for Decimal

When processing follow-up visit packets, NACCulator raises the following error any time there is data in the free-text field fu_othcondx:

Traceback (most recent call last):
  File "./nacc/redcap2nacc.py", line 209, in <module>
    main()
  File "./nacc/redcap2nacc.py", line 197, in main
    warnings += check_blanks(packet)
  File "./nacc/redcap2nacc.py", line 33, in check_blanks
    if f.blanks and not empty(f)]:
  File "./nacc/redcap2nacc.py", line 83, in empty
    return field.value.strip() == ""
  File "/Users/kevinhanson/code/nacculator/nacc/uds3/__init__.py", line 76, in value
    return self.udstype(self.val)
  File "/Users/kevinhanson/code/nacculator/nacc/uds3/__init__.py", line 37, in __call__
    decimal.Decimal(value)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/decimal.py", line 547, in __new__
    "Invalid literal for Decimal: %r" % value)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/decimal.py", line 3873, in _raise_error
    raise error(explanation)
decimal.InvalidOperation: Invalid literal for Decimal: 'chronic renal insufficiency depression GDS 14'
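
The traceback shows the emptiness check going through a numeric conversion, which fails on free text. A defensive pattern, sketched here with hypothetical helper names, is to test emptiness on the raw text and to catch decimal.InvalidOperation when converting. This is not necessarily the fix the maintainers applied.

```python
import decimal


def is_blank(raw_value):
    """Decide emptiness from the raw text, before any numeric conversion."""
    return raw_value is None or str(raw_value).strip() == ""


def to_decimal(raw_value):
    """Convert to Decimal, returning None for text that is not a number."""
    try:
        return decimal.Decimal(str(raw_value).strip())
    except decimal.InvalidOperation:
        return None


text = "chronic renal insufficiency depression GDS 14"
print(is_blank(text))
print(to_decimal(text))
print(to_decimal("14"))
```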

FTLD IVP blanking rule throws an error, for fields FTDFDMAM, FTDAMDID

Traceback (most recent call last):
  File "/usr/local/bin/redcap2nacc", line 11, in <module>
    load_entry_point('nacculator==1.7.0', 'console_scripts', 'redcap2nacc')()
  File "/usr/local/lib/python3.6/site-packages/nacculator-1.7.0-py3.6.egg/nacc/redcap2nacc.py", line 559, in main
  File "/usr/local/lib/python3.6/site-packages/nacculator-1.7.0-py3.6.egg/nacc/redcap2nacc.py", line 425, in convert
  File "/usr/local/lib/python3.6/site-packages/nacculator-1.7.0-py3.6.egg/nacc/redcap2nacc.py", line 71, in check_blanks
  File "/usr/local/lib/python3.6/site-packages/nacculator-1.7.0-py3.6.egg/nacc/ftld/blanks.py", line 124, in convert_rule_to_python
Exception: Could not parse Blanking rule: FTDFDMAM
Traceback (most recent call last):
  File "/usr/local/bin/redcap2nacc", line 11, in <module>
    load_entry_point('nacculator==1.7.0', 'console_scripts', 'redcap2nacc')()
  File "/usr/local/lib/python3.6/site-packages/nacculator-1.7.0-py3.6.egg/nacc/redcap2nacc.py", line 559, in main
  File "/usr/local/lib/python3.6/site-packages/nacculator-1.7.0-py3.6.egg/nacc/redcap2nacc.py", line 425, in convert
  File "/usr/local/lib/python3.6/site-packages/nacculator-1.7.0-py3.6.egg/nacc/redcap2nacc.py", line 71, in check_blanks
  File "/usr/local/lib/python3.6/site-packages/nacculator-1.7.0-py3.6.egg/nacc/ftld/blanks.py", line 124, in convert_rule_to_python
Exception: Could not parse Blanking rule: FTDAMDID

The blanking code looks a little wild to me. I'm not sure why it is throwing an error, and I'm not sure how to fix it.

Neuropath flag does not stop nacculator from attempting to process ivp and fvp data

There are two problems with the -np flag when running nacculator:

When running sample data from REDCap through nacculator using the -np flag, Kevin and I noticed that the output .txt file contains a lot of empty space. This is caused by nacculator creating rows for every form for every ptid in the .csv file, regardless of whether the form holds neuropath data. I was expecting nacculator to behave the same way it does for the IVP and FVP packets, where all data not selected by the flag is filtered out and is not printed in the output file.
This example uses all of the REDCap project data from November 2019- all PTIDs, packets, and forms. The neuropath data is processed in the output .txt file, and gives the expected fixed-width format, but NACC will not accept the file for upload because of the empty space.

The other problem is that, once the file is pared down to relevant data, NACC still will not accept the file for upload, because it will not accept the center ID. I have checked, and 1Florida ADRC's ID is still 41, so I'm unsure what is causing this issue. The ID is printed in the correct columns in the .txt file, so that issue can at least be ruled out.

The license file needs to be updated

The license file should be updated to read 2016-2019 and include the 1Florida ADRC in addition to the University of Florida. Is this an MIT license or GNU license?

Install error from current README "python3 -m pip install git+https://github.com/ctsit/cappy.git@2.0.0#egg=cappy-2.0.0"

Console dump:

python3 -m pip install git+https://github.com/ctsit/cappy.git@2.0.0#egg=cappy-2.0.0
Collecting cappy-2.0.0
Cloning https://github.com/ctsit/cappy.git (to revision 2.0.0) to /private/var/folders/q_/01yhg7v96457h0d15qc7ww1m0000gn/T/pip-install-7jndrbyw/cappy-2-0-0_263158473d8e49799a7cc4afa8e8aea1
Running command git clone -q https://github.com/ctsit/cappy.git /private/var/folders/q_/01yhg7v96457h0d15qc7ww1m0000gn/T/pip-install-7jndrbyw/cappy-2-0-0_263158473d8e49799a7cc4afa8e8aea1
Running command git checkout -q 857729e81eca70fae5cb411f9092916d23876d1a
WARNING: Generating metadata for package cappy-2.0.0 produced metadata for project name cappy. Fix your #egg=cappy-2.0.0 fragments.
WARNING: Discarding git+https://github.com/ctsit/cappy.git@2.0.0#egg=cappy-2.0.0. Requested cappy from git+https://github.com/ctsit/cappy.git@2.0.0#egg=cappy-2.0.0 has inconsistent name: filename has 'cappy-2-0-0', but metadata has 'cappy'
ERROR: Could not find a version that satisfies the requirement cappy-2-0-0 (unavailable) (from versions: none)
ERROR: No matching distribution found for cappy-2-0-0 (unavailable)

Check for illegal characters in Char fields and SKIP processing

Example
For LBoANotH on LBD IVP the DED for Form E3L says:

Any text or numbers with the exception of single quotes (‘), double quotes (“), ampersands (&), and percentage signs (%).

If you enter "can't", the NACC upload system will reject this input.

Requested Change
Identify all fields that have this constraint, validate the input, and stop processing the record (SKIP) if the illegal characters are in the fields.
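
A check along these lines could be sketched as below. The function name is hypothetical, and treating the curly-quote variants as illegal in addition to the straight quotes is an assumption based on the characters printed in the DED excerpt.

```python
# Characters named in the DED excerpt: single quotes, double quotes,
# ampersands, and percent signs. The curly forms are included as an assumption.
ILLEGAL_CHARS = set("'\"&%\u2018\u2019\u201c\u201d")


def illegal_characters(value):
    """Return the sorted list of illegal characters found in a Char field value."""
    return sorted(set(value) & ILLEGAL_CHARS)


print(illegal_characters("can't"))
print(illegal_characters("cannot"))
```

A record would be skipped whenever this list is non-empty for any Char field that carries the constraint.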

Milestone forms use inclusive_range instead of allowable_values

In the 'uds3' directory, in the 'm' folder, the forms for the milestone packet are structured in an unusual way compared to the other forms. Instead of "allowable_values" containing the specific values for multiple choice answers, the options are instead contained within "inclusive_range".
Example:
line 49 'FTLDREAS' has 'inclusive_range=(1, 4), allowable_values=[]' but in the DED on NACC's web site, 'FTLDREAS' has specific values 1, 2, 3, and 4 as available answers.
In nacculator's other forms files, a field like this would normally have the inclusive_range and allowable_values filled.

This is indicative of the need to re-generate the forms using the NACC-provided csv file (or a conversion of their pdf file) to ensure correct formatting. Problems like this are also causing some m1_test unit tests to fail as nacculator becomes able to detect more kinds of data errors.

Records with values in `fu_fadmut` = 8 (Form A3) raise KeyError: '2a'

I dug into this but can't quite figure it out. I get an error for the two records in our database that have fu_fadmut values of 8. Values of NA/null, 9 and 0 for fu_fadmut don't throw an error, only 8. (Other values for fu_fadmut are untested.) Maybe I'm barking up the wrong tree, but fu_fadmut values of 8 seems to be the only thing that's special or different about these 2 records.

Below is the output where fake IDs UM11112561 and UM11112562 result in a traceback that ends with KeyError: '2a'.

[START] ptid : UM11112561
[SKIP] Error for ptid : UM11112561
Traceback (most recent call last):
  File "/Users/ldmay/Box Sync/Documents/nacculator/nacc/redcap2nacc.py", line 167, in convert
    warnings += check_blanks(packet)
  File "/Users/ldmay/Box Sync/Documents/nacculator/nacc/redcap2nacc.py", line 38, in check_blanks
    if r(packet):
  File "/Users/ldmay/Box Sync/Documents/nacculator/nacc/uds3/blanks.py", line 98, in should_be_blank
    return packet[key] == value
  File "/Users/ldmay/Box Sync/Documents/nacculator/nacc/uds3/packet.py", line 39, in __getitem__
    raise KeyError(key)
KeyError: '2a'
[START] ptid : UM11112562
[SKIP] Error for ptid : UM11112562
Traceback (most recent call last):
  File "/Users/ldmay/Box Sync/Documents/nacculator/nacc/redcap2nacc.py", line 167, in convert
    warnings += check_blanks(packet)
  File "/Users/ldmay/Box Sync/Documents/nacculator/nacc/redcap2nacc.py", line 38, in check_blanks
    if r(packet):
  File "/Users/ldmay/Box Sync/Documents/nacculator/nacc/uds3/blanks.py", line 98, in should_be_blank
    return packet[key] == value
  File "/Users/ldmay/Box Sync/Documents/nacculator/nacc/uds3/packet.py", line 39, in __getitem__
    raise KeyError(key)
KeyError: '2a'

Thanks.

Generator tool no longer separates fields by form

tools/generator.py used to take a set of CSVs and create Python classes for each with the appropriate fields (see 3080134). This is no longer the case.

Steps to reproduce

  1. Download the IVP CSV files
$ mkdir ded_ivp
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedheader.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedA1IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedA2IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedA3IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedA4DIVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedA5IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedB1IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedB4IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedB5IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedB6IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedB7IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedB8IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedB9IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedC2IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedD1IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedD2IVP.csv
$ wget -P ./ded_ivp https://www.alz.washington.edu/NONMEMBER/UDS/DOCS/VER3/uds3dedZ1IVP.csv
  2. Run the old version:
$ git checkout 30801342e3d4ca3780769b4d2adcc36694e30a18
$ python2 tools/generator.py ./ded_ivp >ivpform-old.py
$ head -n20 ivpform-old.py
import nacc.uds3


def header_fields():
    fields = {}
    fields['PACKET'] = nacc.uds3.Field(name='PACKET', typename='Char', position=(1, 2), length=2, inclusive_range=None, allowable_values=[], blanks=[])
    fields['FORMID'] = nacc.uds3.Field(name='FORMID', typename='Char', position=(4, 6), length=3, inclusive_range=None, allowable_values=[], blanks=[])
    fields['FORMVER'] = nacc.uds3.Field(name='FORMVER', typename='Num', position=(8, 10), length=3, inclusive_range=(1, 3), allowable_values=[], blanks=[])
    fields['ADCID'] = nacc.uds3.Field(name='ADCID', typename='Num', position=(12, 13), length=2, inclusive_range=(2, 38), allowable_values=[], blanks=[])
    fields['PTID'] = nacc.uds3.Field(name='PTID', typename='Char', position=(15, 24), length=10, inclusive_range=None, allowable_values=[], blanks=[])
    fields['VISITMO'] = nacc.uds3.Field(name='VISITMO', typename='Num', position=(26, 27), length=2, inclusive_range=(1, 12), allowable_values=[], blanks=[])
    fields['VISITDAY'] = nacc.uds3.Field(name='VISITDAY', typename='Num', position=(29, 30), length=2, inclusive_range=(1, 31), allowable_values=[], blanks=[])
    fields['VISITYR'] = nacc.uds3.Field(name='VISITYR', typename='Num', position=(32, 35), length=4, inclusive_range=(2005, 2014), allowable_values=[], blanks=[])
    fields['VISITNUM'] = nacc.uds3.Field(name='VISITNUM', typename='Char', position=(37, 39), length=3, inclusive_range=None, allowable_values=[], blanks=[])
    fields['INITIALS'] = nacc.uds3.Field(name='INITIALS', typename='Char', position=(41, 43), length=3, inclusive_range=None, allowable_values=[], blanks=[])
    return fields


class FormB6(nacc.uds3.FieldBag):
    def __init__(self):
        self.fields = header_fields()
        self.fields['NOGDS'] = nacc.uds3.Field(name='NOGDS', typename='Num', position=(45, 45), length=1, inclusive_range=(0, 1), allowable_values=['9', '1', '0'], blanks=[])
        self.fields['SATIS'] = nacc.uds3.Field(name='SATIS', typename='Num', position=(47, 47), length=1, inclusive_range=(0, 1), allowable_values=['9', '1', '0'], blanks=[])
  3. Compare that output with the new version:
$ git checkout develop
$ python3 tools/generator.py /Users/taeber/code/ctsit/star/adrc-forms/csvs/dict_ivp 
import nacc.uds3


Traceback (most recent call last):
  File "tools/generator.py", line 194, in <module>
    main()
  File "tools/generator.py", line 162, in main
    header = generate_header(os.path.join(data_dict_path, header_file))
  File "tools/generator.py", line 96, in generate_header
    form = generate(ded)
  File "tools/generator.py", line 71, in generate
    field.type = record['Data type']
KeyError: 'Data type'

What happened?

The version in develop errors out and does not produce useful Python code.

What did you expect to happen?

I expected there to be no difference.

Add csv output support to NACCulator

From Ben Keller at NACC:
And, finally, we have an internal request to change nacculator so that it generates CSV. One of our developers was going to look into this, but maybe someone there could do it faster. This comes from Janene Hubbard who consults with people dealing with the fixed width fields.

NACCulator's default output is in fixed-width .txt format to account for NACC's standard submission format. They have requested that NACCulator have the option to output to .csv format instead. An example of what this output could look like is given in the NACC quarterly data freezes, where each participant has a separate row for each visit event. Right now, REDCap exports use separate columns for the variables in different visit events using prefixes (fu_ for the followup packet and tele_ for the telephone followup packet, for example). The new csv option for NACCulator would essentially remove the prefixes from the REDCap export and combine the "separate" columns for the same variable into one column. So, for example, mocacomp, fu_mocacomp, tele_mocacomp, and tip_mocacomp would all be combined under the "mocacomp" column, with each value belonging in a different row depending on visit number.
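The column merge described above could be sketched roughly as follows. This is only an illustration, not NACCulator code: the prefix list is limited to the ones named in this issue, the sample rows and event names are made up, and `strip_prefix` and `combine_columns` are hypothetical helper names.

```python
import csv
import io

# Prefixes named in this issue; a real implementation may need more.
PACKET_PREFIXES = ("fu_", "tele_", "tip_")


def strip_prefix(column):
    """Return the column name without a known packet prefix."""
    for prefix in PACKET_PREFIXES:
        if column.startswith(prefix):
            return column[len(prefix):]
    return column


def combine_columns(row):
    """Collapse the prefixed columns of one REDCap row into unprefixed ones.

    A blank value never overwrites a value already stored, so the packet
    that was actually filled in for this visit supplies the value.
    """
    combined = {}
    for column, value in row.items():
        name = strip_prefix(column)
        if value or name not in combined:
            combined[name] = value
    return combined


# Made-up sample export: one participant, two visit events.
export = io.StringIO(
    "ptid,redcap_event_name,mocacomp,fu_mocacomp,tele_mocacomp\n"
    "110001,initial_visit_arm_1,1,,\n"
    "110001,followup_visit_arm_1,,0,\n"
)
rows = [combine_columns(row) for row in csv.DictReader(export)]
```

Each entry in `rows` keeps one row per visit event with a single `mocacomp` column. If two prefixed columns are both filled in on the same row, the later one wins in this sketch, which a real implementation would probably want to report as an error.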

Nacculator needs to learn Nacc rules

Nacculator should be as correct as possible.

If there are rules that cause NACC to reject our uploads, we need to bring them into NACCulator so that we hit them and deal with them earlier, before submission.

Uploading ADRC data dictionary throws error

After vagrant up, trying to load the data dictionary for ADRC fails with the following error:

There are variables used in the branching logic that are not listed as real variables in column A. All variables used in branching logic must exist in column A. Below are the variables not found in column A.
psmse_physamb (L2308)

Incorrect unknown value '999' for Form A2 INEDUC in uds3/fvp/forms.py

I believe the unknown value for Form A2 field INEDUC should be '99' instead of '999'.

Line 62 in uds3/fvp/forms.py current:
self.fields['INEDUC'] = nacc.uds3.Field(name='INEDUC', typename='Num', position=(315, 316), length=2, inclusive_range=(0, 36), allowable_values=['999'], blanks=['Blank if Question 3 NEWINF = 0 (No)'])

Possible correction:
self.fields['INEDUC'] = nacc.uds3.Field(name='INEDUC', typename='Num', position=(315, 316), length=2, inclusive_range=(0, 36), allowable_values=['99'], blanks=['Blank if Question 3 NEWINF = 0 (No)'])
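One way to catch this class of mistake is to check that every allowable value fits in the field's declared length. The sketch below uses a hypothetical `Field` stand-in with only the attributes it needs (not the real `nacc.uds3.Field`), and `oversized_values` is an assumed helper name.

```python
from collections import namedtuple

# Hypothetical stand-in for nacc.uds3.Field, reduced to three attributes.
Field = namedtuple("Field", "name length allowable_values")


def oversized_values(field):
    """Return the allowable values that are wider than the field itself."""
    return [value for value in field.allowable_values
            if len(value) > field.length]


current = Field("INEDUC", 2, ["999"])
proposed = Field("INEDUC", 2, ["99"])

print(oversized_values(current))   # the 3-character code cannot fit in 2 columns
print(oversized_values(proposed))  # nothing to report
```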

Thanks!

Skip Z1 form if not present in REDCap project

Some ADRCs do not have the Z1 form in their REDCap projects (the form was deprecated some years ago and is not used today). Currently, nacculator needs there to be Z1 fields like "z1_form_complete" within the csv data file, even if the value is blank, in order to avoid raising a KeyError.
To solve this issue, we need to add a flag, filter, configuration option, or some internal logic so that nacculator knows to skip the Z1 form fields altogether if they are not present in the data file.

It might be a good idea to include other old forms like C1/C1S in this feature.
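A minimal sketch of the "internal logic" option: look at the CSV header once and only process optional forms whose completion field is actually there. `z1_form_complete` comes from this issue; the `c1s_form_complete` name, the sample header, and the `forms_present` helper are assumptions for illustration.

```python
import csv
import io

# Optional forms mapped to their REDCap completion field.
# "c1s_form_complete" is an assumed name, not checked against a real project.
OPTIONAL_FORMS = {"Z1": "z1_form_complete", "C1S": "c1s_form_complete"}


def forms_present(fieldnames):
    """Return the optional forms whose completion field is in the header."""
    columns = set(fieldnames or [])
    return {form for form, field in OPTIONAL_FORMS.items()
            if field in columns}


# Made-up export from a project that has no Z1 form at all.
reader = csv.DictReader(io.StringIO("ptid,a1_form_complete\n110001,2\n"))
present = forms_present(reader.fieldnames)

if "Z1" not in present:
    # Z1 fields would be skipped here instead of raising a KeyError.
    pass
```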

Deprecate C1S

This was a temporary Spanish form. Use C2.

C1S forms will have to be manually uploaded to NACC.

Add logging option to output

Currently, by default, NACCulator prints all progress and errors to stderr (the terminal). This makes the code take a long time to run.

With redcap2nacc, we append a stderr redirection (2>) to the command after the output file is specified. For example:

redcap2nacc -ivp <run_06-07-2022/final_Update.csv >run_06-07-2022/iv_nacc_complete.txt 2>run_06-07-2022/ivp_errors.txt

This writes all of that text to a log file called ivp_errors.txt.

We should add this option to the print statements of nacc/run_filters.py and nacc/uds3/filters.py so that the filters run faster.

It would probably work the same way as in redcap2nacc: every print statement takes a second argument, file=sys.stderr, and stderr is then redirected to a file with the 2>(filename) redirection when running the program. nacc/uds3/filters.py needs this capability more than nacc/run_filters.py does, but run_filters.py is the base program that calls each function in filters.py.
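As a rough sketch of that pattern, the filters could route their messages through one small helper. `report` is a hypothetical name, and the message text is made up.

```python
import sys


def report(message, stream=None):
    """Write a progress or error message to `stream` (stderr by default)."""
    # sys.stderr is looked up at call time so that redirection is respected.
    print(message, file=stream if stream is not None else sys.stderr)


report("Skipping PTID 110001: made-up example message")
```

Because the text goes to stderr, appending `2>filters_errors.txt` (any file name) to the command sends it to a file instead of the terminal.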

Make Python DB Logger an optional feature

Problem:

Python DB Logger is a private repository, which makes it not usable by people outside CTS-IT. It also requires setup of MySQL containers on the machine Nacculator is running. This is a major problem if Nacculator is being run by anyone other than the CTS-IT team.

Solution:

Make DB Logger an optional feature by allowing a flag to be passed during runtime that enables or disables Python DB Logger's functionality.

Where can the code be added?

There are several ways to implement this. One such way is as follows:

If the flag is set to enable the logs, the DB Logger instance is created as is

db_logger: DBLogger = DBLogger(
    logging_instance,
    ConnectionHelper(dot_env_file='.env').connect_to_mysql(),
    write_to_prod=True
)

But when it is disabled, we can set the instance to None.

Importing the instance will then either be a DBLogger instance or None. To ensure null safety, all DB Logger calls should be wrapped inside an if-else block that checks if the instance is None or not.

A drawback of this method is that it introduces a lot of redundant and dirty-looking if-else statements to the code. Hence, a better solution would be to incorporate enable/disable functionality in Python DB Logger itself.
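A third option, not proposed above, is a no-op stand-in object, so callers never have to check for None. The sketch below does not use the real Python DB Logger API: `NullDBLogger` and `make_db_logger` are hypothetical names, and `factory` stands for whatever code builds the real DBLogger.

```python
class NullDBLogger:
    """Accepts any logging call and silently does nothing."""

    def __getattr__(self, name):
        # Called only for attributes that do not exist, i.e. every
        # logging method; hand back a function that ignores its arguments.
        def ignore(*args, **kwargs):
            return None
        return ignore


def make_db_logger(enabled, factory):
    """Return the logger built by `factory` when enabled, else a no-op."""
    return factory() if enabled else NullDBLogger()


# With the flag off, calls succeed without touching MySQL.
# "log_info" is a made-up method name used only for this example.
db_logger = make_db_logger(False, factory=lambda: None)
db_logger.log_info("converted 12 records")
```

The trade-off is that a misspelled method name would also be swallowed silently when logging is disabled.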

Duplicate DrugIDs

When uploading subjects to NACC, the drug IDs in form A4 are duplicated for Initial Visit Packets. Follow-up packets remain untested at this moment.

Detect Z1-Z1X form usage

All visits before 1 April 2018 should have a Z1 form. All visits on or after 1 April 2018 should have a Z1X form.

Add logic to be aware of this switch and produce the required form.
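The date rule itself is small. A minimal sketch, with `Z1X_START` and `required_z1_form` as hypothetical names:

```python
from datetime import date

# Cut-over date stated in this issue.
Z1X_START = date(2018, 4, 1)


def required_z1_form(visit_date):
    """Return which checklist form a visit on `visit_date` should carry."""
    return "Z1X" if visit_date >= Z1X_START else "Z1"


print(required_z1_form(date(2018, 3, 31)))  # Z1
print(required_z1_form(date(2018, 4, 1)))   # Z1X
```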

Nacculator does not upload data to Nacc automatically

Nacculator output needs to be manually uploaded to the Nacc database.

Since there is no API for the Nacc, in order to automate this procedure, there needs to be some sort of browser scripting to upload.

Possible options include:

  • PhantomJS
  • Selenium

New Filter For 3 date fields with 88/88/8888

There is a form (I forget which right now) that has three date fields: one each for day, month, and year. Clinicians frequently put 88/88/8888 or 88/88/88 in the day field, which is incorrect. This should be split into 88 for day, 88 for month, and 8888 for year. At the same time, 88 for year should be fixed to 8888.
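A possible shape for such a filter, assuming the three values arrive as strings. The function name is hypothetical, and the sketch only splits values made entirely of 8s, so a real date typed into the day field is left alone for a person to review.

```python
def split_unknown_date(day, month, year):
    """Normalize an 'unknown' date that was typed into the day field."""
    parts = day.split("/")
    # Only split when all three parts are runs of 8s (88/88/8888, 88/88/88).
    if len(parts) == 3 and all(part and set(part) == {"8"} for part in parts):
        day, month, year = parts
    # A two-digit unknown year becomes the four-digit unknown year.
    if year == "88":
        year = "8888"
    return day, month, year


print(split_unknown_date("88/88/88", "", ""))  # ('88', '88', '8888')
```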

TFP B6 form does not set a "blank" NOGDS field to 0

In the IVP and FVP modules, the REDCap forms have a different build for B6 than the TFP module.
The IVP and FVP have the first question (Able to complete / Not able to complete) as a yes/no radio button field with possible values of 0 (Able to complete) or 1 (Not able to complete). However, the TFP has that question as a single radio button field with a possible value of 1 (Not able to complete) and is left blank if the participant was able to complete the form.
The issue is that NACC still expects to see a value of 0 or 1 for the NOGDS field, even though the 0 option is not present on the REDCap form.
The TFP module needs an addition to the "set_to_zero_if_blank" function so that NACC does not interpret that field as missing a value. In the meantime, it must be entered manually within NACC's Working Database.

This issue will be fixed with the upcoming TIP module feature branch.
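For reference, the behavior being asked for amounts to something like the sketch below. It is not the existing "set_to_zero_if_blank" function, whose signature is not shown here; `zero_if_blank` and the `tele_nogds` column name are assumptions.

```python
def zero_if_blank(record, *field_names):
    """Set each named field to '0' when it is missing or blank."""
    for name in field_names:
        if (record.get(name) or "").strip() == "":
            record[name] = "0"
    return record


# Made-up TFP row where the participant was able to complete the form,
# so the single radio button was left unchecked.
row = zero_if_blank({"ptid": "110001", "tele_nogds": ""}, "tele_nogds")
print(row["tele_nogds"])  # 0
```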

Same csv header value may corrupt the output.

We parse the input csv into a Python dictionary. If two headers have the same name, the value under the second header overwrites the value under the first, which results in an error.

For example, C2 and C1 have a field called fu_npsylan. If there is data in both these forms, then nacculator may not work as expected.
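A check like the following could run before the conversion starts and flag the problem early. It reads only the header row with the standard csv module; `duplicate_headers` is a hypothetical helper name and the sample data is made up.

```python
import csv
import io
from collections import Counter


def duplicate_headers(csv_file):
    """Return the header names that appear more than once in the first row."""
    header = next(csv.reader(csv_file), [])
    counts = Counter(header)
    return sorted(name for name, count in counts.items() if count > 1)


sample = io.StringIO("ptid,fu_npsylan,fu_npsylan\n110001,1,2\n")
print(duplicate_headers(sample))  # ['fu_npsylan']
```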

Vagrant project config != prod ?

We need to make sure that the project in the Vagrant environment matches what is on prod, and then that stage matches prod, since prod is the known-good configuration.

Incorporate `python_db_logger`

Add python_db_logger to Nacculator:

  • Add an example.env with the necessary DB_* environment variables
  • Add a local mysql container definition. Copy the folder structure (.containers/mysql) and docker-compose.yml from python_db_logger
  • Update README with instructions on how to start the docker container
  • Add python_db_logger as a dependency in setup.py:
install_requires=[
      "python_db_logger @ git+ssh://git@github.com:/ctsit/python_db_logger.git"
  ]
