This project forked from mlforhealth/mimic_extract

MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III

License: MIT License

About

This repo contains code for MIMIC-Extract. It has been divided into the following folders:

  • Data: holds the data to be extracted, stored locally.
  • Notebooks: Jupyter Notebooks demonstrating test cases and usage of output data in risk and intervention prediction tasks.
  • Resources: reference files used during extraction, along with the expected schema of the output tables:
    • Rohit_itemid.txt: describes how MIMIC-III item ids correspond to those of MIMIC II, as used by Rohit.
    • itemid_to_variable_map.csv: the main file used in data extraction. It contains the groupings of item ids and indicates which item ids are ready to extract.
    • variable_ranges.csv: describes the normal ranges of the variable levels, which help in extracting valid data.
  • Utils: scripts and detailed instructions for running MIMIC-Extract data pipeline.
  • mimic_direct_extract.py: extraction script.

Paper

If you use this code in your research, please cite the following publication:

Shirly Wang, Matthew B. A. McDermott, Geeticka Chauhan, Michael C. Hughes, Tristan Naumann, 
and Marzyeh Ghassemi. MIMIC-Extract: A Data Extraction, Preprocessing, and Representation 
Pipeline for MIMIC-III. arXiv:1907.08322. 

Step-by-step Instructions

Step 0: Required software and prereqs

Your local system should have the following executables on the PATH:

  • conda
  • psql (PostgreSQL 9.4 or higher)
  • git

You also need a local MIMIC-III PostgreSQL relational database (refer to the MIT-LCP repo).
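
Before continuing, it can help to confirm that the executables from Step 0 are reachable. The following is a minimal sketch using only the Python standard library; the helper name missing_executables is illustrative and is not part of MIMIC-Extract. It checks the PATH only, so it says nothing about the PostgreSQL version or whether the MIMIC-III database has been built.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Executables listed in Step 0.
REQUIRED = ("conda", "psql", "git")


def missing_executables(
    names: Iterable[str] = REQUIRED,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names that cannot be found on the PATH.

    `which` is injectable so the check can be exercised without
    depending on what is installed on the current machine.
    """
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required executables were found.")
```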

All instructions below should be executed from a terminal, with the current directory set to utils/:

cd utils/

Step 1: Setup env vars for current local system

Edit setup_user_env.sh so that all paths point to valid locations on the local file system, then source it to export those variables:

source ./setup_user_env.sh
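
To confirm that the variables were actually exported, a small check along the following lines may be useful. This is a sketch under assumptions: MIMIC_EXTRACT_OUTPUT_DIR is the only variable name taken from this README (it appears in Step 5), and the helper unset_variables is hypothetical. Extend EXPECTED_VARS with whatever your copy of setup_user_env.sh exports, and run the check from the same shell in which the script was sourced.

```python
import os
from typing import Iterable, List, Mapping

# MIMIC_EXTRACT_OUTPUT_DIR is named in Step 5 of this README. Any other
# entries you add here are specific to your setup_user_env.sh.
EXPECTED_VARS = ("MIMIC_EXTRACT_OUTPUT_DIR",)


def unset_variables(
    names: Iterable[str] = EXPECTED_VARS,
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the variable names that are missing or empty in `environ`."""
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    problems = unset_variables()
    if problems:
        print("Not set: " + ", ".join(problems))
    else:
        print("All expected variables are set.")
```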

Step 2: Create conda environment

Next, make a new conda environment from mimic_extract_env.yml and activate that environment.

conda env create --force -f ../mimic_extract_env.yml
conda activate mimic_data_extraction

Expected Outcome

The desired environment will be created and activated.

Expected Resources

Typically takes less than 5 minutes and requires a good internet connection.

Step 3: Build Views for Feature Extraction

This step generates materialized views in the MIMIC PostgreSQL database. These include all concept tables in the MIT-LCP repo, as well as tables for extracting non-mechanical ventilation and injections of crystalloid bolus and colloid bolus.

make build_concepts

Step 4: Set Cohort Selection and Extraction Criteria

Parameters for the extraction code are specified in build_curated_from_psql.sh. The minimum admission age for cohort selection is set through min_age; the minimum and maximum length of ICU stay, in hours, are set through min_duration and max_duration. Only vitals and labs with more than min_percent percent non-missing values are extracted, and extracted vitals and labs are clinically aggregated unless group_by_level2 is explicitly set. Outlier correction is applied unless var_limit is set to 0.
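
To make the effect of these parameters concrete, here is a minimal sketch of the filtering they describe. It is not the code in mimic_direct_extract.py: the Stay record, both function names, the inclusive bounds on age and duration, and the way non-missingness is computed are all assumptions made for illustration. The threshold values in the usage note are illustrative as well, not the pipeline's defaults.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Stay:
    """Hypothetical record for one ICU stay."""
    icustay_id: int
    age: float        # age at admission, in years
    los_hours: float  # length of ICU stay, in hours


def select_cohort(
    stays: List[Stay], min_age: float, min_duration: float, max_duration: float
) -> List[Stay]:
    """Keep stays that satisfy the age and length-of-stay criteria.

    Bounds are treated as inclusive here, which is an assumption.
    """
    return [
        s for s in stays
        if s.age >= min_age and min_duration <= s.los_hours <= max_duration
    ]


def keep_variables(
    values_by_variable: Dict[str, List[Optional[float]]], min_percent: float
) -> List[str]:
    """Keep variables whose share of non-missing values exceeds min_percent.

    None marks a missing value. Variables with no observations are dropped.
    """
    kept = []
    for name, values in values_by_variable.items():
        if not values:
            continue
        non_missing = sum(v is not None for v in values)
        if 100.0 * non_missing / len(values) > min_percent:
            kept.append(name)
    return kept
```

For example, with min_age=15, min_duration=12 and max_duration=240, a 65-year-old patient with a 48-hour stay is kept, while a 40-year-old with an 8-hour stay is dropped.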

Step 5: Build Curated Dataset from PSQL

make build_curated_from_psql

Expected Outcome

The default setting will create an HDF5 file inside MIMIC_EXTRACT_OUTPUT_DIR with four tables:

  • patients: static demographics, static outcomes
    • One row per (subj_id,hadm_id,icustay_id)
  • vitals_labs: time-varying vitals and labs (hourly mean, count and standard deviation)
    • One row per (subj_id,hadm_id,icustay_id,hours_in)
  • vitals_labs_mean: time-varying vitals and labs (hourly mean only)
    • One row per (subj_id,hadm_id,icustay_id,hours_in)
  • interventions: hourly binary indicators for administered interventions
    • One row per (subj_id,hadm_id,icustay_id,hours_in)
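
The hourly mean, count and standard deviation stored in vitals_labs can be illustrated for a single variable of a single ICU stay with the sketch below. It is not taken from the extraction script: the function name hourly_stats is hypothetical, and both the bucketing rule (the floor of hours since ICU admission gives hours_in) and the use of the sample standard deviation are assumptions.

```python
import math
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, Tuple


def hourly_stats(measurements: Iterable[Tuple[float, float]]) -> Dict[int, dict]:
    """Aggregate (hours_since_admission, value) pairs into hourly buckets.

    Returns a mapping from hours_in to the mean, count and sample standard
    deviation of the values in that hour. The standard deviation is None
    when an hour holds a single value, since it is undefined in that case.
    """
    buckets = defaultdict(list)
    for hours, value in measurements:
        buckets[math.floor(hours)].append(value)
    return {
        hour: {
            "mean": mean(values),
            "count": len(values),
            "std": stdev(values) if len(values) > 1 else None,
        }
        for hour, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    # Three heart-rate readings: two in the first hour, one in the second.
    for hour, stats in hourly_stats([(0.2, 70.0), (0.8, 74.0), (1.5, 80.0)]).items():
        print(hour, stats)
```

In the vitals_labs_mean table only the mean would be kept for each hour.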

Expected Resources

Expect this step to take 5-10 hours. It requires a machine with at least 50 GB of RAM.

Setting the population size

By default, this step builds a dataset with all eligible patients. Sometimes we want to run with only a small subset of patients, for example when debugging.

To do this, set the POP_SIZE environment variable. For example, to build a curated dataset with only the first 1000 patients, we could do:

POP_SIZE=1000 make build_curated_from_psql
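
When the build is launched from a script rather than typed into a shell, the same variable can be passed through the process environment. The sketch below assumes it is run from the repository root after setup_user_env.sh has been sourced in the calling shell; the helpers build_env and run_build are hypothetical and not part of MIMIC-Extract.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional


def build_env(
    pop_size: Optional[int], base: Mapping[str, str] = os.environ
) -> Dict[str, str]:
    """Copy `base` and set POP_SIZE when a subset size is requested.

    Passing None leaves POP_SIZE untouched, so all eligible patients
    are used unless the variable is already set in `base`.
    """
    env = dict(base)
    if pop_size is not None:
        if pop_size <= 0:
            raise ValueError("pop_size must be a positive integer")
        env["POP_SIZE"] = str(pop_size)
    return env


def run_build(pop_size: Optional[int] = None, utils_dir: str = "utils") -> None:
    """Run `make build_curated_from_psql` from utils/ with the chosen size."""
    subprocess.run(
        ["make", "build_curated_from_psql"],
        cwd=utils_dir,
        env=build_env(pop_size),
        check=True,  # raise CalledProcessError if make fails
    )
```

Calling run_build(pop_size=1000) would then behave like setting POP_SIZE to 1000 on the command line.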
