A computational platform applied to a large-scale M. tuberculosis antimicrobial resistance (AMR) dataset, as described in:
ES. Kavvas, E. Catoiu, N. Mih, JT. Yurkovich, Y. Seif, N. Dillon, D. Heckmann, A. Anand, L. Yang, V. Nizet, JM. Monk, BO. Palsson. Machine learning and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic signatures of antibiotic resistance, Nature Communications, (2018) 9:4306
Installation
git clone https://github.com/erolkavvas/microbial_AMR_ML.git
01_pairwise_tests.ipynb
- Determines pairwise associations between pan-genome alleles and labeled phenotypes.
- Generates Supplementary Data File 1.
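As a rough illustration of what a pairwise allele-phenotype association looks like, the sketch below computes a plug-in mutual information estimate between a binary allele column and R/S labels, then ranks alleles by it. It is not the notebook's code: the function name `mutual_information`, the toy values, and the choice of a plain discrete estimator are assumptions made for this example.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (allele, phenotype) pair
    px = Counter(xs)               # marginal counts for the allele
    py = Counter(ys)               # marginal counts for the phenotype
    mi = 0.0
    for (x, y), c in joint.items():
        # (c/n) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi

# Toy data (made up for illustration): allele presence/absence per genome.
alleles = {
    "Cluster0_16": [1, 1, 1, 0, 0, 0],
    "Cluster0_17": [1, 0, 1, 0, 1, 0],
}
phenotype = ["R", "R", "R", "S", "S", "S"]

# Rank alleles from most to least associated with the phenotype.
ranking = sorted(
    ((name, mutual_information(col, phenotype)) for name, col in alleles.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

In this toy case `Cluster0_16` tracks the phenotype exactly, so it ranks first. Any real analysis would also need a significance test and multiple-testing correction.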
02_ML_ensemble_SVM.ipynb
- Performs machine learning (ensemble support vector machine) for selecting groups of alleles that are predictive of the labeled phenotypes.
- Generates Supplementary Data File 2, Supplementary Data File 3, and svm_ensemble_data.
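The general idea of an SVM ensemble for allele selection can be sketched as follows: train many linear SVMs on random subsets of genomes and average the absolute weight each allele receives. This is a minimal, dependency-free sketch and not the notebook's implementation. The training routine (full-batch subgradient descent on the L2-regularized hinge loss), the 80% subsampling fraction, the number of models, and the allele names are all assumptions chosen for illustration.

```python
import random

def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Full-batch subgradient descent on the L2-regularized hinge loss.
    Labels in y must be +1 (resistant) or -1 (susceptible)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw = [lam * wj for wj in w]   # gradient of the regularization term
        gb = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:            # only margin violators contribute
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def ensemble_allele_weights(X, y, n_models=20, frac=0.8, seed=0):
    """Average |weight| per allele over SVMs trained on random genome subsets."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    totals = [0.0] * d
    for _ in range(n_models):
        idx = rng.sample(range(n), int(frac * n))
        w, _ = train_linear_svm([X[i] for i in idx], [y[i] for i in idx])
        for j in range(d):
            totals[j] += abs(w[j])
    return [t / n_models for t in totals]

# Hypothetical alleles: A tracks resistance, B is present in every genome,
# C is spread roughly evenly across both classes.
alleles = ["allele_A", "allele_B", "allele_C"]
X = [[1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 1, 1], [1, 1, 0],
     [0, 1, 1], [0, 1, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
y = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]

weights = ensemble_allele_weights(X, y)
ranked = sorted(zip(alleles, weights), key=lambda t: t[1], reverse=True)
```

Averaging over resampled models makes the ranking less sensitive to any single training subset, which is the usual motivation for an ensemble here.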
03_epistatic_analysis.ipynb
- Uses the data generated by 02_ML_ensemble_SVM.ipynb to select an initial set of gene-gene pairs, and then fits a logistic regression model to each pair to identify statistically significant genetic interactions.
- Generates cooccurence_table_excel, cooccurence_table_figures, and Supplementary Data File 4.
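One common way to test a gene-gene interaction with logistic regression is to compare an additive model (a + b) against a model that adds the product term (a + b + a*b) using a likelihood-ratio test. The sketch below does this in plain Python with gradient ascent. It is an assumption-laden illustration, not the notebook's code: the function names, the optimizer settings, the choice of a likelihood-ratio test, and the cell counts are all made up for the example.

```python
from math import exp, log, erfc, sqrt

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(X, y, iters=2000, lr=1.0):
    """Gradient ascent on the mean log-likelihood; returns (coefficients, log-likelihood)."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for j in range(d):
                grad[j] += err * xi[j] / n
        beta = [b + lr * g for b, g in zip(beta, grad)]
    ll = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * x for b, x in zip(beta, xi)))
        ll += log(p) if yi else log(1.0 - p)
    return beta, ll

def interaction_test(a, b, y):
    """Likelihood-ratio test of the a*b term (chi-square, 1 degree of freedom).
    Returns (interaction coefficient, p-value)."""
    additive = [[1, ai, bi] for ai, bi in zip(a, b)]
    full = [[1, ai, bi, ai * bi] for ai, bi in zip(a, b)]
    _, ll_add = fit_logistic(additive, y)
    beta, ll_full = fit_logistic(full, y)
    stat = max(0.0, 2.0 * (ll_full - ll_add))
    # Survival function of a chi-square with 1 d.o.f. is erfc(sqrt(stat / 2)).
    return beta[3], erfc(sqrt(stat / 2.0))

def expand(counts):
    """Turn {(a, b): (n_resistant, n_total)} into per-genome vectors a, b, y."""
    a, b, y = [], [], []
    for (ai, bi), (n_res, n_tot) in counts.items():
        for k in range(n_tot):
            a.append(ai)
            b.append(bi)
            y.append(1 if k < n_res else 0)
    return a, b, y

# Made-up counts: resistance is common only when both alleles are present.
a, b, y = expand({(0, 0): (2, 20), (1, 0): (2, 20), (0, 1): (2, 20), (1, 1): (18, 20)})
coef, p_value = interaction_test(a, b, y)
```

With perfectly separable data the coefficients would diverge, so a production analysis would normally rely on an established fitting routine and on multiple-testing correction across gene pairs.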
The following dataframes are required inputs for the computational platform.
cluster_info.csv
| | clust_to_rv | gene_name | ortho | cog | product | refseq | count | score | name_to_rv | pan |
---|---|---|---|---|---|---|---|---|---|---|
Cluster 0 | Rv2048c | pks12 | 653045.Strvi_4160 | Q | Polyketide synthase | AN47_01827 | 1590 | 7958.6 | 0 | Core |
Cluster 1 | Rv3344c | PE_PGRS49 | 0 | 0 | PE-PGRS family protein | X171_03503 | 794 | 0.0 | 0 | Acces |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
pangen_allele_df.csv
Genome ID | ... | Cluster0_16 | Cluster0_17 | ... |
---|---|---|---|---|
1010834_3 | ... | 1 | ... | |
1010835_3 | ... | 1 | ... | |
1010836_3 | ... | 1 | ... | |
... | ... | ... | ... | ... |
pangen_cluster_df.csv
Genome ID | Cluster 0 | Cluster 1 | Cluster 2 | ... |
---|---|---|---|---|
1438838_3 | 1 | 1 | 0 | ... |
1408941_4 | 1 | 1 | 0 | ... |
1422035_3 | 1 | 0 | 0 | ... |
... | ... | ... | ... | ... |
resistance_data.csv
genome_id | isoniazid | rifampicin | ethambutol | ... |
---|---|---|---|---|
1295764_3 | R | R | R | ... |
1423468_3 | R | R | S | ... |
... | ... | ... | ... | ... |
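To show how these inputs fit together, here is a minimal sketch that reads a resistance table and a presence/absence matrix, keeps the genomes found in both, and returns a feature matrix with matching labels for one drug. It assumes comma-separated files whose ID columns are named `genome_id` and `Genome ID` as in the tables above. The helper names, the treatment of blank cells as 0, and the cluster values in the inline example are assumptions for illustration.

```python
import csv
import io

def load_labels(handle, drug):
    """Map genome_id -> 1 (R) or 0 (S) for one drug column; other values are skipped."""
    labels = {}
    for row in csv.DictReader(handle):
        call = (row.get(drug) or "").strip()
        if call in ("R", "S"):
            labels[row["genome_id"]] = 1 if call == "R" else 0
    return labels

def load_matrix(handle, id_column="Genome ID"):
    """Map genome ID -> {column: 0/1}. Blank cells are read as 0 (an assumption)."""
    matrix = {}
    for row in csv.DictReader(handle):
        gid = row.pop(id_column)
        matrix[gid] = {
            col: int(float(val)) if (val or "").strip() else 0
            for col, val in row.items()
        }
    return matrix

def align(matrix, labels):
    """Keep genomes present in both inputs; return ids, column names, X, y."""
    ids = sorted(set(matrix) & set(labels))
    columns = sorted({c for g in ids for c in matrix[g]})
    X = [[matrix[g][c] for c in columns] for g in ids]
    y = [labels[g] for g in ids]
    return ids, columns, X, y

# Tiny inline stand-ins for resistance_data.csv and pangen_cluster_df.csv.
resistance_csv = io.StringIO(
    "genome_id,isoniazid,rifampicin,ethambutol\n"
    "1295764_3,R,R,R\n"
    "1423468_3,R,R,S\n"
)
cluster_csv = io.StringIO(
    "Genome ID,Cluster 0,Cluster 1\n"
    "1295764_3,1,1\n"
    "1423468_3,1,0\n"
    "1438838_3,1,1\n"   # no phenotype row, so it is dropped by align()
)

labels = load_labels(resistance_csv, "ethambutol")
ids, columns, X, y = align(load_matrix(cluster_csv), labels)
```

With real files, the `io.StringIO` objects would be replaced by `open("resistance_data.csv", newline="")` and the corresponding matrix file.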
- patsy [https://patsy.readthedocs.io/en/latest/] - for logistic regression modeling.
- entropy_estimators.py [https://github.com/gregversteeg/NPEET] - for pairwise association analysis.