Promoter-analysis: revealing the potential role of transcription factors in a given physiological process using RNA-Seq data
These scripts were used for the research published at https://www.mdpi.com/2223-7747/9/9/1176/htm to:
- Search for cis-regulatory elements (CREs) within gene promoter regions using position weight matrices (PWMs) obtained from PlantPAN3.0 (http://plantpan.itps.ncku.edu.tw/index.html) and the MAST tool from the MEME Suite (http://meme-suite.org/). The scripts can be adapted to work with other databases and motif search tools.
- Combine CRE search results with differential gene expression (DGE) analysis to predict potential master regulators among plant transcription factors (TFs).
- Infer TF families potentially responsible for the differential regulation of genes within multigene families in which both up- and downregulated genes are well represented.
- Multi-core CPU (for parallel computations)
- Linux OS is recommended (tested on Ubuntu 14.04)
- The MEME Suite must be installed on your system (http://meme-suite.org/)
- RStudio (required for the automatic 'setwd()' call in the scripts)
- R packages: data.table, ggplot2, ggpubr, grid, gridExtra, reshape2, XML
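The requirements above can be checked before starting with a short pre-flight script. This is only a sketch and is not part of the repository: the helper name 'have_cmd' is hypothetical, and it assumes that the MEME Suite exposes its motif scanner as a 'mast' command and that R provides 'Rscript' on your PATH.

```shell
# Pre-flight check for the requirements listed above (illustrative sketch).

# Succeed if the given command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Report the command-line tools; 'mast' is assumed to be the MEME Suite binary.
for tool in mast Rscript; do
  if have_cmd "$tool"; then
    echo "ok:      $tool"
  else
    echo "MISSING: $tool"
  fi
done

# Check each required R package, but only when Rscript is available.
if have_cmd Rscript; then
  for pkg in data.table ggplot2 ggpubr grid gridExtra reshape2 XML; do
    if Rscript -e "if (!requireNamespace('$pkg', quietly = TRUE)) quit(status = 1)" >/dev/null 2>&1; then
      echo "ok:      R package $pkg"
    else
      echo "MISSING: R package $pkg"
    fi
  done
fi

# Number of CPU cores available for the parallel MAST runs.
echo "CPU cores: $(getconf _NPROCESSORS_ONLN)"
```

The script only reports what it finds and installs nothing. Whether RStudio itself is installed is not checked here.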
- Create an empty folder on your machine (name it 'Promoter-analysis' or whatever you like).
- Download 'Run_MAST', 'MAST_XML_parser', 'TF_family_regulons_correlation_analysis', and 'TF_regulons_enrichment_analysis' folders from this repository and put them into the folder you have created.
- Download the ID mapping file (all plants) from PlantPAN3.0. Put the file 'ID_mapping_all_plant.txt' into the folder you have created. Alternative direct link to ID_mapping_all_plant.txt (2.8 MB)
- Use RimGubaev's script to extract the promoters of your species' genes. Put the output file 'Promoters.fa' into the 'Run_MAST' directory. Example output: Promoters.fa (56.3 MB)
- Download PlantPAN_TF_annotation_filtered.tsv and put it into the 'MAST_XML_parser' folder.
- Download PlantPAN_meme_motifs (959 KB) and put it into the 'Run_MAST' folder.
- Download PlantPAN_TF_annotation_filtered.tsv and put it into the 'Run_MAST' folder.
- In a bash shell, change the current directory to 'Run_MAST' ($ cd full_path_to_the_folder_created_in_step_1/Run_MAST).
- Run 'run_MAST_parallel.sh' ($ bash run_MAST_parallel.bash). The output folder ('MAST_output') will appear in the current directory. Example output: MAST_output (1.34 GB).
- Open 'MAST_XML_parser.R' (located in 'MAST_XML_parser') in RStudio and run the script. The output file ('mast_output_full.tsv') will appear in the 'MAST_XML_parser' directory. Example output: mast_output_full.tsv (81.4 MB).
- Open 'Annotate_MAST_output_full.R' (located in 'MAST_XML_parser') in RStudio and run the script. The output file ('tf_analysis_input_annotated.tsv') will appear in the 'MAST_XML_parser' directory. Example output: tf_analysis_input_annotated.tsv (104.1 MB).
- Master-regulator prediction: put the table containing differential gene expression (DGE) data into the folder you created in step 1. NB: this table must contain the following columns: GeneID (text or numeric) and log2FC (numeric), as shown below:
GeneID | log2FC |
---|---|
107809780 | 4.838 |
107760295 | -1.706 |
Example: gene expression table 1 (2 MB). To use this example, rename the file to 'Expression_table.tsv'. Then open 'TF_regulons_enrichment_analysis.R' (located in 'TF_regulons_enrichment_analysis') in RStudio and run the script. The output file ('DEG_enriched_regulons.tsv') will appear in the 'TF_regulons_enrichment_analysis' directory. Example output: DEG_enriched_regulons.tsv (174 B).
- Prediction of TF families responsible for the regulation of a certain group of genes: put the table containing differential gene expression (DGE) data into the folder you created in step 1. NB: the differentially expressed genes you are interested in must be categorized in some way, for example by GO terms or manually assigned groups:
GeneID | Gene_group | log2FC |
---|---|---|
107791722 | Zinc finger proteins | -9.473 |
107814985 | Aquaporins | 5.173 |
Example: gene expression table 2 (categorized) (2 MB). To use this example, rename the file to 'Expression_table.tsv'.
Then open 'TF_family_regulons_correlation_analysis.R' (located in 'TF_family_regulons_correlation_analysis') in RStudio and run the script. The output png files will appear in the 'Significant' and 'Non_significant' directories.
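Both analyses depend on the column names of 'Expression_table.tsv', so it can help to validate the table before opening the R scripts. The following is a minimal sketch, not part of the repository. The function name 'check_expression_table' is hypothetical, and it assumes the file is tab-separated with a header row and that the column names contain no spaces.

```shell
# Sketch: validate an expression table before running the R scripts.

# check_expression_table FILE COLUMN...
# Succeeds if the tab-separated FILE has every named column in its header
# and every value in the log2FC column is numeric.
check_expression_table() {
  file=$1
  shift
  awk -F'\t' -v required="$*" '
    { sub(/\r$/, "") }                       # tolerate Windows line endings
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i  # map column name -> position
      n = split(required, need, " ")
      for (j = 1; j <= n; j++)
        if (!(need[j] in col)) { print "missing column: " need[j]; bad = 1 }
      if (bad) exit 1
      next
    }
    ("log2FC" in col) && $(col["log2FC"]) !~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/ {
      print "line " NR ": non-numeric log2FC: " $(col["log2FC"])
      bad = 1
    }
    END { exit bad + 0 }
  ' "$file"
}

# Demo on the two categorized rows shown above.
tmp=$(mktemp)
printf 'GeneID\tGene_group\tlog2FC\n' > "$tmp"
printf '107791722\tZinc finger proteins\t-9.473\n' >> "$tmp"
printf '107814985\tAquaporins\t5.173\n' >> "$tmp"
if check_expression_table "$tmp" GeneID Gene_group log2FC; then
  echo "table looks valid"
fi
rm -f "$tmp"
```

For the master-regulator analysis, pass only 'GeneID log2FC' as the required columns. For the TF family analysis, also pass 'Gene_group'.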
If you're interested in how PlantPAN_TF_annotation_filtered.tsv and chunked PlantPAN_meme_motifs were produced, you may perform the following steps:
- Create an empty folder on your machine (name it 'Promoter-analysis' or whatever you like).
- Download all the folders from this repository into the folder you have created.
- Download the PWMs of TF binding sites (all plants) from PlantPAN3.0. Put the file 'Transcription_factor_weight_matrix.txt' into the folder you have created. Alternative direct link to Transcription_factor_weight_matrix.txt (1.1 MB)
- Download the ID mapping file (all plants) from PlantPAN3.0. Put the file 'ID_mapping_all_plant.txt' into the folder you have created. Alternative direct link to ID_mapping_all_plant.txt (2.8 MB)
- Open 'PlantPAN_annotation_download.R' (located in 'PlantPAN_annotation_download') in RStudio and run the script. The output file ('PlantPAN_TF_annotation_filtered.tsv') will appear in the 'Output' directory. You may download this file here: PlantPAN_TF_annotation_filtered.tsv.
- Open 'Preparing_MAST_input.R' (located in 'Preparing_MAST_input') in RStudio and run the script. The output (the 'PlantPAN_meme_motifs' folder) will appear in the 'Preparing_MAST_input' directory. You may download it here: PlantPAN_meme_motifs (959 KB).
- Move 'PlantPAN_meme_motifs' into the 'Run_MAST' folder.
- To perform further analysis, go to step 8 of the previous section.
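Before returning to the MAST step, it may be worth confirming that every input file sits where the scripts expect it. The sketch below is not part of the repository, and the function name 'check_layout' is hypothetical. The paths it checks are the ones named in the setup steps of the previous section.

```shell
# Sketch: confirm that the inputs from the setup steps are in place.

# check_layout ROOT
# Lists every expected input missing under ROOT (the folder created in step 1)
# and fails if at least one is absent.
check_layout() {
  root=$1
  missing=0
  for path in \
    "ID_mapping_all_plant.txt" \
    "Run_MAST/Promoters.fa" \
    "Run_MAST/PlantPAN_meme_motifs" \
    "Run_MAST/PlantPAN_TF_annotation_filtered.tsv" \
    "MAST_XML_parser/PlantPAN_TF_annotation_filtered.tsv"
  do
    if [ ! -e "$root/$path" ]; then
      echo "missing: $root/$path"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: pass the folder created in step 1 (defaults to the current directory).
if check_layout "${1:-.}"; then
  echo "all expected inputs are in place"
fi
```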