
NITRATE-ENRICH-SHORT-READ

The steps taken to analyze the short reads generated by JGI for the PIE LTER fertilization experiment.

A few words about things not detailed in this repository

First we downloaded the data from JGI using GLOBUS, which is the safest and best way to download the data. Then we began working on annotation of the short reads, due to the poor assembly and binning achieved for these data. We are interested in generating the coverage information for specific genes related to biogeochemical cycling that are contained in the assembly of each metagenomic dataset. This is a little tricky because, by collecting the functional genes from each of the individual assemblies, we will undoubtedly end up with many duplicate sequences; I think this is OK for now. During the creation of this repository, I am working on the data here:

/scratch/vineis.j/NITROGEN_ENRICH/GLOBUS-DOWNLOAD

However, after a time, I may move the data to more permanent storage here:

/work/jennifer.bowen/

Generate a contigs database for each of the assemblies in the dataset using anvi'o. The steps below will also run all the marker gene searches for key functional genes from fungene and a custom set of CO2-fixation genes.

The assemblies are usually found within each sample directory, in a path something like this:

/scratch/vineis.j/NITROGEN_ENRICH/GLOBUS-DOWNLOAD/s_10CB15_MG/QC_and_Genome_Assembly/assembly.contigs.fasta

To generate an anvi'o database, I use a SLURM script like the one below. Pay close attention to the paths; you will need to adjust them for your system.

#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=10
#SBATCH --mem=100Gb
#SBATCH --time=07:00:00
#SBATCH --partition=short
#SBATCH --array=1-62

# Pull this array task's sample name from sample_names.txt (one name per line).
SAMPLE=$(sed -n "$SLURM_ARRAY_TASK_ID"p sample_names.txt)
# Build a contigs database from this sample's JGI assembly, then run the default
# anvi'o HMMs plus the custom CO2-fixation, fungene, and CAZy HMM collections.
# Note: -T 30 asks for more threads than the 10 tasks allocated above; consider matching these.
anvi-gen-contigs-database -f s_${SAMPLE}/QC_and_Genome_Assembly/assembly.contigs.fasta -o x_ANVIO-assembly-dbs/s_${SAMPLE}.db
anvi-run-hmms -c x_ANVIO-assembly-dbs/s_${SAMPLE}.db -T 30
anvi-run-hmms -c x_ANVIO-assembly-dbs/s_${SAMPLE}.db -H ~/scripts/databas/HMM_co2fix/ -T 30
anvi-run-hmms -c x_ANVIO-assembly-dbs/s_${SAMPLE}.db -H /work/jennifer.bowen/DBs/all_fungene_anvio/ -T 30
anvi-run-hmms -c x_ANVIO-assembly-dbs/s_${SAMPLE}.db -H /work/jennifer.bowen/DBs/anvio_cazy/ -T 30

To export a FASTA file of each of the functional genes, you can use the steps below. You will need an "x_gene-names.txt" file that lists all of the genes you want to export from the anvi'o databases; an example of that file is contained in this repository. You will also need an "x-external-genomes.txt" file, which is also contained in this repository. If you are an experienced anvi'o user, making these files should be a piece of cake; if you are not, you should learn more about anvi'o from their amazing tutorials. Here is the script to export the genes.

#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=10
#SBATCH --mem=100Gb
#SBATCH --time=07:00:00
#SBATCH --partition=short
#SBATCH --array=1-92

# Pull this array task's gene name from x_gene-names.txt (92 genes, one per line).
SAMPLE=$(sed -n "$SLURM_ARRAY_TASK_ID"p x_gene-names.txt)
# Export the amino acid sequences for every HMM hit to that gene across all databases.
anvi-get-sequences-for-hmm-hits -e x-external-genomes.txt --get-aa-sequences -o x_${SAMPLE}-sequences.faa --gene-names ${SAMPLE}

anvi'o will leave a space in the exported sequence names, and this needs to be fixed. I use the following to fix it:

for i in *.faa; do sed -i 's/ bin/_bin/g' $i; done

Now for the mapping of each sample to each assembly. This is a monster task, and you cannot achieve it on a laptop (at least I don't think so). You will need BBMap for this task, and you will need to run it separately for each of the samples so that you don't steal all the resources. This is how I run it for one of the samples.

#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=10
#SBATCH --mem=200Gb
#SBATCH --time=24:00:00
#SBATCH --partition=short
#SBATCH --array=1-62

# Pull this array task's sample name from sample_names.txt.
SAMPLE=$(sed -n "$SLURM_ARRAY_TASK_ID"p sample_names.txt)

# Map every sample's filtered reads against the s_10CB15_MT assembly.
ref_file=s_10CB15_MT/QC_and_Genome_Assembly/assembly.contigs.fasta
fastq_file=$( echo "s_${SAMPLE}/Filtered_Raw_Data/"*"fastq.gz")
output_file=MAPPING/s_10CB15_MT-vs-${SAMPLE}.bam
covstats_file=MAPPING/s_10CB15_MT-vs-${SAMPLE}-covstats.txt

# nodisk=true keeps the index in memory; ambiguous=random assigns multi-mapping reads at random.
bbmap.sh threads=4 nodisk=true interleaved=true ambiguous=random in=${fastq_file} ref=${ref_file} out=${output_file} covstats=${covstats_file} bamscript=to_bam.sh

Then you can use the resulting "covstats.txt" files to look at the coverage of each gene derived from a single sample across all samples. Friggen awesome, right?! Here is how you would run that analysis.

  1. First you need to run a script that will generate the sbatch scripts that collect the coverage of a given gene. You will need a file called "x_metagenomic-sample-names.txt" containing the list of metagenomic samples (there are transcriptome samples as well, and I don't want to use those assemblies for this part of the project); in this case, each one has an "s_" prior to the sample name. The command below creates an sbatch script for each sample (a hedged sketch of what such a generator might look like appears after this list).

     for i in `cat x_metagenomic-sample-names.txt`; do python ~/scripts/create-tabulate-gene-coverage-script.py ${i}; done
    
  2. Now that you have a bunch of bash scripts in the directory, it's time to run them. It's important to submit them in batches if you have more than 25 samples so that you don't overload the system. You can simply use a couple of lists that contain subsets of your samples: get the first batch going, wait a bit, and then submit the second batch.

    for i in `cat x_samples-1.txt`; do sbatch ${i}_tabulate-genecoverage.shx; done
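As promised, here is a minimal sketch of what a generator along these lines might do. This is not the actual ~/scripts/create-tabulate-gene-coverage-script.py; the helper script it invokes and the exact sbatch body below are assumptions for illustration.

#!/usr/bin/env python
# Hypothetical sketch: write one sbatch script per sample that tabulates the
# coverage of each gene of interest from the pairwise covstats files.
# The tabulate-gene-coverage.py helper name is an assumption, not the real code.
import sys

sample = sys.argv[1]  # e.g. "s_10CB15_MG", passed in by the for loop above

with open("%s_tabulate-genecoverage.shx" % sample, "w") as out:
    out.write("#!/bin/bash\n")
    out.write("#SBATCH --nodes=1\n")
    out.write("#SBATCH --partition=short\n")
    out.write("#SBATCH --time=04:00:00\n")
    for gene in open("x_gene-names.txt"):
        # One tabulation call per functional gene (hypothetical helper script).
        out.write("python ~/scripts/tabulate-gene-coverage.py --sample %s --gene %s\n"
                  % (sample, gene.strip()))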
    

These bash scripts will generate a coverage file for each of the pairwise mapping covstats.txt files in your "MAPPING" directory. For example, there will be a ton of files with names like s_5SB15_MG-vs-8NB15_MT-nirS.txt, looking something like this:

scaffold	MAPPING/s_5SB15_MG-vs-8CB15_MG-covstats.txt
s_5SB15_MG_scaffold_3482_c1	12.3445
s_5SB15_MG_scaffold_8419_c1	2.7756
s_5SB15_MG_scaffold_10151_c1	7.5100
s_5SB15_MG_scaffold_10327_c1	14.3237
s_5SB15_MG_scaffold_11293_c1	14.6181
s_5SB15_MG_scaffold_12271_c1	2.4394

This, of course, is the coverage of each nirS-containing scaffold found in sample s_5SB15_MG by the short reads derived from s_8CB15_MG. There are a ton of these files, and we need to put them all together.
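If you ever need to fold these files into a long-format table yourself, the parsing is straightforward. Here is a minimal sketch; the filename pattern (source assembly, comparison sample, omics type, gene) is inferred from the examples above and is an assumption, not a spec.

#!/usr/bin/env python
# Minimal sketch: parse one pairwise per-gene coverage file into long-format rows.
# Assumes names like s_5SB15_MG-vs-8NB15_MT-nirS.txt (source_MG-vs-comparison_omics-gene).
import os
import re

def parse_gene_file(path):
    name = os.path.basename(path)
    m = re.match(r"s_(\w+)_MG-vs-(\w+)_(M[GT])-(\w+)\.txt", name)
    source, comparison, omics, gene = m.groups()
    rows = []
    with open(path) as handle:
        next(handle)                      # skip the header line
        for line in handle:
            scaffold, depth = line.split()
            rows.append((scaffold, gene, source, comparison, omics, float(depth)))
    return rows

for row in parse_gene_file("s_5SB15_MG-vs-8NB15_MT-nirS.txt"):
    print("\t".join(str(x) for x in row))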

Let's put together all of the coverage information for a single gene of interest using another great little script, run like this:

#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --mem=100Gb
#SBATCH --time=00:20:00
#SBATCH --partition=express

for i in `cat x_sample-names.txt`; do python ~/scripts/combine-coverage-from-hmms-and-covstats.py --gene nirS --path /scratch/vineis.j/NITROGEN_ENRICH/GLOBUS-DOWNLOAD/ --out ${i}-nirS.txt --sample ${i}-; done

This will create, for each sample, a file containing the nirS sequences found within all scaffolds and the coverage of each of those scaffolds in that individual sample.

Now combine all of the tables. You can do this with Excel or Unix. I'll put together a script that combines the tables and normalizes based on the number of reads per sample. THIS IS A CRITICAL NORMALIZATION. Do not analyze your data without it. Here is one way to do this for nirS:

  1. From the directory containing all of the nirS.txt files created above, run this to concatenate them:

     cat *nirS.txt > x_ALL-nirS-NENRICH.txt
    
  2. Fix the table so that you don't have a header line for each of the samples. For now, just open it in Excel and remove all but a single header line. Then run the script below. This script requires a two-column file with the sample name in column 1 and the read count in column 2.

The top of my read-count file looks like this:

s_10CB15_MG	100415802
s_10CB15_MT	21822096
s_10CP15_MG	85530616
s_10CP15_MT	25864552
s_10CT5_MG	133443700
s_10CT5_MT	12044266
s_10NB15_MT	19241118
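If you don't already have this read-count file, one way to build it is to count the FASTQ records in each sample's filtered reads. A minimal sketch, assuming gzipped FASTQ files in the Filtered_Raw_Data directories used earlier (the glob pattern is an assumption):

#!/usr/bin/env python
# Minimal sketch: build the two-column read-count file by counting FASTQ
# records (lines / 4) in each sample's filtered reads.
import glob
import gzip

for line in open("sample_names.txt"):
    sample = line.strip()
    total = 0
    for path in glob.glob("s_%s/Filtered_Raw_Data/*fastq.gz" % sample):
        with gzip.open(path, "rt") as handle:
            total += sum(1 for _ in handle) // 4
    print("s_%s\t%d" % (sample, total))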

The top of the concatenated nirS file, with the mapping to each sample, looks like this:

sample	nirS	nirS_source	nirS_comparison	nirS_omics	nirS_comparison_depth
s_8WT15_MG-vs-10CB15_MT-covstats.txt	nirS	8WT15	10CB15	MT	167.186
s_8WT15_MG-vs-10CB15_MG-covstats.txt	nirS	8WT15	10CB15	MG	540.1029
s_8WP15_MG-vs-10CB15_MT-covstats.txt	nirS	8WP15	10CB15	MT	117.1182

You can run the script below if your data are structured exactly like mine; otherwise it will take some adjustment. This part could be improved, and I'll work on it in the future.

python ~/scripts/correct-depth-of-coverage-by-read-count.py x_ALL-nirS-NENRICH.txt ../filtered-read-count-per-sample.txt x_ALL-nirS-NENRICH-normalized.txt

The resulting file "x_ALL-nirS-NENRICH-normalized.txt" contains a final column that can be described as the average coverage per million reads. It was calculated by dividing the average coverage of the scaffold containing the gene of interest by the total number of reads in the dataset, then multiplying by 1,000,000.
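To make the arithmetic concrete, here is a minimal sketch of that normalization. This is not the actual correct-depth-of-coverage-by-read-count.py; the column positions are assumed to match the table layout shown above.

#!/usr/bin/env python
# Minimal sketch: divide each comparison depth by that sample's total read
# count and multiply by 1e6, i.e. average coverage per million reads.
import sys

table, counts_file, out_file = sys.argv[1:4]

read_counts = {}                          # sample name -> total filtered reads
for line in open(counts_file):
    sample, count = line.split()
    read_counts[sample] = float(count)

with open(table) as f, open(out_file, "w") as out:
    header = f.readline().rstrip("\n")
    out.write(header + "\tdepth_per_million_reads\n")
    for line in f:
        fields = line.rstrip("\n").split("\t")
        comparison, omics, depth = fields[3], fields[4], float(fields[5])
        total = read_counts["s_%s_%s" % (comparison, omics)]
        # e.g. 167.186 / 21822096 * 1e6 = 7.66 for s_10CB15_MT
        out.write("%s\t%.4f\n" % (line.rstrip("\n"), depth / total * 1e6))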

This is one way to conduct the CAZy analysis; however, there is a simplified approach outlined further below.

CAZy analysis. One set of genes that are NOT included in our HMMs are those for carbohydrate utilization. Not to worry, though: we have a really nice database for this, which can be found here (/work/jennifer.bowen/DBs/CAZY/), was derived from here (http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-old@UGA/), and is described here (http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-old@UGA/readme.txt). The script I used runs on the assemblies generated by JGI and contained in each individual sample directory on Discovery here (/scratch/vineis.j/NITROGEN_ENRICH/GLOBUS-DOWNLOAD), which will eventually end up in the Bowen work directory here (/work/jennifer.bowen/JOE/). Here is what the script looks like.

#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --mem=80GB
#SBATCH --time=05:00:00
#SBATCH --partition=short
#SBATCH --array=1-62

SAMPLE=$(sed -n "$SLURM_ARRAY_TASK_ID"p sample_names.txt)

## This step may have been completed already for other reasons; check whether the databases exist before generating another one.

anvi-gen-contigs-database -f s_${SAMPLE}/QC_and_Genome_Assembly/assembly.contigs.fasta -o x_ANVIO-assembly-dbs/s_${SAMPLE}.db

# Export the Prodigal gene calls from the database, then convert them to an
# amino acid FASTA for hmmscan.
anvi-export-gene-calls -c x_ANVIO-assembly-dbs/s_${SAMPLE}.db -o x_ANVIO-assembly-dbs/s_${SAMPLE}-prodigal.txt --gene-caller prodigal

python x_convert-anvio-prodigal-hits-to-faa.py --i x_ANVIO-assembly-dbs/s_${SAMPLE}-prodigal.txt --o x_ANVIO-assembly-dbs/s_${SAMPLE}-prodigal.faa


### Make sure you load HMMER before running the script with the commands below active (i.e., without a "#" in front of them).

# Search every predicted protein against the dbCAN CAZy family HMMs.
hmmscan --domtblout x_ANVIO-assembly-dbs/s_${SAMPLE}-cazy-out.dm /work/jennifer.bowen/DBs/CAZY/dbCAN-fam-HMMs.txt x_ANVIO-assembly-dbs/s_${SAMPLE}-prodigal.faa > x_ANVIO-assembly-dbs/s_${SAMPLE}-cazy.out

# Parse the domain table into one hit per line.
bash /work/jennifer.bowen/DBs/CAZY/hmmscan-parser.sh x_ANVIO-assembly-dbs/s_${SAMPLE}-cazy-out.dm > x_ANVIO-assembly-dbs/s_${SAMPLE}-cazy-out-dm.ps

# Keep stringent hits only: e-value < 1e-15 (column 5) and HMM coverage > 0.35 (column 10).
cat x_ANVIO-assembly-dbs/s_${SAMPLE}-cazy-out-dm.ps | awk '$5<1e-15&&$10>0.35' > x_ANVIO-assembly-dbs/s_${SAMPLE}-cazy-stringent-hits.txt

You can't run the whole thing at once. You will need to run the anvi'o portion first, which generates the contigs database and exports all Prodigal gene calls from the database. When this is complete, you can use HMMER to run the search for carbohydrate-active genes among all of your contigs. Don't forget to pay attention to your conda environments when running the script: the first part needs an anvi'o environment and the second half needs the HMMER environment.

So, now you have a file for each sample ending with "cazy-stringent-hits.txt". You need to combine these into a single presence/absence matrix in order to get an idea of the composition of carbohydrate utilization in each of your samples. You can use the script below to make this happen.

combine-cazy-tables.py sample_names.txt

I ran this script from here (/scratch/vineis.j/NITROGEN_ENRICH/GLOBUS-DOWNLOAD/x_ANVIO-assembly-dbs), and the Python script and sample_names.txt file can be found in this git repository.
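For what it's worth, the logic of such a combiner is simple enough to sketch. The sketch below is not necessarily identical to the combine-cazy-tables.py in this repository; it assumes the CAZy family name is the first column of each stringent-hits file and that the files sit in the working directory.

#!/usr/bin/env python
# Minimal sketch: combine per-sample stringent CAZy hits into a single
# presence/absence matrix (families as rows, samples as columns).
# Assumes the family name (e.g. GH5.hmm) is column 1 of each hits file.
import sys

samples = [line.strip() for line in open(sys.argv[1])]   # sample_names.txt
hits = {}
for sample in samples:
    families = set()
    for line in open("s_%s-cazy-stringent-hits.txt" % sample):
        families.add(line.split()[0].replace(".hmm", ""))
    hits[sample] = families

all_families = sorted(set().union(*hits.values()))
print("\t".join(["family"] + samples))
for family in all_families:
    print("\t".join([family] + ["1" if family in hits[s] else "0" for s in samples]))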

A simpler way to run the CAZyme and fungene analysis is outlined below; this is how I plan to run it in the future. This approach requires that you have created the anvi'o database for each of your assemblies and run the CAZyme and fungene HMMs. If you haven't already done that, it can be accomplished with the single script below. You can find examples of the "sample_names.txt" file in this repository. Just be sure to specify the correct path to your assemblies in the "anvi-gen-contigs-database" command :). If this doesn't make sense today, try again tomorrow.

#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=10
#SBATCH --mem=100Gb
#SBATCH --time=07:00:00
#SBATCH --partition=short
#SBATCH --array=1-62

SAMPLE=$(sed -n "$SLURM_ARRAY_TASK_ID"p sample_names.txt)
anvi-gen-contigs-database -f s_${SAMPLE}/QC_and_Genome_Assembly/assembly.contigs.fasta -o x_ANVIO-assembly-dbs/s_${SAMPLE}.db
anvi-run-hmms -c x_ANVIO-assembly-dbs/s_${SAMPLE}.db -T 30
anvi-run-hmms -c x_ANVIO-assembly-dbs/s_${SAMPLE}.db -H ~/scripts/databas/HMM_co2fix/ -T 30
anvi-run-hmms -c x_ANVIO-assembly-dbs/s_${SAMPLE}.db -H /work/jennifer.bowen/DBs/all_fungene_anvio/ -T 30
anvi-run-hmms -c x_ANVIO-assembly-dbs/s_${SAMPLE}.db -H /work/jennifer.bowen/DBs/anvio_cazy/ -T 30 

Now export all of the genes for cazy hits.

anvi-get-sequences-for-hmm-hits -e x-external-genomes.txt --hmm-sources anvio_cazy -o x_ALL-ANVIO-CAZY.faa

Then fix the headers in the resulting fasta file.

sed 's/ bin/_bin/g' x_ALL-ANVIO-CAZY.faa > fix
mv fix x_ALL-ANVIO-CAZY.faa

Now for the best part!!! Run the script below to create a matrix of the coverage of the genes in your exported CAZy file... so beautiful.

#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --time=12:00:00
#SBATCH --tasks-per-node=10
#SBATCH --partition=short
#SBATCH --mem=100Gb

python ~/scripts/calculate-gene-coverage-from-faa-and-covstats.py -genes x_ALL-ANVIO-CAZY.faa -rc NITROGEN-ENRICH-METADATA.txt -o x_ALL-NORMALIZED-GENE-MATRIX.txt -cov /scratch/vineis.j/NITROGEN_ENRICH/GLOBUS-DOWNLOAD/MAPPING/

This has created a matrix containing the genes as rows and the samples as columns. The value within each cell was calculated by estimating the number of reads mapping to each of the genes in your "x_ALL-ANVIO-CAZY.faa" file, based on the coverage stats tables created during the bbmap step. Oh, you don't know what I'm talking about, because I didn't describe very well the script I used when only mapping to the source reads. Above, I conducted an all-vs-all approach, mapping each sample back to each of the assemblies. Here is how I did it; as written, the script below handles the self-mapping case (each sample against its own assembly), and a rough sketch of the matrix calculation itself follows it.

#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=10
#SBATCH --mem=200Gb
#SBATCH --time=24:00:00
#SBATCH --partition=short
#SBATCH --array=1-62

# Pull the sample name (including its "s_" prefix) for this array task.
SAMPLE=$(sed -n "$SLURM_ARRAY_TASK_ID"p x_all-samples.txt)

# Map each sample's filtered reads back to its own assembly and record coverage stats.
ref_file=${SAMPLE}/QC_and_Genome_Assembly/assembly.contigs.fasta
fastq_file=$( echo "${SAMPLE}/Filtered_Raw_Data/"*"fastq.gz")
output_file=MAPPING/${SAMPLE}-vs-${SAMPLE}.bam
covstats_file=MAPPING/${SAMPLE}-vs-${SAMPLE}-covstats.txt

bbmap.sh threads=4 nodisk=true interleaved=true ambiguous=random in=${fastq_file} ref=${ref_file} out=${output_file} covstats=${covstats_file} bamscript=to_bam.sh
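For reference, here is a rough sketch of the logic behind the matrix calculation itself: look up the average fold coverage of each gene's parent scaffold in the relevant covstats file, then normalize by the comparison sample's read count. This is not the actual calculate-gene-coverage-from-faa-and-covstats.py; the covstats layout (scaffold in column 1, average fold in column 2) and the MAPPING/ filename pattern are assumptions.

#!/usr/bin/env python
# Rough sketch of the normalized gene-matrix logic; not the real script.
# Assumes covstats files put the scaffold name in column 1 and Avg_fold in column 2.

def load_covstats(path):
    # scaffold -> average fold coverage for one source-vs-comparison mapping
    depth = {}
    with open(path) as handle:
        next(handle)                      # skip the covstats header
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            depth[fields[0]] = float(fields[1])
    return depth

def normalized_gene_coverage(source, comparison, scaffold, read_count):
    table = load_covstats("MAPPING/%s-vs-%s-covstats.txt" % (source, comparison))
    # average coverage per million reads, as in the normalization described above
    return table.get(scaffold, 0.0) / read_count * 1e6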

### ANVIO PLOTS of the matrices created above.
