How do I automate extracting multiple gene sequences from multiple genomes?

Asad Prodhan PhD

https://asadprodhan.github.io/


Required tools

  • Create a conda environment as follows

conda create -n gene_seq_extraction

  • Activate your conda environment

conda activate gene_seq_extraction

  • Install blast

conda install -c bioconda blast

  • Install bedtools

conda install -c bioconda bedtools
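
Once the environment is set up, it can help to confirm that the commands used later are actually on the PATH. A minimal sketch, assuming a bash shell; the helper name missing_tools is illustrative and not part of the original workflow:

```shell
#!/bin/bash
# Hypothetical helper: print the name of every command that cannot be found.
missing_tools() {
    local tool
    for tool in "$@"
    do
        # 'command -v' succeeds only if the shell can resolve the command
        command -v "${tool}" > /dev/null 2>&1 || printf '%s\n' "${tool}"
    done
}

# Usage (prints nothing when everything is installed):
# missing_tools makeblastdb blastn bedtools awk sed
```

An empty result suggests the tools are available in the active environment.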

Methods


Prepare the genome sequences

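I. Concatenate all the genomes into a single fasta file. Run this in a directory that contains only the genome fasta files, because *.fasta matches every fasta file present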

cat *.fasta > concatenated_genomes.fasta

II. Convert the concatenated_genomes.fasta into a blast database

makeblastdb -in concatenated_genomes.fasta -out concatenated_genomes_db -dbtype 'nucl' -hash_index

III. Make a directory named "concatenated_genomes_db"

mkdir concatenated_genomes_db

IV. Move all the concatenated_genomes_db.* files into the concatenated_genomes_db directory

mv concatenated_genomes_db.* concatenated_genomes_db/
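
The concatenation in step I uses the glob *.fasta, which would also pick up gene query files or a concatenated_genomes.fasta left over from an earlier run if they sit in the same directory. A minimal sketch of a more defensive concatenation, assuming the genomes are the only other .fasta files present; the function name concat_genomes is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: concatenate the genome fasta files in a directory,
# skipping gene queries (*_gene.fasta) and the output file itself.
concat_genomes() {
    local dir="$1"
    local out="${dir}/concatenated_genomes.fasta"
    local tmp f
    tmp="$(mktemp)"
    for f in "${dir}"/*.fasta
    do
        [ -e "${f}" ] || continue        # glob matched nothing
        case "${f}" in
            *_gene.fasta) continue ;;    # gene query, not a genome
            "${out}") continue ;;        # output of an earlier run
        esac
        cat "${f}" >> "${tmp}"
    done
    mv "${tmp}" "${out}"
}

# Usage: concat_genomes .
```

Building the result in a temporary file means the function can be re-run without the previous output being fed back into itself.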

Prepare the gene sequences

I. Name your genes of interest as follows

All gene sequence file names must end with _gene.fasta, for example:

rsmD_NZ_CP065044_gene.fasta
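
The extraction script selects its queries with the glob *_gene.fasta, so a query file named differently is skipped without any message. A small sketch of a renaming helper, assuming query files carry a .fasta, .fa or .fna extension; the function name gene_file_name is illustrative:

```shell
#!/bin/bash
# Hypothetical helper: print the name a query file should have so that
# it matches the *_gene.fasta pattern.
gene_file_name() {
    local name="$1"
    case "${name}" in
        *_gene.fasta) printf '%s\n' "${name}" ;;                    # already fine
        *.fasta)      printf '%s\n' "${name%.fasta}_gene.fasta" ;;
        *.fa)         printf '%s\n' "${name%.fa}_gene.fasta" ;;
        *.fna)        printf '%s\n' "${name%.fna}_gene.fasta" ;;
        *)            printf '%s\n' "${name}_gene.fasta" ;;
    esac
}

# Usage: mv "rsmD_NZ_CP065044.fasta" "$(gene_file_name "rsmD_NZ_CP065044.fasta")"
```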


Run the extraction

I. Put all the sequence files along with the following script in the same directory

II. Run the following commands one-by-one

chmod +x *

This will make the files executable

dos2unix *

This will make sure that all the files are in Unix format (LF line endings)
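
To see whether a file actually has Windows line endings before converting it, one option is to look for carriage-return characters. A sketch assuming bash and grep; the helper name has_crlf is illustrative:

```shell
#!/bin/bash
# Hypothetical helper: succeed if the file contains at least one
# carriage return, which marks Windows (CRLF) line endings.
has_crlf() {
    grep -q $'\r' "$1"
}

# Usage:
# if has_crlf name-of-the-script.sh; then echo "needs dos2unix"; fi
```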

III. Run the script as follows

DOWNLOAD the script here

./name-of-the-script.sh

The script is as follows

#!/bin/bash
#
# For each *_gene.fasta query: run blastn against the genome database,
# convert the hits to a bed file, extract the hit sequences, and write
# each hit to its own fasta file.

Red="$(tput setaf 1)"
Green="$(tput setaf 2)"
Bold="$(tput bold)"
reset="$(tput sgr0)" # turns off all attributes

for file in *_gene.fasta
do
    base=$(basename "${file}" .fasta)
    echo ""
    echo ""
    echo "${Red}${Bold}Blastn ${reset}: ${file}"
    blastn -query "${file}" -task blastn -db concatenated_genomes_db/concatenated_genomes_db -outfmt 6 -out "${base}_blastn_hits.tsv" -evalue 1e-10 -num_threads 18
    echo "${Green}${Bold}Done ${reset}: ${file}"
    echo ""
    echo "${Red}${Bold}Making bed file ${reset}: ${file}"
    # Columns kept: subject id, subject start, subject end, query id plus hit number
    awk '{print $2,$9,$10,$1"_"NR}' "${base}_blastn_hits.tsv" > "${base}_blastn_hits.bed"
    echo "${Green}${Bold}Done ${reset}: ${file}"
    echo ""
    echo "${Red}${Bold}Sorting bed file ${reset}: ${file}"
    # Put the smaller coordinate first (minus-strand hits have start > end) and
    # subtract 1 from the start: blastn positions are 1-based, bed starts are 0-based
    awk -v OFS='\t' '{ if ($2 < $3) {print $1,$2-1,$3,$4} else {print $1,$3-1,$2,$4} }' "${base}_blastn_hits.bed" > "${base}_blastn_hits_sorted.bed"
    echo "${Green}${Bold}Done ${reset}: ${file}"
    echo ""
    echo "${Red}${Bold}Extracting blastn hit sequences ${reset}: ${file}"
    # No strand column is passed, so minus-strand hits are written as the plus-strand sequence
    bedtools getfasta -fi concatenated_genomes.fasta -bed "${base}_blastn_hits_sorted.bed" -fullHeader > "${base}_blastn_hits.fasta"
    echo "${Green}${Bold}Done ${reset}: ${file}"
    echo ""
    echo "${Red}${Bold}Naming headers with the corresponding gene and genome names ${reset}: ${file}"
    # Assumes the genome headers contain the word "contig"
    awk -v var="${base}" '{ gsub(/contig/, var "_contig") } 1' "${base}_blastn_hits.fasta" > "${base}_blastn_hits_seqs.fasta"
    echo "${Green}${Bold}Done ${reset}: ${file}"
    echo ""
    echo "${Red}${Bold}Removing the coordinates from the headers ${reset}: ${file}"
    sed -r '/^>/s/:[0-9]+-[0-9]+//' "${base}_blastn_hits_seqs.fasta" > "${base}_blastn_hits_seqs_together.fasta"
    echo "${Green}${Bold}Done ${reset}: ${file}"
    echo ""
    echo "${Red}${Bold}Splitting the blastn hit sequences into individual fasta files ${reset}: ${file}"
    while IFS= read -r line
    do
        if [[ ${line:0:1} == '>' ]] # header lines start with '>'
        then
            # '${line#>}' is the entire header without the leading '>';
            # 'cut -d ":" -f1' keeps the part before the first ':'.
            # Hits with identical headers share a file name, so a later hit overwrites an earlier one
            outfile="$(echo "${line#>}" | cut -d ':' -f1).fasta"
            echo "${line}" > "${outfile}"
        else
            echo "${line}" >> "${outfile}"
        fi
    done < "${base}_blastn_hits_seqs_together.fasta"
    echo "${Green}${Bold}Done ${reset}: ${file}"
    # Remove the intermediate files of this gene only
    rm -f "${base}_blastn_hits.tsv" "${base}_blastn_hits.bed" "${base}_blastn_hits_sorted.bed" "${base}_blastn_hits_seqs.fasta" "${base}_blastn_hits.fasta"
    echo ""
    echo "${Green}${Bold}Find the blastn hits as individual fasta files. However, _together.fasta contains all of them ${reset}"
    echo ""
    echo ""
done
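
The conversion from BLAST hits to BED intervals can be tried on its own. BLAST tabular output (-outfmt 6) reports the subject start and end in columns 9 and 10 as 1-based positions, with the start larger than the end for minus-strand hits, whereas BED expects a 0-based start that is smaller than the end. A small sketch using two made-up hit lines; the function name blast_to_bed and all contig names and coordinates are illustrative only:

```shell
#!/bin/bash
# Convert BLAST tabular hits (-outfmt 6) read from stdin to BED intervals.
# Columns used: 1 = query id, 2 = subject id, 9 = subject start, 10 = subject end.
blast_to_bed() {
    awk -v OFS='\t' '{
        start = $9 + 0
        stop = $10 + 0
        if (start > stop) { tmp = start; start = stop; stop = tmp }  # minus-strand hit
        print $2, start - 1, stop, $1 "_" NR                          # 0-based start
    }'
}

# Two made-up hits: one on the plus strand, one on the minus strand
hits="$(printf '%s\n' \
    $'rsmD\tcontig_1\t99.1\t570\t5\t0\t1\t570\t101\t670\t0.0\t1020' \
    $'rsmD\tcontig_7\t98.4\t551\t9\t0\t20\t570\t2600\t2050\t0.0\t980')"

bed="$(printf '%s\n' "${hits}" | blast_to_bed)"
printf '%s\n' "${bed}"
```

For these two lines the intervals come out as contig_1 100 670 and contig_7 2049 2600, so each interval length (570 and 551) equals the number of subject bases covered by the hit.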




Output files


Figure 1: Output files for the rsmD_NZ_CP065044 gene.


The following file (rsmD_NZ_CP065044_gene_blastn_hits_seqs_together.fasta) contains the rsmD_NZ_CP065044 gene sequences from all the supplied genomes

Note how the headers of the blastn hits have been named using the corresponding gene and genome names

This is for the convenience of tracking information
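
The header naming can be reproduced on a single record. The sketch below prefixes the word "contig" with the gene name and then strips the coordinates that bedtools getfasta appends; it assumes the genome headers contain "contig", and the helper name rename_headers and the sample record are made up. Unlike a plain substitution over every line, this version touches header lines only:

```shell
#!/bin/bash
# Hypothetical helper: rewrite fasta headers read from stdin.
#   $1 = gene name to prefix to "contig"
rename_headers() {
    local gene="$1"
    awk -v var="${gene}" '/^>/ { gsub(/contig/, var "_contig") } 1' |
        sed -E '/^>/s/:[0-9]+-[0-9]+//'
}

# Made-up record in the form produced by bedtools getfasta
renamed="$(printf '>contig_12:100-670\nACGTACGT\n' | rename_headers rsmD_NZ_CP065044_gene)"
printf '%s\n' "${renamed}"
```

The header of the sample record becomes >rsmD_NZ_CP065044_gene_contig_12, while the sequence line is left unchanged.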


Figure 2: rsmD_NZ_CP065044 gene sequences extracted from all the supplied genomes

The end
