gawn's People

Contributors

clairemerot, enormandeau, rsbrennan

gawn's Issues

Confirm that use of BLAST's `-max_target_seqs` is intentional

Hi there,

This is a semi-automated message from a fellow bioinformatician. Through a GitHub search, I found that the following source files make use of BLAST's -max_target_seqs parameter:

Based on the recently published report "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows", there is a strong chance that this parameter is misused in your repository.

If the use of this parameter was intentional, please feel free to ignore and close this issue, but I would highly recommend adding a comment to your source code to notify others about this use case. If this is a duplicate issue, please accept my apologies for the redundancy, as this simple automation is not smart enough to identify such issues.

Thank you!
-- Arman (armish/blast-patrol)
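
One way to avoid relying on `-max_target_seqs 1` for best-hit selection is to let BLAST report several hits per query and pick the best one afterwards. The sketch below assumes tabular output (`-outfmt 6`, where column 1 is the query ID and column 12 the bit score) and GNU `sort`. The function name `best_hits` and the sample rows are made up for illustration and are not taken from GAWN.

```bash
#!/usr/bin/env bash
# Keep the highest-bit-score hit per query from BLAST tabular output (-outfmt 6).
# Reads tab-separated rows on stdin, writes one row per query on stdout.
best_hits() {
    tab="$(printf '\t')"
    # Sort by query ID, then by bit score (column 12) from high to low,
    # then keep only the first row seen for each query ID.
    LC_ALL=C sort -t "$tab" -k1,1 -k12,12gr | awk -F '\t' '!seen[$1]++'
}

# Hypothetical input: two hits for q1 and one hit for q2.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    q1 hitA 80.0 100 20 0 1 100 1 100 1e-10 90.5 \
    q1 hitB 95.0 100 5 0 1 100 1 100 1e-30 180 \
    q2 hitC 70.0 50 15 0 1 50 1 50 1e-05 45.1 |
    best_hits
```

Filtering after the search keeps the choice of "best" explicit (here, bit score) instead of leaving it to how BLAST truncates its hit list.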

GAWN Broken

GAWN is pretty unstable right now with regard to which tool versions you are using.

Following an OS update to Linux Mint 18.2, GAWN stopped working after I re-installed the dependencies.

Bash problem during "Annotating transcriptome with swissprot"

I am using the updated master branch, having changed only gawn_config.sh

$ ./gawn 02_infos/gawn_config.sh
 ---------------------------------------------- 
GAWN v0.3 - Genome Annotation Without Nightmares
 ---------------------------------------------- 

02_infos/gawn_config.sh

GAWN: Skipping genome indexing
 _______________________________________________________________________
/                                                                       \
| GAWN: Annotating genome with transcriptome
| --------------------------------------------------------------------- |
|                                                                       |
\_______________________________________________________________________/

 _______________________________________________________________________
/                                                                       \
| GAWN: Annotating transcriptome with swissprot
| --------------------------------------------------------------------- |
parallel: Warning: A record was longer than 1000. Increasing to --blocksize 1301.
/bin/bash: P33231: command not found
/bin/bash: LLDP_ECOLI.txt: command not found
/bin/bash: 14007330: command not found
/bin/bash: BG713380.1: command not found
/bin/bash: gb: command not found
/bin/bash: .info: command not found
/bin/bash: P59843: command not found
/bin/bash: gb: command not found
/bin/bash: .info: command not found
/bin/bash: BI067949.1: command not found
/bin/bash: 14475471: command not found
/bin/bash: INSB_HAEDU.txt: command not found
/bin/bash: P33231: command not found
/bin/bash: .info: command not found
/bin/bash: LLDP_ECOLI.txt: command not found
/bin/bash: gb: command not found
/bin/bash: BG713331.1: command not found
/bin/bash: 14007281: command not found
/bin/bash: .info: command not found
/bin/bash: BG713670.1: command not found
/bin/bash: gb: command not found
/bin/bash: 14007620: command not found
/bin/bash: GLK_ECOSE.txt: command not found
/bin/bash: B6I6T9: command not found
/bin/bash: BG710168.1: command not found
/bin/bash: .info: command not found
/bin/bash: gb: command not found
/bin/bash: 14004118: command not found
/bin/bash: P0ACV3: command not found
/bin/bash: LPXP_SHIFL.txt: command not found
/bin/bash: P0ACV3: command not found
/bin/bash: 14006980: command not found
/bin/bash: BG713030.1: command not found
/bin/bash: .info: command not found
/bin/bash: gb: command not found
/bin/bash: LPXP_SHIFL.txt: command not found
/bin/bash: E0IWI3: command not found
/bin/bash: DCDA_ECOLW.txt: command not found
/bin/bash: gb: command not found
/bin/bash: GE310270.1: command not found
/bin/bash: .info: command not found
/bin/bash: 209377782: command not found
/bin/bash: B5Z2P7: command not found
/bin/bash: BGAL_ECO5E.txt: command not found
/bin/bash: 226767304: command not found
/bin/bash: gb: command not found
/bin/bash: GO523315.1: command not found
/bin/bash: .info: command not found
/bin/bash: gb: command not found
/bin/bash: .info: command not found
/bin/bash: BG712708.1: command not found
/bin/bash: P0ACV3: command not found
/bin/bash: 14006658: command not found
/bin/bash: LPXP_SHIFL.txt: command not found
|                                                                       |
\_______________________________________________________________________/

 _______________________________________________________________________
/                                                                       \
| GAWN: Annotating genome using transcriptome
| --------------------------------------------------------------------- |
|                                                                       |
\_______________________________________________________________________/

Any ideas on why this might be happening?
Thanks
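
A possible reading of the log above, offered as a guess rather than a confirmed diagnosis of GAWN's scripts: the tokens that bash tries to execute (`P33231`, `LLDP_ECOLI.txt`, `gb`, `.info`, ...) look like pieces of sequence names containing `|`, as in Swiss-Prot-style or GenBank-style headers, which an unquoted shell command splits into a pipeline. The sketch below shows one way to turn such an identifier into a safe file name. `sanitize_id` is a hypothetical helper, not part of GAWN.

```bash
#!/usr/bin/env bash
# Replace every character outside a conservative safe set with an underscore,
# so that '|' (and spaces, ';', '&', ...) can no longer be read as shell syntax.
sanitize_id() {
    printf '%s' "$1" | LC_ALL=C tr -c 'A-Za-z0-9._-' '_'
}

# Hypothetical identifier in Swiss-Prot header style.
id='sp|P33231|LLDP_ECOLI'
outfile="$(sanitize_id "$id").txt"

# Always quote the variable when it is passed on to another command.
printf 'would write to: %s\n' "$outfile"
```

Quoting the original identifier everywhere it is used would also work. Sanitizing once up front is simply easier to audit when the name later passes through `parallel` and a second shell.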

blastx: Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: -

$ /usr/bin/time -v ./gawn 02_infos/gawn_config.sh

 ---------------------------------------------- 
GAWN v0.3 - Genome Annotation Without Nightmares
 ---------------------------------------------- 

02_infos/gawn_config.sh

GAWN: Skipping genome indexing
 _______________________________________________________________________
/                                                                       \
| GAWN: Annotating genome with transcriptome
| --------------------------------------------------------------------- |
|                                                                       |
\_______________________________________________________________________/

 _______________________________________________________________________
/                                                                       \
| GAWN: Annotating transcriptome with swissprot
| --------------------------------------------------------------------- |
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.

parallel: Warning: A record was longer than 1000. Increasing to --blocksize 1301.
USAGE
  blastx [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
    [-db_hard_mask filtering_algorithm] [-subject subject_input_file]
    [-subject_loc range] [-query input_file] [-out output_file]
    [-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
    [-gapextend extend_penalty] [-qcov_hsp_perc float_value]
    [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-sum_stats bool_value] [-max_intron_length length] [-seg SEG_options]
    [-soft_masking soft_masking] [-matrix matrix_name]
    [-threshold float_value] [-culling_limit int_value]
    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
    [-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]
    [-strand strand] [-parse_deflines] [-query_gencode int_value]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-line_length line_length] [-html]
    [-max_target_seqs num_sequences] [-num_threads int_value] [-remote]
    [-comp_based_stats compo] [-use_sw_tback] [-version]

DESCRIPTION
   Translated Query-Protein Subject BLAST 2.7.1+

Use '-help' to print detailed descriptions of command line arguments
========================================================================

Error: Too many positional arguments (1), the offending value: -
Error:  (CArgException::eSynopsis) Too many positional arguments (1), the offending value: -
[The same blastx usage message and "Too many positional arguments" error repeat 8 more times, followed by a second copy of the GNU Parallel citation notice.]
|                                                                       |
\_______________________________________________________________________/

 _______________________________________________________________________
/                                                                       \
| GAWN: Annotating genome using transcriptome
| --------------------------------------------------------------------- |
|                                                                       |
\_______________________________________________________________________/

	Command being timed: "./gawn 02_infos/gawn_config.sh"
	User time (seconds): 0.42
	System time (seconds): 0.18
	Percent of CPU this job got: 103%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.58
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 28816
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 4
	Minor (reclaiming a frame) page faults: 43291
	Voluntary context switches: 466
	Involuntary context switches: 111
	Swaps: 0
	File system inputs: 168
	File system outputs: 9192
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

makeblastdb missing?

Currently, the use of the $SWISSPROT_DB variable is ambiguous. In the gawn_config.sh example it appears to be a path, but makeblastdb hasn't been called, so the blastx call will raise an error. If the user is supposed to run the makeblastdb command beforehand, this should be mentioned in the README.md file.
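
As a rough illustration of the check being asked for, the sketch below tests whether a protein BLAST database exists for a given prefix before blastx is called, and prints the makeblastdb command otherwise. It assumes the usual protein database layout (a `.pin` index file, or a `.pal` alias file for multi-volume databases). The helper name `protein_db_exists` and the FASTA file name are hypothetical.

```bash
#!/usr/bin/env bash
# Return 0 if a protein BLAST database with this prefix appears to exist.
# Single-volume databases have <prefix>.pin; multi-volume ones have <prefix>.pal.
protein_db_exists() {
    [ -f "$1.pin" ] || [ -f "$1.pal" ]
}

SWISSPROT_DB="uniprot_sprot.db"         # database prefix, as set in gawn_config.sh
SWISSPROT_FASTA="uniprot_sprot.fasta"   # hypothetical name of the source FASTA

if ! protein_db_exists "$SWISSPROT_DB"; then
    echo "No BLAST database found for prefix '$SWISSPROT_DB'. Build one with:" >&2
    echo "  makeblastdb -in $SWISSPROT_FASTA -dbtype prot -out $SWISSPROT_DB" >&2
fi
```

Failing early with an explicit message is friendlier than letting blastx report a missing database in the middle of a run.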

Producing proteome and transcriptome fasta files

Hello,
First of all, I'd like to say thanks for developing GAWN - I find it very useful and easy to use.
I wanted to ask for advice and/or suggest a new feature.
It would be nice if, apart from the gff output, GAWN also produced fasta files containing the transcript and protein sequences of the predicted genes (similar to MAKER's output).
It seems like gffread can produce transcript sequences, which can then be used to produce protein sequences (maybe with TransDecoder). So maybe you can incorporate this logic into the pipeline or perhaps do something smarter.
Would you say that the strategy I suggested is valid for obtaining transcripts and proteins?
Thanks!
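
The strategy described in this request could look roughly like the commands below. This is only a sketch: the commands are printed rather than executed, all file names are hypothetical, and nothing here reflects how GAWN itself is implemented. `gffread -w` writes the spliced exon sequence of each transcript from a genome given with `-g`, and TransDecoder's `LongOrfs` and `Predict` steps then look for likely coding regions in those transcripts.

```bash
#!/usr/bin/env bash
# Print (not run) the commands for one possible transcript/protein extraction.
# $1 = genome FASTA, $2 = annotation GFF, $3 = output prefix (hypothetical names)
print_extraction_commands() {
    transcripts="$3.transcripts.fasta"
    # Step 1: extract the spliced transcript sequences from the genome.
    printf 'gffread -w %s -g %s %s\n' "$transcripts" "$1" "$2"
    # Steps 2 and 3: find long ORFs, then predict the likely coding regions.
    printf 'TransDecoder.LongOrfs -t %s\n' "$transcripts"
    printf 'TransDecoder.Predict -t %s\n' "$transcripts"
}

print_extraction_commands genome.fasta annotation.gff gawn
```

Printing the commands first makes it easy to review paths before running anything against real data.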

Cannot locate cdna.transdecoder_dir & hangup error at blastx

$ cat 02_infos/gawn_config.sh 
#!/bin/bash

# Modify the following parameter values according to your experiment
# Do not modify the parameter names or remove parameters
# Do not add spaces around the equal (=) sign

# Global parameters
NCPUS=2                    # Number of CPUs to use for analyses (int, 1+)

# Genome indexing
SKIP_GENOME_INDEXING=0      # 1 to skip genome indexing, 0 to index it

# Genome annotation with transcriptome
# NOTE: do not use compressed fasta files
GENOME_NAME="SRR001665_contigs_greater200.fasta"  # Name of genome fasta file found in 03_data
TRANSCRIPTOME_NAME="evidence.fasta"    # Name of transcriptome fasta file found in 03_data

# Swissprot
SWISSPROT_DB="uniprot_sprot.db"
$ /usr/bin/time -v ./gawn 02_infos/gawn_config.sh &>stdout
$ cat stdout

 ----------------------------------------- 
GAWN - Genome Annotation Without Nightmares
 ----------------------------------------- 

02_infos/gawn_config.sh
 _______________________________________________________________________
/                                                                       \
| \nGAWN: Indexing genome
| --------------------------------------------------------------------- |
-k flag not specified, so building with default 15-mers
Sorting chromosomes in chrom order.  To turn off or sort other ways, use the -s flag.
Creating files in directory 03_data/indexed_genome
Running "/usr/lib/gmap/fa_coords"     -o "03_data/indexed_genome.coords" -f "03_data/indexed_genome.sources"
Opening file 03_data/SRR001665_contigs_greater200.fasta
  Processed short contigs (<1000000 nt): ...................................................................................................More than 100 short contigs.  Will stop printing.

============================================================
Contig mapping information has been written to file 03_data/indexed_genome.coords.
You should look at this file, and edit it if necessary
If everything is okay, you should proceed by running
    make gmapdb
============================================================
Running "/usr/lib/gmap/gmap_process"  -c "03_data/indexed_genome.coords" -f "03_data/indexed_genome.sources" | "/usr/lib/gmap/gmapindex"  -d indexed_genome -D "03_data/indexed_genome" -A 
Reading coordinates from file 03_data/indexed_genome.coords
Logging contig contig_10293114 at contig_10293114:1..461 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_1990817 at contig_1990817:1..803 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_2825224 at contig_2825224:1..2671 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_3235758 at contig_3235758:1..7205 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_4165115 at contig_4165115:1..6777 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_4270659 at contig_4270659:1..13060 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_4302725 at contig_4302725:1..12221 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_5182773 at contig_5182773:1..3558 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_5850706 at contig_5850706:1..1061 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6139417 at contig_6139417:1..15743 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6922011 at contig_6922011:1..7200 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7040037 at contig_7040037:1..12701 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_767663 at contig_767663:1..11555 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_992401 at contig_992401:1..4328 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_10263289 at contig_10263289:1..9800 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_1275019 at contig_1275019:1..2348 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_1467656 at contig_1467656:1..4439 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_2913181 at contig_2913181:1..10354 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6393479 at contig_6393479:1..419 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6916793 at contig_6916793:1..1939 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7048192 at contig_7048192:1..8206 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7852423 at contig_7852423:1..4287 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7993995 at contig_7993995:1..1449 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_8425974 at contig_8425974:1..7394 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_8558094 at contig_8558094:1..1897 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_8650055 at contig_8650055:1..9126 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_9296210 at contig_9296210:1..27794 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_9392475 at contig_9392475:1..3785 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_9594499 at contig_9594499:1..5134 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_2479530 at contig_2479530:1..1045 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_5659803 at contig_5659803:1..15580 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6019695 at contig_6019695:1..6133 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6409070 at contig_6409070:1..26765 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_682563 at contig_682563:1..216 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6907888 at contig_6907888:1..2049 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7146251 at contig_7146251:1..249 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7192674 at contig_7192674:1..10791 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7207442 at contig_7207442:1..13078 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7235720 at contig_7235720:1..2256 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_8775926 at contig_8775926:1..2871 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_9091474 at contig_9091474:1..14820 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_10181993 at contig_10181993:1..7049 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_1426058 at contig_1426058:1..7386 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_1811339 at contig_1811339:1..7447 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_2951315 at contig_2951315:1..2669 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_3399947 at contig_3399947:1..2087 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_3959747 at contig_3959747:1..280 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7193903 at contig_7193903:1..6273 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7282878 at contig_7282878:1..4505 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7681370 at contig_7681370:1..14144 in genome indexed_genome
 => primary (linear) chromosome
More than 50 contigs.  Will stop printing messages
Total genomic length = 4542792 bp
Have a total of 509 chromosomes
Writing chromosome file 03_data/indexed_genome/indexed_genome.chromosome
Chromosome contig_24343 has universal coordinates 1..251
Chromosome contig_45221 has universal coordinates 252..705
Chromosome contig_53152 has universal coordinates 706..8434
Chromosome contig_176392 has universal coordinates 8435..9217
Chromosome contig_235461 has universal coordinates 9218..49727
Chromosome contig_356264 has universal coordinates 49728..50437
Chromosome contig_359465 has universal coordinates 50438..55046
Chromosome contig_372625 has universal coordinates 55047..57797
Chromosome contig_377765 has universal coordinates 57798..64707
Chromosome contig_407593 has universal coordinates 64708..68126
Chromosome contig_414500 has universal coordinates 68127..75840
Chromosome contig_449583 has universal coordinates 75841..105061
Chromosome contig_460419 has universal coordinates 105062..109783
Chromosome contig_486889 has universal coordinates 109784..120399
Chromosome contig_534487 has universal coordinates 120400..122131
Chromosome contig_535793 has universal coordinates 122132..123830
Chromosome contig_555012 has universal coordinates 123831..179288
Chromosome contig_555933 has universal coordinates 179289..184470
Chromosome contig_556834 has universal coordinates 184471..185570
Chromosome contig_592056 has universal coordinates 185571..190262
Chromosome contig_602108 has universal coordinates 190263..196783
Chromosome contig_633224 has universal coordinates 196784..198490
Chromosome contig_634687 has universal coordinates 198491..205619
Chromosome contig_650065 has universal coordinates 205620..219728
Chromosome contig_682563 has universal coordinates 219729..219944
Chromosome contig_695464 has universal coordinates 219945..220621
Chromosome contig_721559 has universal coordinates 220622..255972
Chromosome contig_767663 has universal coordinates 255973..267527
Chromosome contig_776676 has universal coordinates 267528..281040
Chromosome contig_788404 has universal coordinates 281041..300920
Chromosome contig_789892 has universal coordinates 300921..303507
Chromosome contig_839883 has universal coordinates 303508..328951
Chromosome contig_853386 has universal coordinates 328952..339631
Chromosome contig_872771 has universal coordinates 339632..347448
Chromosome contig_897859 has universal coordinates 347449..363035
Chromosome contig_900031 has universal coordinates 363036..376471
Chromosome contig_933230 has universal coordinates 376472..393525
Chromosome contig_955799 has universal coordinates 393526..400362
Chromosome contig_989836 has universal coordinates 400363..410256
Chromosome contig_992401 has universal coordinates 410257..414584
Chromosome contig_1000503 has universal coordinates 414585..428712
Chromosome contig_1005279 has universal coordinates 428713..429065
Chromosome contig_1036282 has universal coordinates 429066..430909
Chromosome contig_1038394 has universal coordinates 430910..431973
Chromosome contig_1057666 has universal coordinates 431974..432749
Chromosome contig_1144471 has universal coordinates 432750..433024
Chromosome contig_1173093 has universal coordinates 433025..444356
Chromosome contig_1215359 has universal coordinates 444357..447896
Chromosome contig_1257485 has universal coordinates 447897..464323
Chromosome contig_1275019 has universal coordinates 464324..466671
More than 50 contigs.  Will stop printing messages
Writing chromosome IIT file 03_data/indexed_genome/indexed_genome.chromosome.iit
Writing IIT file header information...coordinates require 4 bytes each...done
Processing null division/chromosome...sorting...writing...done (509 intervals)
Writing IIT file footer information...done
Writing IIT file header information...coordinates require 4 bytes each...done
Processing null division/chromosome...sorting...writing...done (509 intervals)
Writing IIT file footer information...done
No alternate scaffolds observed
Running "/usr/lib/gmap/gmap_process"  -c "03_data/indexed_genome.coords" -f "03_data/indexed_genome.sources" | "/usr/lib/gmap/gmapindex"  -d indexed_genome -F "03_data/indexed_genome" -D "03_data/indexed_genome" -G
Genome length is 4542792 nt
Trying to allocate 425889*4 bytes of memory...succeeded.  Building genome in memory.
Reading coordinates from file 03_data/indexed_genome.coords
Writing contig contig_10293114 to universal coordinates 4527948..4528408
Writing contig contig_1990817 to universal coordinates 711269..712071
Writing contig contig_2825224 to universal coordinates 945851..948521
Writing contig contig_3235758 to universal coordinates 1112305..1119509
Writing contig contig_4165115 to universal coordinates 1441933..1448709
Writing contig contig_4270659 to universal coordinates 1450860..1463919
Writing contig contig_4302725 to universal coordinates 1463920..1476140
Writing contig contig_5182773 to universal coordinates 1898605..1902162
Writing contig contig_5850706 to universal coordinates 2076850..2077910
Writing contig contig_6139417 to universal coordinates 2196498..2212240
Writing contig contig_6922011 to universal coordinates 2512854..2520053
Writing contig contig_7040037 to universal coordinates 2555128..2567828
Writing contig contig_767663 to universal coordinates 255973..267527
Writing contig contig_992401 to universal coordinates 410257..414584
Writing contig contig_10263289 to universal coordinates 4517007..4526806
Writing contig contig_1275019 to universal coordinates 464324..466671
Writing contig contig_1467656 to universal coordinates 540170..544608
Writing contig contig_2913181 to universal coordinates 1000611..1010964
Writing contig contig_6393479 to universal coordinates 2271333..2271751
Writing contig contig_6916793 to universal coordinates 2510915..2512853
Writing contig contig_7048192 to universal coordinates 2569810..2578015
Writing contig contig_7852423 to universal coordinates 2982252..2986538
Writing contig contig_7993995 to universal coordinates 3009703..3011151
Writing contig contig_8425974 to universal coordinates 3284433..3291826
Writing contig contig_8558094 to universal coordinates 3352042..3353938
Writing contig contig_8650055 to universal coordinates 3380962..3390087
Writing contig contig_9296210 to universal coordinates 3976533..4004326
Writing contig contig_9392475 to universal coordinates 4140350..4144134
Writing contig contig_9594499 to universal coordinates 4256322..4261455
Writing contig contig_2479530 to universal coordinates 834101..835145
Writing contig contig_5659803 to universal coordinates 2036886..2052465
Writing contig contig_6019695 to universal coordinates 2121794..2127926
Writing contig contig_6409070 to universal coordinates 2287474..2314238
Writing contig contig_682563 to universal coordinates 219729..219944
Writing contig contig_6907888 to universal coordinates 2508866..2510914
Writing contig contig_7146251 to universal coordinates 2611676..2611924
Writing contig contig_7192674 to universal coordinates 2634499..2645289
Writing contig contig_7207442 to universal coordinates 2655003..2668080
Writing contig contig_7235720 to universal coordinates 2692919..2695174
Writing contig contig_8775926 to universal coordinates 3499694..3502564
Writing contig contig_9091474 to universal coordinates 3897132..3911951
Writing contig contig_10181993 to universal coordinates 4509958..4517006
Writing contig contig_1426058 to universal coordinates 513065..520450
Writing contig contig_1811339 to universal coordinates 617904..625350
Writing contig contig_2951315 to universal coordinates 1010965..1013633
Writing contig contig_3399947 to universal coordinates 1151182..1153268
Writing contig contig_3959747 to universal coordinates 1339275..1339554
Writing contig contig_7193903 to universal coordinates 2648730..2655002
Writing contig contig_7282878 to universal coordinates 2775352..2779856
More than 50 contigs.  Will stop printing messages
A total of 0 non-ACGTNX characters were seen in the genome.
Running cat "03_data/indexed_genome/indexed_genome.genomecomp" | "/usr/lib/gmap/gmapindex" -d indexed_genome -U > "03_data/indexed_genome/indexed_genome.genomebits128"
Running cat "03_data/indexed_genome/indexed_genome.genomecomp" | "/usr/lib/gmap/gmapindex" -k 15 -q 3  -d indexed_genome -F "03_data/indexed_genome" -D "03_data/indexed_genome" -N
Counting positions in genome indexed_genome (15 bp every 3 bp), position 0
Number of offsets: 1512047 => pages file not required
Running "/usr/lib/gmap/gmapindex" -k 15 -q 3  -d indexed_genome -F "03_data/indexed_genome" -D "03_data/indexed_genome" -O  "03_data/indexed_genome/indexed_genome.genomecomp"
Offset compression types: bitpack64
Allocating 16777216*1 bytes for packsizes
Allocating 16777216*8 bytes for bitpacks
Indexing offsets of oligomers in genome indexed_genome (15 bp every 3 bp), position 0
Writing 1073741825 offsets compressed via bitpack64...done
Running "/usr/lib/gmap/gmapindex" -k 15 -q 3  -d indexed_genome -F "03_data/indexed_genome" -D "03_data/indexed_genome" -P "03_data/indexed_genome/indexed_genome.genomecomp"
Looking for index files in directory 03_data/indexed_genome
  Pointers file is indexed_genome.ref153offsets64meta
  Offsets file is indexed_genome.ref153offsets64strm
  Positions file is indexed_genome.ref153positions
Expanding offsetsstrm into counters...done
Allocating 21797152 bytes for counterstrm
Trying to allocate 1512047*4 bytes of memory for positions...succeeded.  Building positions in memory.
Indexing positions of oligomers in genome indexed_genome (15 bp every 3 bp), position 0
Writing 1512047 genomic positions to file 03_data/indexed_genome/indexed_genome.ref153positions ...
done
Running "/usr/lib/gmap/gmapindex" -d indexed_genome -F "03_data/indexed_genome" -D "03_data/indexed_genome" -S
Genome length is 4542792
Building suffix array
SACA_K called with n = 4542793, K = 5, level 0
SACA_K called with n = 1276733, K = 0, level 1
SACA_K called with n = 408368, K = 0, level 2
SACA_K called with n = 133769, K = 0, level 3
For indexsize 12, occupied 3462610/16777216
Optimal indexsize = 12
Running "/usr/lib/gmap/gmapindex" -d indexed_genome -F "03_data/indexed_genome" -D "03_data/indexed_genome" -L
Building LCP array
Writing temporary file for rank...done
Writing temporary file for permuted sarray...done
Byte-coding: 4542793 values < 255, 0 exceptions >= 255 (0.0%)
Building DC array
Building child array
Byte-coding: 4527846 values < 255, 14947 exceptions >= 255 (0.3%)
Writing file 03_data/indexed_genome/indexed_genome.salcpchilddcdone
Found 0 exceptions
|                                                                       |
\_______________________________________________________________________/

 _______________________________________________________________________
/                                                                       \
| GAWN: Annotating genome with transcriptome
| --------------------------------------------------------------------- |
|                                                                       |
\_______________________________________________________________________/

 _______________________________________________________________________
/                                                                       \
| GAWN: Adding UTR-3 and UTR-5 regions
| --------------------------------------------------------------------- |
GAWN: Create GTF file from GFF3
GAWN: Create genome-based transcriptome
-parsing cufflinks output: 04_annotation/SRR001665_contigs_greater200.gtf
-parsing genome fasta: 03_data/SRR001665_contigs_greater200.fasta
-done parsing genome.
GAWN: Creage predicted GFF3 from GTF file
GAWN: Find best ORF candidates
NAME
    Transdecoder.LongOrfs <http://transdecoder.github.io> - Transcriptome
    Protein Prediction

USAGE
    Required:

     -t <string>                            transcripts.fasta

    Optional:

     --gene_trans_map <string>              gene-to-transcript identifier mapping file (tab-delimited, gene_id<tab>trans_id<return> ) 

     -m <int>                               minimum protein length (default: 100)
 
     -G <string>                            genetic code (default: universal; see PerlDoc; options: Euplotes, Tetrahymena, Candida, Acetabularia)

     -S                                     strand-specific (only analyzes top strand)

     -p <int>                               shorten potential 5' partials if they are this percentage of the original protein or longer.

Genetic Codes
    See <http://golgi.harvard.edu/biolinks/gencode.html>. These are
    currently supported:

     universal (default)
     Euplotes
     Tetrahymena
     Candida
     Acetabularia
     Mitochondrial-Canonical
     Mitochondrial-Vertebrates
     Mitochondrial-Arthropods
     Mitochondrial-Echinoderms
     Mitochondrial-Molluscs
     Mitochondrial-Ascidians
     Mitochondrial-Nematodes
     Mitochondrial-Platyhelminths
     Mitochondrial-Yeasts
     Mitochondrial-Euascomycetes
     Mitochondrial-Protozoans

GAWN: Move transdecoder_dir
mv: cannot stat 'SRR001665_contigs_greater200.cdna.transdecoder_dir': No such file or directory
GAWN: Create final genome annotation file
Error, cannot locate file: 04_annotation/SRR001665_contigs_greater200.cdna.transdecoder_dir/longest_orfs.gff3 at ./01_scripts/TransDecoder/util/cdna_alignment_orf_to_genome_orf.pl line 23.
GAWN: Copy genome annotation to 05_results
|                                                                       |
\_______________________________________________________________________/

 _______________________________________________________________________
/                                                                       \
| GAWN: Annotating transcriptome with swissprot
| --------------------------------------------------------------------- |
SWISS DB /home/stelarov/programming/rna_seq/gap/packages/lib/uniprot_sprot

The process hangs on the blastx command and never exits.

The output directories contain:

$ ls 03_data/
evidence.fasta  indexed_genome  SRR001665_contigs_greater200.fasta
$
$ ls 04_annotation/
evidence.hits       genbank_info                       SRR001665_contigs_greater200.gawn_annotated.gff3  SRR001665_contigs_greater200.gtf
evidence.swissprot  SRR001665_contigs_greater200.cdna  SRR001665_contigs_greater200.gff3                 SRR001665_contigs_greater200.predicted.gff3
$
$ ls 05_results/
evidence_annotation_table.tsv  SRR001665_contigs_greater200_annotation_table.tsv  SRR001665_contigs_greater200.gawn_annotated.gff3
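
In the log above, TransDecoder.LongOrfs prints its usage text instead of running, the following `mv` cannot find the transdecoder_dir, and cdna_alignment_orf_to_genome_orf.pl then fails on the missing longest_orfs.gff3, yet the run continues to the swissprot step. A minimal sketch of a guard that stops at the first missing output is shown here. The function name `require_path` is hypothetical and is not part of GAWN.

```shell
# Hypothetical helper (not part of GAWN): fail early when an expected
# output is missing or is an empty file, instead of letting later steps run.
require_path() {
    path=$1
    if [ ! -e "$path" ]; then
        echo "GAWN guard: missing expected output: $path" >&2
        return 1
    fi
    # A regular file of size zero usually means the previous step failed.
    if [ -f "$path" ] && [ ! -s "$path" ]; then
        echo "GAWN guard: expected output is empty: $path" >&2
        return 1
    fi
    return 0
}

# Possible use right after the ORF step (directory name taken from the log):
#   require_path "SRR001665_contigs_greater200.cdna.transdecoder_dir" || exit 1
```

With a check like this, the run would stop at the TransDecoder step, which is where the first error appears in the log.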

Update annotation gff3 file with gene names

Hi Eric,
I am wondering whether there is a script to update the annotated .gff3 file with gene names. In my gff3 file, all genes are named ORF: "ID=Contig10.path1;Name=ORF". Is it possible to use the transcriptome annotation table or the genome annotation table to change ORF to the gene name from one of these tables? Thanks. Silvia
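
A rough sketch of one way to do such a replacement with awk follows. It assumes a two-column, tab-separated lookup file (feature name, gene name) that you would first extract from one of the annotation tables; that file layout, the function name `rename_orfs`, and the rule of stripping a trailing `.pathN` or `.mrnaN` from the ID are all assumptions, not something GAWN provides.

```shell
# Hypothetical helper: replace Name=ORF in a GFF3 file using a lookup table.
#   $1: tab-separated file, column 1 = feature name (e.g. Contig10),
#       column 2 = gene name (assumed to contain no "&" character)
#   $2: GFF3 file
rename_orfs() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { gene[$1] = $2; next }   # first file: load the lookup table
        /^#/ || NF < 9 { print; next }      # pass comments and short lines through
        {
            key = ""
            if (match($9, /(^|;)ID=[^;]*/)) {
                key = substr($9, RSTART, RLENGTH)
                sub(/^;?ID=/, "", key)                   # keep only the ID value
                sub(/\.(path|mrna)[0-9]+.*$/, "", key)   # Contig10.path1 -> Contig10
            }
            if (key != "" && (key in gene)) {
                sub(/Name=ORF/, "Name=" gene[key], $9)
            }
            print
        }
    ' "$1" "$2"
}

# Possible use (file names are placeholders):
#   rename_orfs gene_names.tsv annotation.gff3 > annotation.named.gff3
```

Features without a match in the lookup file keep Name=ORF, so the output can be diffed against the input to see what changed.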

gawn- BLAST Database error

I have a problem annotating the transcriptome with swissprot. Could you please advise how to set the path for swissprot?
I am getting this error:

BLAST Database error: No alias or index file found for protein database [03_data/uniprot_sprot] in search path [/media/a/PRACOVNI1/gawn-master:/home/a/blast_db:]

My swissprot files and uniprot_sprot.fasta are in the 03_data folder.
Thank you.
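
The error message says that no alias or index file was found for the prefix 03_data/uniprot_sprot, which usually means a FASTA file is present but no formatted BLAST database sits next to it. A small sketch of a check for that situation is given here. The function name `has_protein_blast_db` is hypothetical, and the suggested `makeblastdb` call assumes the FASTA file is named uniprot_sprot.fasta as described above.

```shell
# Sketch only: check whether a protein BLAST database exists at a given prefix.
has_protein_blast_db() {
    prefix=$1
    # .pin is the protein index file; .pal is the alias file of a
    # multi-volume protein database.
    [ -e "${prefix}.pin" ] || [ -e "${prefix}.pal" ]
}

# Prefix taken from the error message; run this from the GAWN directory.
db_prefix="03_data/uniprot_sprot"
if has_protein_blast_db "$db_prefix"; then
    echo "BLAST database found at ${db_prefix}"
else
    echo "No BLAST index at ${db_prefix}; one way to build it:"
    echo "  makeblastdb -in ${db_prefix}.fasta -dbtype prot -out ${db_prefix}"
fi
```

If the check passes and the error persists, the working directory may differ from the one the relative path expects, since the search path in the message starts with the directory the command was run from.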

Parsing .hits file for Uniprot data (Step 05)

I ran into a small problem when parsing the BLAST hits (the output of step 04) into features and hits in this loop:

cat "$SWISSPROT_HITS" |
    while read i
    do
        echo $i
        feature=$(echo $i | cut -d " " -f 1)
        hit=$(echo $i | cut -d "|" -f 4 | cut -d "." -f 1)
        echo "wget -q -O - http://www.uniprot.org/uniprot/${hit}.txt > $INFO_FOLDER/${feature}.info"
    done > wget_genbank_commands.txt

The default BLAST hits output is structured as:

[qseqid]    [sseqid]
[qseqid2]    [sseqid2]
...

I didn't understand the need for cut -d "|", since the hit should just be the sseqid (e.g. Q8BYH7) for each result.
I replaced the line in the loop with hit=$(echo $i | cut -d " " -f 2), and everything ran smoothly.
Should I be expecting the sseqid to be part of a longer string, such that the pipe delimiter is needed?

Part of my BLAST output is attached for reference.

Example.blast.out.txt
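
Whether the sseqid contains pipes depends on how the database was built, so one option is a parser that accepts both shapes. The sketch below is an assumption-laden illustration: the function name `extract_accession` is hypothetical, and the pipe-delimited layouts it handles (`sp|ACCESSION|NAME` and the older `gi|NUMBER|sp|ACCESSION.VERSION|NAME`, where the accession is field 4 as in the original `cut`) are the forms I would expect to meet, not something confirmed by the GAWN source.

```shell
# Hypothetical helper: return the bare accession from an sseqid, whether it
# is already bare (Q8BYH7) or embedded in a pipe-delimited identifier.
extract_accession() {
    sseqid=$1
    case "$sseqid" in
        gi\|*\|*\|*) acc=$(printf '%s\n' "$sseqid" | cut -d '|' -f 4) ;;  # gi|n|sp|ACC.V|NAME
        *\|*)        acc=$(printf '%s\n' "$sseqid" | cut -d '|' -f 2) ;;  # sp|ACC|NAME
        *)           acc=$sseqid ;;                                       # bare accession
    esac
    # Drop a trailing version suffix such as ".1"
    printf '%s\n' "${acc%%.*}"
}

# Demo with placeholder identifiers; each line prints Q8BYH7.
for id in 'Q8BYH7' 'sp|Q8BYH7|EXAMPLE' 'gi|0|sp|Q8BYH7.1|EXAMPLE'; do
    extract_accession "$id"
done
```

Inside the loop, the hit could then be taken from the second column and passed through this function, so the same script works for either database format.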

conda GAWN

Hello,
Is a conda version of GAWN available for installation?

Regards,
B

Correct mapping but strange gene model

Hello,
I ran GAWN using A. thaliana reference transcripts on a genome assembly from the same species.
I notice quite a few cases in which transcripts seem to map very well to the assembly, but the resulting gene model is very different from the reference gene model, especially in terms of CDS features. Here's an example.
The transcript AT1G01190.2 has the following reference gene model:

1       araport11       mRNA    83045   84946   .       -       .       ID=transcript:AT1G01190.2;Parent=gene:AT1G01190;biotype=protein_coding;transcript_id=AT1G01190.2
1       araport11       exon    83045   83671   .       -       .       Parent=transcript:AT1G01190.2;Name=AT1G01190.2.exon2;constitutive=0;ensembl_end_phase=0;ensembl_phase=0;exon_id=AT1G01190.2.exon2;rank=2
1       araport11       CDS     83045   83671   .       -       0       ID=CDS:AT1G01190.2;Parent=transcript:AT1G01190.2;protein_id=AT1G01190.2
1       araport11       CDS     83884   84879   .       -       0       ID=CDS:AT1G01190.2;Parent=transcript:AT1G01190.2;protein_id=AT1G01190.2
1       araport11       exon    83884   84946   .       -       .       Parent=transcript:AT1G01190.2;Name=AT1G01190.2.exon1;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=AT1G01190.2.exon1;rank=1
1       araport11       five_prime_UTR  84880   84946   .       -       .       Parent=transcript:AT1G01190.2

whereas in GAWN output gff3, it looks like this:

1_RaGOO indexed_genome  mRNA    77996   79895   .       -       .       ID=AT1G01190.2.mrna1;Name=AT1G01190.2;Parent=AT1G01190.2.path1;Dir=sensecoverage=100.0;identity=99.7;matches=1685;mismatches=3;indels=2;unknowns=0
1_RaGOO indexed_genome  exon    78835   79895   99      -       .       ID=AT1G01190.2.mrna1.exon1;Name=AT1G01190.2;Parent=AT1G01190.2.mrna1;Target=AT1G01190.2 1 1063 +
1_RaGOO indexed_genome  exon    77996   78622   99      -       .       ID=AT1G01190.2.mrna1.exon2;Name=AT1G01190.2;Parent=AT1G01190.2.mrna1;Target=AT1G01190.2 1064 1690 +
1_RaGOO indexed_genome  CDS     79830   79893   93      -       0       ID=AT1G01190.2.mrna1.cds1;Name=AT1G01190.2;Parent=AT1G01190.2.mrna1;Target=AT1G01190.2 3 68 +

I extracted the transcript and protein sequences based on the GFF3 output. The transcript is highly similar to the reference transcript, with only 2 gaps present in the annotated assembly (you can see the alignment here). However, the resulting protein is very short with virtually no similarity to the reference protein. This is not surprising since the annotated CDS is so short.

As far as I understand, the GFF3 result arises directly from the GMAP analysis, so I'm not sure this is in fact a GAWN issue. I'm also not particularly clear on how GMAP predicts CDS features based on transcript sequences only.
Can you help me understand the reason for the problem or suggest ways to better explore it or get around it?
Do you think that the two gaps in the assembly can cause such a difference in the gene model prediction, or is it more likely a more general issue with GMAP?
Thanks!
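
One way to explore this across many transcripts is to compare total CDS length between the two annotations. For the coordinates quoted above, the reference CDS covers (83671 - 83045 + 1) + (84879 - 83884 + 1) = 627 + 996 = 1623 bp, while the GAWN CDS covers 79893 - 79830 + 1 = 64 bp. The sketch below does that sum with awk; the function name `cds_total_length` is hypothetical, and it assumes tab-separated GFF3 input already filtered to a single transcript.

```shell
# Hypothetical helper: sum the lengths of all CDS features in GFF3 input
# (files given as arguments, or standard input when none are given).
cds_total_length() {
    awk -F '\t' '$3 == "CDS" { total += $5 - $4 + 1 } END { print total + 0 }' "$@"
}

# Demo with the CDS coordinates from the post (attribute column abbreviated).
printf '1\taraport11\tCDS\t83045\t83671\t.\t-\t0\tID=CDS:AT1G01190.2\n1\taraport11\tCDS\t83884\t84879\t.\t-\t0\tID=CDS:AT1G01190.2\n' |
    cds_total_length        # prints 1623

printf '1_RaGOO\tindexed_genome\tCDS\t79830\t79893\t93\t-\t0\tID=AT1G01190.2.mrna1.cds1\n' |
    cds_total_length        # prints 64
```

Running this per transcript on both files and listing the transcripts whose totals differ widely would show how common the truncated CDS pattern is in the assembly.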
