Comments (12)
I noticed that you switched from conda to singularity while increasing the memory allocation. The two may use different versions of EDTA, RepeatMasker, and rmblast. You may want to compare the versions of these packages between the two installations.
Shujun
from edta.
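The version check suggested above could be sketched as follows. This is only an illustration: `compare_pkg_versions` is a hypothetical helper, and it assumes both installations can produce a package listing in `conda list` format (name, version, build, channel).

```shell
# Hypothetical helper: report packages of interest whose versions differ
# between two listings in `conda list` format (name version build channel).
# Usage: compare_pkg_versions conda_env.txt container.txt
compare_pkg_versions() {
    awk -v pat='^(edta|repeatmasker|rmblast)$' '
        /^#/ { next }                                  # skip conda header lines
        NR == FNR { if (tolower($1) ~ pat) a[tolower($1)] = $2; next }
        tolower($1) ~ pat { b[tolower($1)] = $2 }
        END {
            for (p in a) {
                other = (p in b) ? b[p] : "missing"
                if (a[p] != other) print p, a[p], other
            }
            for (p in b) if (!(p in a)) print p, "missing", b[p]
        }' "$1" "$2" | sort
}
```

The two input files might come from `conda list -n EDTA > conda_env.txt` and `singularity exec [path to]/EDTA.sif conda list > container.txt`. The environment name `EDTA` and the availability of `conda` inside the image are assumptions. Each output line gives the package name, the first version, and the second version; no output means the listed packages match.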
@oushujun -- I'm sorry I didn't clarify this in the original post, but I've tried both versions available via singularity and the current version available via conda. The only thing that has worked (tentatively) is increasing cpus-per-task and sometimes memory allocation, but EDTA is not actually using all of the resources allocated (4.21% in the run above) and I can't replicate this success on larger genomes.
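On the resource-usage point: SLURM's `-c`/`--cpus-per-task` only reserves cores, and a program uses them only if its own thread setting asks for them. A minimal sketch of forwarding the allocation to EDTA, assuming the `--threads` option used in the test-run command later in this thread. `edta_threads` is a hypothetical helper and the fallback of 4 is an arbitrary choice. This addresses utilization only and is not offered as an explanation for the low repeat estimates.

```shell
# Hypothetical helper: use SLURM's per-task CPU count when it is set,
# otherwise fall back to the value given as the first argument.
edta_threads() {
    echo "${SLURM_CPUS_PER_TASK:-$1}"
}

threads=$(edta_threads 4)

# Print the command that would run; inside a real job script, drop the echo
# and wrap the command in the same `singularity exec` call as before.
echo "EDTA.pl --genome genome.fa --anno 1 --threads ${threads}"
```

SLURM sets `SLURM_CPUS_PER_TASK` inside the job when `-c` is given, so the same script follows whatever allocation is requested.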
@oushujun -- I have also used the most recent version. EDTA finishes without error in every case, even when it's estimating a total repeat content of 6% for a known 30-33% genome or 15% for a known 80% genome.
@oushujun -- I just wanted to follow up on this. Do you have any guesses as to what might be causing the behavior I described above?
@laramiemckenna I am not sure. I have not seen this behavior before. Even for the small Arabidopsis genome, the EDTA annotation is reasonable and captures the major numbers. If you don't see any errors, I don't know what may have gone wrong. Did you test on the rice or Arabidopsis genome?
@oushujun I was not sure what the expected output was for the test data, but I did run EDTA on Arabidopsis TAIR10.1 using the same parameters as the run that was successful above (the second example in the original issue). The result, shown below, is well short of the expected ~21%. I'm extra confused because the exact same parameters, version, and image were used for the run that was somewhat successful, yet this one was not.
Script
#!/bin/sh
#SBATCH -e edta_singularity_%j.err
#SBATCH -o edta_singularity_%j.out
#SBATCH --job-name=edta_singularity
#SBATCH --time-min=120:00:00
#SBATCH -c 100
#SBATCH --mem=300G
#SBATCH --partition=plant
#SBATCH --nodes=1
module load cluster/singularity/3.11.0
export PYTHONNOUSERSITE=1
singularity exec [path to]/EDTA.sif EDTA.pl --genome GCA_000001735.2_TAIR10.1_genomic.fna --anno 1
SLURM job output:
Job 1514574 (COMPLETED)
Name edta_singularity
Submit sbatch edta.sh
Nodes plant - plant02
PWD [path to]/arabi_edta_test
Input /dev/null
Output [path to]/arabi_edta_test/edta_singularity_1514574.out
Error [path to]/arabi_edta_test/edta_singularity_1514574.err
Resources CPU = 100 Memory = 307200
Start 2023-10-03 09:31:32
End 2023-10-03 11:58:44
Elapsed 147.2 minutes
Limit 28800 minutes
Exit Code SUCCESS (0)
Usage:
min CPU = 19205.11 sec (5:20:05.11, 2.17 %)
min Mem = 5237.469 MB (1.7 %)
max CPU = 19205.11 sec (5:20:05.11, 2.17 %)
max Mem = 5237.469 MB (1.7 %)
average CPU = 19205.11 sec (5:20:05.11, 2.17 %)
average Mem = 5237.469 MB (1.7 %)
total CPU = 19205.11 sec (5:20:05.11, 2.17 %)
total Mem = 5237.469 MB (1.7 %)
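The percentage in the usage lines above can be reproduced by dividing CPU seconds by the core-time reserved (elapsed time times allocated CPUs). A small sketch of that arithmetic, where `cpu_efficiency` is a hypothetical helper name:

```shell
# cpu_efficiency CPU_SECONDS ELAPSED_MINUTES ALLOCATED_CPUS
# Prints CPU time as a percentage of the reserved core-time.
cpu_efficiency() {
    awk -v cpu="$1" -v min="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", 100 * cpu / (min * 60 * n) }'
}

cpu_efficiency 19205.11 147.2 100   # prints 2.17
```

Put differently, 19205.11 CPU seconds over 147.2 minutes is roughly 2.2 cores busy on average out of the 100 allocated.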
Summary Output:
Repeat Classes
==============
Total Sequences: 7
Total Length: 119482896 bp
Class                   Count   bpMasked    %masked
=====                   =====   ========    =======
LTR                     --      --          --
    Copia               786     925858      0.77%
    Gypsy               1634    2410885     2.02%
    unknown             405     352244      0.29%
TIR                     --      --          --
    CACTA               589     405463      0.34%
    Mutator             1364    792585      0.66%
    PIF_Harbinger       284     150803      0.13%
    Tc1_Mariner         23      27241       0.02%
    hAT                 237     105587      0.09%
nonTIR                  --      --          --
    helitron            3066    1818477     1.52%
---------------------------------
    total interspersed  8388    6989143     5.85%
---------------------------------------------------------
Total                   8388    6989143     5.85%
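As a sanity check on the summary itself, the per-superfamily rows can be summed and divided by the genome length. A sketch, with `masked_percent` as a hypothetical helper; it confirms the table is internally consistent, so the low figure reflects what was annotated rather than a reporting error in the summary.

```shell
# masked_percent GENOME_LENGTH_BP
# Reads rows of "name count bpMasked percent" on stdin, skips category rows
# whose bpMasked field is "--", and prints total masked bp and its percentage.
masked_percent() {
    awk -v total="$1" '
        $3 ~ /^[0-9]+$/ { bp += $3 }
        END { printf "%d %.2f\n", bp, 100 * bp / total }'
}

masked_percent 119482896 <<'EOF'
LTR -- -- --
Copia 786 925858 0.77%
Gypsy 1634 2410885 2.02%
unknown 405 352244 0.29%
TIR -- -- --
CACTA 589 405463 0.34%
Mutator 1364 792585 0.66%
PIF_Harbinger 284 150803 0.13%
Tc1_Mariner 23 27241 0.02%
hAT 237 105587 0.09%
nonTIR -- -- --
helitron 3066 1818477 1.52%
EOF
# prints: 6989143 5.85
```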
Below is the output of the test run if that helps!
Script (using same parameters)
#!/bin/sh
#SBATCH -e edta_singularity_%j.err
#SBATCH -o edta_singularity_%j.out
#SBATCH --job-name=edta_singularity
#SBATCH --time-min=120:00:00
#SBATCH -c 100
#SBATCH --mem=300G
#SBATCH --partition=plant
#SBATCH --nodes=1
module load cluster/singularity/3.11.0
export PYTHONNOUSERSITE=1
singularity exec [path to]/EDTA.sif EDTA.pl --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice6.9.5.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --evaluate 1 --threads 10
SLURM Job Output
Job 1515039 (COMPLETED)
Name edta_singularity
Nodes plant - plant02
Command [path to]/test/test_edta.sh
PWD [path to]/test
Input /dev/null
Output [path to]/edta_singularity_1515039.out
Error [path to]/edta_singularity_1515039.err
CPU nodes = 1 cpus = 100 tasks = 1
TRES cpu=100,mem=300G,node=1,billing=100
Start 2023-10-03 13:33:09
End 2023-10-03 13:36:25
Elapsed 7.77 minutes
Limit 28800 minutes
Summary Output:
Repeat Classes
==============
Total Sequences: 1
Total Length: 1000000 bp
Class                   Count   bpMasked    %masked
=====                   =====   ========    =======
LTR                     --      --          --
    Copia               13      18315       1.83%
    Gypsy               46      107087      10.71%
    TRIM                1       129         0.01%
    unknown             1       248         0.02%
TIR                     --      --          --
    CACTA               24      20363       2.04%
    Mutator             110     47775       4.78%
    PIF_Harbinger       110     27512       2.75%
    Tc1_Mariner         124     48718       4.87%
    hAT                 34      13891       1.39%
    unknown             15      2972        0.30%
nonLTR                  --      --          --
    LINE_element        28      10614       1.06%
    SINE_element        11      2329        0.23%
nonTIR                  --      --          --
    helitron            81      57826       5.78%
---------------------------------
    total interspersed  598     357779      35.78%
---------------------------------------------------------
Total                   598     357779      35.78%
Error File:
/opt/conda/lib/python3.6/site-packages/Bio/Seq.py:2983: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning,
2023-10-03 13:35:58,608 -INFO- HMM scanning against `/opt/conda/lib/python3.6/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm`
2023-10-03 13:35:58,642 -INFO- Creating server instance (pp-1.6.4.4)
2023-10-03 13:35:58,642 -INFO- Running on Python 3.6.13 linux
2023-10-03 13:35:59,080 -INFO- pp local server started with 10 workers
2023-10-03 13:35:59,097 -INFO- Task 0 started
2023-10-03 13:35:59,098 -INFO- Task 1 started
2023-10-03 13:35:59,098 -INFO- Task 2 started
2023-10-03 13:35:59,098 -INFO- Task 3 started
2023-10-03 13:35:59,098 -INFO- Task 4 started
2023-10-03 13:35:59,099 -INFO- Task 5 started
2023-10-03 13:35:59,099 -INFO- Task 6 started
2023-10-03 13:35:59,099 -INFO- Task 7 started
2023-10-03 13:35:59,099 -INFO- Task 8 started
2023-10-03 13:35:59,100 -INFO- Task 9 started
2023-10-03 13:35:59,730 -INFO- generating gene anntations
2023-10-03 13:35:59,748 -INFO- 2 sequences classified by HMM
2023-10-03 13:35:59,748 -INFO- see protein domain sequences in `genome.cds.fa.code.rexdb.dom.faa` and annotation gff3 file in `genome.cds.fa.code.rexdb.dom.gff3`
2023-10-03 13:35:59,748 -INFO- classifying the unclassified sequences by searching against the classified ones
2023-10-03 13:35:59,761 -INFO- using the 80-80-80 rule
2023-10-03 13:35:59,761 -INFO- run CMD: `makeblastdb -in ./tmp/pass1_classified.fa -dbtype nucl`
2023-10-03 13:35:59,827 -INFO- run CMD: `blastn -query ./tmp/pass1_unclassified.fa -db ./tmp/pass1_classified.fa -out ./tmp/pass1_unclassified.fa.blastout -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp sstrand' -num_threads 10`
2023-10-03 13:35:59,940 -INFO- 1 sequences classified in pass 2
2023-10-03 13:35:59,940 -INFO- total 3 sequences classified.
2023-10-03 13:35:59,940 -INFO- see classified sequences in `genome.cds.fa.code.rexdb.cls.tsv`
2023-10-03 13:35:59,940 -INFO- writing library for RepeatMasker in `genome.cds.fa.code.rexdb.cls.lib`
2023-10-03 13:35:59,949 -INFO- writing classified protein domains in `genome.cds.fa.code.rexdb.cls.pep`
2023-10-03 13:35:59,951 -INFO- Summary of classifications:
Order     Superfamily  # of Sequences  # of Clade Sequences  # of Clades  # of full Domains
LTR       Gypsy        1               1                     1            0
Maverick  unknown      2               0                     0            0
2023-10-03 13:35:59,952 -INFO- Pipeline done.
2023-10-03 13:35:59,952 -INFO- cleaning the temporary directory ./tmp
Tue Oct 3 13:36:11 CDT 2023 Homology-based annotation of TEs using genome.fa.mod.EDTA.TElib.fa from scratch.
Out File:
Tue Oct 3 13:34:43 CDT 2023 EDTA advance filtering finished.
Tue Oct 3 13:34:43 CDT 2023 Perform EDTA final steps to generate a non-redundant comprehensive TE library:
Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.
Tue Oct 3 13:35:58 CDT 2023 Clean up TE-related sequences in the CDS file with TEsorter:
Remove CDS-related sequences in the EDTA library:
Tue Oct 3 13:36:05 CDT 2023 Combine the high-quality TE library rice6.9.5.liban with the EDTA library:
Tue Oct 3 13:36:11 CDT 2023 EDTA final stage finished! You may check out:
The final EDTA TE library: genome.fa.mod.EDTA.TElib.fa
Family names of intact TEs have been updated by rice6.9.5.liban: genome.fa.mod.EDTA.intact.gff3
Comparing to the provided library, EDTA found these novel TEs: genome.fa.mod.EDTA.TElib.novel.fa
The provided library has been incorporated into the final library: genome.fa.mod.EDTA.TElib.fa
Tue Oct 3 13:36:11 CDT 2023 Perform post-EDTA analysis for whole-genome annotation:
Tue Oct 3 13:36:17 CDT 2023 TE annotation using the EDTA library has finished! Check out:
Whole-genome TE annotation (total TE: 35.78%): genome.fa.mod.EDTA.TEanno.gff3
Whole-genome TE annotation summary: genome.fa.mod.EDTA.TEanno.sum
Low-threshold TE masking for MAKER gene annotation (masked: 16.47%): genome.fa.mod.MAKER.masked
Tue Oct 3 13:36:17 CDT 2023 Evaluate the level of inconsistency for whole-genome TE annotation (slow step):
Tue Oct 3 13:36:25 CDT 2023 Evaluation of TE annotation finished! Check out these files:
Overall: genome.fa.mod.EDTA.TE.fa.stat.all.sum
Nested: genome.fa.mod.EDTA.TE.fa.stat.nested.sum
Non-nested: genome.fa.mod.EDTA.TE.fa.stat.redun.sum
@laramiemckenna Sorry, I also don't understand why you are getting such a low percentage of TEs in Arabidopsis. The only abnormal thing I see is the use of the singularity version, which is old and outdated. You may want to try the conda version instead and use the latest GitHub code.
@oushujun The EDTA.yml file for the conda installation still specifies EDTA 2.0.1, but the rest of the repo appears to be much newer (2.1.3). Is there a newer version of this yaml file available, or details on how to combine your conda installation instructions with the newer code in the repo?
@oushujun -- do you mean that I should use the 2.1.0 version and use EDTA.pl through the current repository, which is 2.1.3?