
Comments (12)

oushujun commented on June 12, 2024

I notice you switched from conda to singularity while increasing the memory allocation. The two may use different versions of EDTA, RepeatMasker, and RMBlast. You may want to compare the versions of these packages between the two installations.

Shujun
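
A minimal sketch of that version check, assuming the conda environment is named `EDTA`, that the image is `EDTA.sif`, and that `conda` is callable inside the image (none of this is confirmed in the thread). The helper `pick_versions` is a hypothetical name; it only filters `conda list` output down to the three packages mentioned above so the two installations can be diffed.

```shell
#!/bin/sh
# pick_versions: read `conda list` output on stdin and print "name version"
# for EDTA, RepeatMasker and RMBlast only (hypothetical helper, not part of EDTA).
pick_versions() {
    awk 'tolower($1) ~ /^(edta|repeatmasker|rmblast)$/ { print tolower($1), $2 }' | sort
}

# Usage (assumed environment name and image path; adjust to your setup):
#   conda list -n EDTA | pick_versions > conda.versions
#   singularity exec [path to]/EDTA.sif conda list | pick_versions > singularity.versions
#   diff conda.versions singularity.versions
```

An empty `diff` would suggest the two installations ship the same package versions; any differing line is a candidate explanation worth ruling out.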

from edta.

laramiemckenna commented on June 12, 2024

@oushujun -- I'm sorry I didn't clarify this in the original post, but I've tried both versions available via singularity and the current version available via conda. The only thing that has worked (tentatively) is increasing cpus-per-task and sometimes memory allocation, but EDTA is not actually using all of the resources allocated (4.21% in the run above) and I can't replicate this success on larger genomes.


oushujun commented on June 12, 2024


laramiemckenna commented on June 12, 2024

@oushujun -- I have also used the most recent version. EDTA finishes without error in every case, even when it's estimating a total repeat content of 6% for a known 30-33% genome or 15% for a known 80% genome.


laramiemckenna commented on June 12, 2024

@oushujun -- I just wanted to follow up on this. Do you have any guesses as to what might be causing the behavior I described above?


oushujun commented on June 12, 2024

@laramiemckenna I am not sure. I have not seen this behavior before. Even for the small Arabidopsis genome, the EDTA annotation is reasonable and captures the major numbers. If you don't see any errors, I don't know what may go wrong. Did you test on the rice or Arabidopsis genome?


laramiemckenna commented on June 12, 2024

@oushujun I was not sure what the expected output was for the test data, but I did run EDTA on Arabidopsis TAIR10.1 using the same parameters as the successful run above (the second example in the original issue). The result is below; the expected TE content is ~21%. I'm extra confused because the exact same parameters, version, and image were used for the run that was somewhat successful, yet this one was not.

Script

#!/bin/sh
#SBATCH -e edta_singularity_%j.err
#SBATCH -o edta_singularity_%j.out
#SBATCH --job-name=edta_singularity
#SBATCH --time-min=120:00:00
#SBATCH -c 100
#SBATCH --mem=300G
#SBATCH --partition=plant
#SBATCH --nodes=1

module load cluster/singularity/3.11.0

export PYTHONNOUSERSITE=1

singularity exec [path to]/EDTA.sif EDTA.pl --genome GCA_000001735.2_TAIR10.1_genomic.fna --anno 1
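
The script above requests 100 CPUs with `-c 100`, but the `EDTA.pl` command does not pass `--threads`, so EDTA falls back to its built-in default thread count. Below is a minimal sketch of tying the two together, assuming SLURM exports `SLURM_CPUS_PER_TASK` for the job (it does when `-c` is given). The helper `edta_threads` and its fallback value are hypothetical. This only addresses CPU utilization; whether it changes the annotation result is not established in this thread.

```shell
#!/bin/sh
# edta_threads [FALLBACK]: print the thread count to hand to EDTA.pl.
# Uses the SLURM allocation when present, otherwise FALLBACK
# (default 4, an arbitrary illustrative value).
edta_threads() {
    echo "${SLURM_CPUS_PER_TASK:-${1:-4}}"
}

# Usage inside the batch script (paths as in the script above):
#   singularity exec [path to]/EDTA.sif EDTA.pl \
#       --genome GCA_000001735.2_TAIR10.1_genomic.fna --anno 1 \
#       --threads "$(edta_threads)"
```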

SLURM job output:

Job     	1514574 (COMPLETED)
Name    	edta_singularity
Submit  	sbatch edta.sh
Nodes   	plant - plant02
PWD     	[path to]/arabi_edta_test
Input   	/dev/null
Output  	[path to]/arabi_edta_test/edta_singularity_1514574.out
Error   	[path to]/arabi_edta_test/edta_singularity_1514574.err
Resources 	CPU = 100 Memory = 307200
Start   	2023-10-03 09:31:32
End     	2023-10-03 11:58:44
Elapsed 	147.2 minutes
Limit   	28800 minutes
Exit Code   	SUCCESS (0)

Usage:
min       	CPU = 19205.11 sec (5:20:05.11, 2.17 %)
min       	Mem = 5237.469 MB (1.7 %)
max       	CPU = 19205.11 sec (5:20:05.11, 2.17 %)
max       	Mem = 5237.469 MB (1.7 %)
average       	CPU = 19205.11 sec (5:20:05.11, 2.17 %)
average       	Mem = 5237.469 MB (1.7 %)
total       	CPU = 19205.11 sec (5:20:05.11, 2.17 %)
total       	Mem = 5237.469 MB (1.7 %)
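
The CPU percentage in this report can be reproduced by hand, which helps when comparing runs. This is a small sketch that assumes the figure is CPU time divided by (elapsed wall time x allocated CPUs); `cpu_efficiency` is a hypothetical helper name.

```shell
#!/bin/sh
# cpu_efficiency CPU_SECONDS ELAPSED_MINUTES ALLOCATED_CPUS
# Prints the percentage of the allocated CPU time that was actually used.
cpu_efficiency() {
    awk -v cpu="$1" -v min="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", 100 * cpu / (min * 60 * n) }'
}

# Values taken from the job report above.
cpu_efficiency 19205.11 147.2 100   # prints 2.17
```

Put differently, 19205.11 CPU-seconds over 8832 wall-clock seconds is an average of roughly two busy cores out of the 100 allocated.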

Summary Output:

Repeat Classes
==============
Total Sequences: 7
Total Length: 119482896 bp
Class                  Count        bpMasked     %masked
=====                  =====        ========     =======
LTR                    --           --           --
    Copia              786          925858       0.77%
    Gypsy              1634         2410885      2.02%
    unknown            405          352244       0.29%
TIR                    --           --           --
    CACTA              589          405463       0.34%
    Mutator            1364         792585       0.66%
    PIF_Harbinger      284          150803       0.13%
    Tc1_Mariner        23           27241        0.02%
    hAT                237          105587       0.09%
nonTIR                 --           --           --
    helitron           3066         1818477      1.52%
                      ---------------------------------
    total interspersed 8388         6989143      5.85%

---------------------------------------------------------
Total                  8388         6989143      5.85%
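
To rule out a reporting problem, the totals in a table like the one above can be cross-checked against its per-superfamily rows. This is a sketch only: `sum_te_rows` is a hypothetical helper, and it assumes the input contains just the "Repeat Classes" table in the layout shown here (indented superfamily rows with four whitespace-separated fields).

```shell
#!/bin/sh
# sum_te_rows TOTAL_LENGTH_BP < repeat_classes_table
# Sums Count and bpMasked over the superfamily rows and prints
# "count bp percent". Class headers ("LTR -- -- --"), the
# "total interspersed" row (5 fields) and the final "Total" row are skipped.
sum_te_rows() {
    awk -v total="$1" '
        NF == 4 && $1 != "Total" && $2 ~ /^[0-9]+$/ && $4 ~ /%$/ {
            count += $2
            bp += $3
        }
        END { printf "%d %d %.2f\n", count, bp, 100 * bp / total }
    '
}

# Usage (the file name is an assumption; the second argument is the
# "Total Length" reported at the top of the table):
#   sum_te_rows 119482896 < repeat_classes.txt
```

For the Arabidopsis table the rows add up to the reported totals, so the low percentage reflects what was annotated and is not an arithmetic slip in the summary.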


laramiemckenna commented on June 12, 2024

Below is the output of the test run if that helps!

Script (using same parameters)

#!/bin/sh
#SBATCH -e edta_singularity_%j.err
#SBATCH -o edta_singularity_%j.out
#SBATCH --job-name=edta_singularity
#SBATCH --time-min=120:00:00
#SBATCH -c 100
#SBATCH --mem=300G
#SBATCH --partition=plant
#SBATCH --nodes=1

module load cluster/singularity/3.11.0

export PYTHONNOUSERSITE=1

singularity exec [path to]/EDTA.sif EDTA.pl --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice6.9.5.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --evaluate 1 --threads 10

SLURM Job Output

Job     	1515039 (COMPLETED)
Name    	edta_singularity
Nodes   	plant - plant02
Command 	[path to]/test/test_edta.sh
PWD     	[path to]/test
Input   	/dev/null
Output  	[path to]/edta_singularity_1515039.out
Error   	[path to]/edta_singularity_1515039.err
CPU     	nodes = 1 cpus = 100 tasks = 1
TRES    	cpu=100,mem=300G,node=1,billing=100
Start   	2023-10-03 13:33:09
End     	2023-10-03 13:36:25
Elapsed 	7.77 minutes
Limit   	28800 minutes

Summary Output:

Repeat Classes
==============
Total Sequences: 1
Total Length: 1000000 bp
Class                  Count        bpMasked     %masked
=====                  =====        ========     =======
LTR                    --           --           --
    Copia              13           18315        1.83%
    Gypsy              46           107087       10.71%
    TRIM               1            129          0.01%
    unknown            1            248          0.02%
TIR                    --           --           --
    CACTA              24           20363        2.04%
    Mutator            110          47775        4.78%
    PIF_Harbinger      110          27512        2.75%
    Tc1_Mariner        124          48718        4.87%
    hAT                34           13891        1.39%
    unknown            15           2972         0.30%
nonLTR                 --           --           --
    LINE_element       28           10614        1.06%
    SINE_element       11           2329         0.23%
nonTIR                 --           --           --
    helitron           81           57826        5.78%
                      ---------------------------------
    total interspersed 598          357779       35.78%

---------------------------------------------------------
Total                  598          357779       35.78%

Error File:

/opt/conda/lib/python3.6/site-packages/Bio/Seq.py:2983: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  BiopythonWarning,
2023-10-03 13:35:58,608 -INFO- HMM scanning against `/opt/conda/lib/python3.6/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm`
2023-10-03 13:35:58,642 -INFO- Creating server instance (pp-1.6.4.4)
2023-10-03 13:35:58,642 -INFO- Running on Python 3.6.13 linux
2023-10-03 13:35:59,080 -INFO- pp local server started with 10 workers
2023-10-03 13:35:59,097 -INFO- Task 0 started
2023-10-03 13:35:59,098 -INFO- Task 1 started
2023-10-03 13:35:59,098 -INFO- Task 2 started
2023-10-03 13:35:59,098 -INFO- Task 3 started
2023-10-03 13:35:59,098 -INFO- Task 4 started
2023-10-03 13:35:59,099 -INFO- Task 5 started
2023-10-03 13:35:59,099 -INFO- Task 6 started
2023-10-03 13:35:59,099 -INFO- Task 7 started
2023-10-03 13:35:59,099 -INFO- Task 8 started
2023-10-03 13:35:59,100 -INFO- Task 9 started
2023-10-03 13:35:59,730 -INFO- generating gene anntations
2023-10-03 13:35:59,748 -INFO- 2 sequences classified by HMM
2023-10-03 13:35:59,748 -INFO- see protein domain sequences in `genome.cds.fa.code.rexdb.dom.faa` and annotation gff3 file in `genome.cds.fa.code.rexdb.dom.gff3`
2023-10-03 13:35:59,748 -INFO- classifying the unclassified sequences by searching against the classified ones
2023-10-03 13:35:59,761 -INFO- using the 80-80-80 rule
2023-10-03 13:35:59,761 -INFO- run CMD: `makeblastdb -in ./tmp/pass1_classified.fa -dbtype nucl`
2023-10-03 13:35:59,827 -INFO- run CMD: `blastn -query ./tmp/pass1_unclassified.fa -db ./tmp/pass1_classified.fa -out ./tmp/pass1_unclassified.fa.blastout -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp sstrand' -num_threads 10`
2023-10-03 13:35:59,940 -INFO- 1 sequences classified in pass 2
2023-10-03 13:35:59,940 -INFO- total 3 sequences classified.
2023-10-03 13:35:59,940 -INFO- see classified sequences in `genome.cds.fa.code.rexdb.cls.tsv`
2023-10-03 13:35:59,940 -INFO- writing library for RepeatMasker in `genome.cds.fa.code.rexdb.cls.lib`
2023-10-03 13:35:59,949 -INFO- writing classified protein domains in `genome.cds.fa.code.rexdb.cls.pep`
2023-10-03 13:35:59,951 -INFO- Summary of classifications:
Order           Superfamily     # of Sequences  # of Clade Sequences  # of Clades  # of full Domains
LTR             Gypsy                        1                     1            1                  0
Maverick        unknown                      2                     0            0                  0
2023-10-03 13:35:59,952 -INFO- Pipeline done.
2023-10-03 13:35:59,952 -INFO- cleaning the temporary directory ./tmp
Tue Oct  3 13:36:11 CDT 2023    Homology-based annotation of TEs using genome.fa.mod.EDTA.TElib.fa from scratch.

Out File:

Tue Oct  3 13:34:43 CDT 2023    EDTA advance filtering finished.

Tue Oct  3 13:34:43 CDT 2023    Perform EDTA final steps to generate a non-redundant comprehensive TE library:

                                Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.

Tue Oct  3 13:35:58 CDT 2023    Clean up TE-related sequences in the CDS file with TEsorter:

                                Remove CDS-related sequences in the EDTA library:

Tue Oct  3 13:36:05 CDT 2023    Combine the high-quality TE library rice6.9.5.liban with the EDTA library:

Tue Oct  3 13:36:11 CDT 2023    EDTA final stage finished! You may check out:
                                The final EDTA TE library: genome.fa.mod.EDTA.TElib.fa
                                Family names of intact TEs have been updated by rice6.9.5.liban: genome.fa.mod.EDTA.intact.gff3
                                Comparing to the provided library, EDTA found these novel TEs: genome.fa.mod.EDTA.TElib.novel.fa
                                The provided library has been incorporated into the final library: genome.fa.mod.EDTA.TElib.fa

Tue Oct  3 13:36:11 CDT 2023    Perform post-EDTA analysis for whole-genome annotation:

Tue Oct  3 13:36:17 CDT 2023    TE annotation using the EDTA library has finished! Check out:
                                Whole-genome TE annotation (total TE: 35.78%): genome.fa.mod.EDTA.TEanno.gff3
                                Whole-genome TE annotation summary: genome.fa.mod.EDTA.TEanno.sum
                                Low-threshold TE masking for MAKER gene annotation (masked: 16.47%): genome.fa.mod.MAKER.masked

Tue Oct  3 13:36:17 CDT 2023    Evaluate the level of inconsistency for whole-genome TE annotation (slow step):

Tue Oct  3 13:36:25 CDT 2023    Evaluation of TE annotation finished! Check out these files:

                                Overall: genome.fa.mod.EDTA.TE.fa.stat.all.sum
                                Nested: genome.fa.mod.EDTA.TE.fa.stat.nested.sum
                                Non-nested: genome.fa.mod.EDTA.TE.fa.stat.redun.sum


oushujun commented on June 12, 2024

@laramiemckenna Sorry, I also don't understand why you get such a low TE percentage in Arabidopsis. The only abnormal thing I see is the use of the singularity version, which is old and outdated. You may want to try the conda version instead and use the latest GitHub code.


rjohnson-ha commented on June 12, 2024

@oushujun The EDTA.yml file for the conda installation still specifies EDTA 2.0.1, but the rest of the repo appears to be much newer (2.1.3). Is there a newer version of this YAML file available, or are there details on how to combine your conda installation instructions with the newer code in the repo?


laramiemckenna commented on June 12, 2024

@oushujun -- do you mean that I should use the 2.1.0 version and use EDTA.pl through the current repository, which is 2.1.3?


oushujun commented on June 12, 2024
