oushujun / edta


Extensive de-novo TE Annotator

Home Page: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y

License: GNU General Public License v3.0

Perl 80.08% Shell 5.88% Python 11.64% PureBasic 1.38% R 1.01%
transposable-elements genome-annotation benchmarking pipeline

edta's Introduction

The Extensive de novo TE Annotator (EDTA)

Introduction

This package is developed for automated whole-genome de-novo TE annotation and benchmarking the annotation performance of TE libraries.

The EDTA package was designed to filter out false discoveries in raw TE candidates and generate a high-quality non-redundant TE library for whole-genome TE annotations. The initial search programs were selected based on benchmarks of their annotation performance against a manually curated TE library in the rice genome.

The EDTA workflow

To benchmark the annotation quality of a new library/method, I have provided the TE annotation with the curated rice TE library (v7.0.0) for the rice genome (TIGR7/MSU7 version). You may use the lib-test.pl script to compare the annotation performance of your method/library to the methods we have tested (usage shown below).

For pan-genome annotations, you need to annotate each genome with EDTA, generate a pan-genome library, then reannotate each genome with the pan-genome library. Please refer to this example for details. A sequential version of panEDTA is also included in this package.

Installation

There are several ways to install EDTA. You just need to find the one that works for your system. If you are not using macOS, you may try the conda approach before the Singularity approach.

Install with conda/mamba (Linux64)

We recommend creating a dedicated environment for EDTA:

conda create -n EDTA
conda activate EDTA
mamba install -c conda-forge -c bioconda edta
Other ways to install with conda/mamba...
  1. Install with the yml file:

Download the latest EDTA:

git clone https://github.com/oushujun/EDTA.git

Find the yml file in the EDTA folder and run:

mamba env create -f EDTA_2.2.x.yml

The default conda env name is EDTA2 specified by the first line of the yml file. You may change that to different names.

  2. Install by specifying all dependencies:

mamba create -n EDTA2.2 -c conda-forge -c bioconda -c r annosine2 biopython blast cd-hit coreutils genericrepeatfinder genometools-genometools glob2 h5py==3.9 keras==2.11 ltr_finder ltr_retriever mdust multiprocess muscle openjdk pandas perl perl-text-soundex pyarrow python r-base r-dplyr regex repeatmodeler r-ggplot2 r-here r-tidyr scikit-learn swifter tensorflow==2.11 tesorter

Usage:

conda activate EDTA
perl EDTA.pl

You can use the conda ENV to execute the latest EDTA from GitHub:

git clone https://github.com/oushujun/EDTA.git
perl ./EDTA/EDTA.pl

Install with Singularity (good for HPC users)

SINGULARITY_CACHEDIR=./
export SINGULARITY_CACHEDIR
singularity pull EDTA.sif docker://quay.io/biocontainers/edta:<tag>

Visit BioContainers repository for a list of available tags (e.g., 2.2.0--hdfd78af_1).

Usage:

export PYTHONNOUSERSITE=1
singularity exec {path}/EDTA.sif EDTA.pl --genome genome.fa [other parameters]

Where {path} is the path to the directory containing the EDTA Singularity image.

Install with Docker (good for root/macOS/Apple M-chip users)

sudo docker pull quay.io/biocontainers/edta:<tag>

Usage:

sudo docker run -v $PWD:/in -w /in quay.io/biocontainers/edta:<tag> EDTA.pl --genome genome.fa [other parameters]

Visit BioContainers repository for a list of available tags (e.g., 2.2.0--hdfd78af_1).

Note: Because only the current directory is mounted to the EDTA Docker container, you have to copy all needed files to the current directory and provide them to EDTA without path specifications. Even providing the absolute path to a file located in this folder won't work. Softlinked files are considered "with path" and won't work either. Similarly, specifying your own versions of dependency programs (e.g., RepeatMasker, RepeatModeler) won't work because they have paths.
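The restrictions in the note above can be checked before launching the container. The following is a minimal Python sketch, not part of EDTA; the function name `docker_input_problems` is hypothetical, and the check is deliberately conservative (even `./genome.fa` is flagged as having a path).

```python
import os


def docker_input_problems(filenames, workdir="."):
    """Return a list of problems for files that will be passed to EDTA in Docker.

    Hypothetical helper: each input must be a bare file name, must sit in the
    mounted working directory, and must not be a symbolic link.
    """
    problems = []
    for name in filenames:
        # Anything with a directory component counts as "with path".
        if os.path.basename(name) != name:
            problems.append(f"{name}: contains a path; pass a bare file name")
            continue
        full = os.path.join(workdir, name)
        if os.path.islink(full):
            problems.append(f"{name}: is a symbolic link; copy the real file instead")
        elif not os.path.isfile(full):
            problems.append(f"{name}: not found in the working directory")
    return problems
```

An empty list means the inputs satisfy the three conditions described in the note; it does not guarantee that the files themselves are valid.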

Testing

You should test the EDTA pipeline with the 1-Mb toy genome, which takes about five minutes. If your test finishes without any errors (warnings are OK), then EDTA should be correctly installed. If the test is OK but you encounter errors with your data, you should check your own data for any formatting/naming mistakes.

cd ./EDTA/test
perl ../EDTA.pl --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10

If your test fails, you may check out this collection of issues for possible reasons and solutions. If none works, you may open a new issue.

Inputs

Required: The genome file [FASTA]. Please make sure sequence names are short (<=13 characters) and simple (i.e., letters, numbers, and underscores).
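A quick way to spot names that break this rule is to scan the FASTA headers before running EDTA. The sketch below is an illustration only (the function name is hypothetical). It treats the whole header line as the name, which is an assumption, and it also reports duplicates because the tips further down ask for unique names.

```python
import re

# Letters, numbers, and underscores only, 1 to 13 characters.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]{1,13}")


def problem_sequence_names(fasta_lines):
    """Return header names that are too long, not simple, or duplicated."""
    seen = set()
    bad = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        name = line[1:].strip()
        if NAME_PATTERN.fullmatch(name) is None or name in seen:
            bad.append(name)
        seen.add(name)
    return bad
```

Typical use would be `problem_sequence_names(open("genome.fa"))`; an empty list means every header passed the check.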

Optional:

  1. Coding sequence of the species or a closely related species [FASTA]. This file helps to purge gene sequences in the TE library.
  2. Known gene positions of this version of the genome assembly [BED]. Coordinates specified in this file will be excluded from TE annotation to avoid over-masking.
  3. Curated TE library of the species [FASTA]. This file is trusted 100%. Please make sure it's curated. If you only have a couple of curated sequences, that's also good. It doesn't need to be complete. Providing curated TE sequences, especially for those under-annotated TE types (i.e., SINEs and LINEs), will greatly improve the annotation quality. For more information, please visit this wiki page: How to prepare a curated library to maximize the efficacy of EDTA

Outputs

A non-redundant TE library: $genome.mod.EDTA.TElib.fa. The curated library will be included in this file if provided. The rice library will be (partially) included if --force 1 is specified. TEs are classified to the superfamily level using the three-letter naming system reported in Wicker et al. (2007). Each sequence can be considered a TE family. To convert between classification systems, please refer to the TE sequence ontology file.

Optional 1:

  1. Novel TE families: $genome.mod.EDTA.TElib.novel.fa. This file contains TE sequences that are not included in the curated library (--curatedlib required).

Optional 2, when you specify the --anno 1 parameter, you will get:

  2. Whole-genome TE annotation: $genome.mod.EDTA.TEanno.gff3. This file contains both structurally intact and fragmented TE annotations.
  3. Summary of whole-genome TE annotation: $genome.mod.EDTA.TEanno.sum.
  4. Low-threshold TE masking: $genome.mod.MAKER.masked. This is a genome file with only long TEs (>=1 kb) being masked. You may use this for de novo gene annotations. In practice, this approach will reduce overmasking of genic regions, which can improve gene prediction quality. However, the initial gene models will still contain TEs and need further filtering.
  5. Annotation inconsistency for simple TEs: $genome.mod.EDTA.TE.fa.stat.redun.sum.
  6. Annotation inconsistency for nested TEs: $genome.mod.EDTA.TE.fa.stat.nested.sum.
  7. Overall annotation inconsistency: $genome.mod.EDTA.TE.fa.stat.all.sum.
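For scripting around a run, it can help to know in advance which files to expect. The following Python sketch (the function name is hypothetical) builds the names listed above, assuming `$genome` stands for the genome file name as given to `--genome`. The three inconsistency summaries are left out of this sketch.

```python
def edta_output_names(genome, curatedlib=False, anno=False):
    """Return the output file names expected for a given genome file name.

    Assumption: `genome` is the file name that replaces $genome in the
    names listed in the Outputs section.
    """
    prefix = f"{genome}.mod"
    names = [f"{prefix}.EDTA.TElib.fa"]  # always produced
    if curatedlib:
        # Novel families are only reported when --curatedlib is given.
        names.append(f"{prefix}.EDTA.TElib.novel.fa")
    if anno:
        # Whole-genome annotation outputs require --anno 1.
        names += [
            f"{prefix}.EDTA.TEanno.gff3",
            f"{prefix}.EDTA.TEanno.sum",
            f"{prefix}.MAKER.masked",
        ]
    return names
```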

EDTA Usage

From head to toe

You got a genome and you want to get a high-quality TE annotation:

perl EDTA.pl [options]
  --genome [File]		The genome FASTA file. Required.
  --species [Rice|Maize|others]	Specify the species for identification of TIR candidates. Default: others
  --step [all|filter|final|anno]	Specify which steps you want to run EDTA.
				 all: run the entire pipeline (default)
				 filter: start from raw TEs to the end.
				 final: start from filtered TEs to finalizing the run.
				 anno: perform whole-genome annotation/analysis after TE library construction.
  --overwrite [0|1]		If previous results are found, decide to overwrite (1, rerun) or not (0, default).
  --cds [File]		Provide a FASTA file containing the coding sequence (no introns, UTRs, nor TEs) of this genome or its close relative.
  --curatedlib [file]	Provide a curated library to keep consistent naming and classification for known TEs.
			All TEs in this file will be trusted 100%, so please ONLY provide MANUALLY CURATED ones here.
			 This option is not mandatory. It's totally OK if no file is provided (default).
  --sensitive [0|1]		Use RepeatModeler to identify remaining TEs (1) or not (0, default).
			 This step is very slow and MAY help to recover some TEs.
  --anno [0|1]	Perform (1) or not perform (0, default) whole-genome TE annotation after TE library construction.
  --rmout [File]	Provide your own homology-based TE annotation instead of using the EDTA library for masking.
		File is in RepeatMasker .out format. This file will be merged with the structural-based TE annotation. (--anno 1 required).
		Default: use the EDTA library for annotation.
  --evaluate [0|1]	Evaluate (1) classification consistency of the TE annotation. (--anno 1 required). Default: 0.
		 This step is slow and does not affect the annotation result.
  --exclude	[File]	Exclude regions (bed format) from TE masking in the MAKER.masked output. Default: undef. (--anno 1 required).
  --u [float]	Neutral mutation rate to calculate the age of intact LTR elements.
		 Intact LTR age is found in this file: *EDTA_raw/LTR/*.pass.list. Default: 1.3e-8 (per bp per year, from rice).
  --threads|-t	[int]	Number of threads to run this script (default: 4)
  --help|-h	Display this help info
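Several options in the help text above only make sense together (for example, `--rmout`, `--evaluate`, and `--exclude` are marked "--anno 1 required"). The Python sketch below checks a set of options against those stated constraints before a job is submitted. It is an illustration, not EDTA's own argument handling, and the function name and dict layout are assumptions.

```python
VALID_STEPS = {"all", "filter", "final", "anno"}
NEEDS_ANNO = ("rmout", "evaluate", "exclude")  # marked "--anno 1 required"


def option_problems(opts):
    """Return a list of problems found in `opts`.

    `opts` maps option names without dashes to values,
    e.g. {"genome": "genome.fa", "anno": 1}.
    """
    problems = []
    if not opts.get("genome"):
        problems.append("--genome is required")
    step = opts.get("step", "all")
    if step not in VALID_STEPS:
        problems.append(f"--step {step} is not one of {sorted(VALID_STEPS)}")
    anno_on = str(opts.get("anno", 0)) == "1"
    for name in NEEDS_ANNO:
        value = opts.get(name)
        # An unset option or an explicit 0 does not need --anno 1.
        if value not in (None, 0, "0") and not anno_on:
            problems.append(f"--{name} requires --anno 1")
    return problems
```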

Divide and conquer

Identify intact elements of a particular TE type:

1. Get raw TEs from a genome (specify --type ltr|tir|helitron in different runs)

perl EDTA_raw.pl [options]
  --genome	[File]	The genome FASTA
  --species [Rice|Maize|others]	Specify the species for identification of TIR candidates. Default: others
  --type	[ltr|tir|helitron|line|sine|all]
			Specify which type of raw TE candidates you want to get. Default: all
  --overwrite	[0|1]	If previous results are found, decide to overwrite (1, rerun) or not (0, default).
  --threads|-t	[int]	Number of threads to run this script
  --help|-h	Display this help info

2. Finish the rest of the EDTA analysis (specify --overwrite 0 and it will automatically pick up existing results in the work folder)

perl EDTA.pl --overwrite 0 [options]
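The two steps above can be written out as a list of command lines, one raw run per TE type followed by the final run. The Python sketch below only prints the commands; it does not run them and makes no claim about whether the raw runs can safely share a folder at the same time. The function name is hypothetical.

```python
import shlex

RAW_TYPES = ("ltr", "tir", "helitron")


def divide_and_conquer_commands(genome, threads=4, types=RAW_TYPES):
    """Return shell command strings: one EDTA_raw.pl run per type, then EDTA.pl."""
    raw = [
        ["perl", "EDTA_raw.pl", "--genome", genome, "--type", t, "--threads", str(threads)]
        for t in types
    ]
    # --overwrite 0 lets the final run pick up the existing raw results.
    final = ["perl", "EDTA.pl", "--genome", genome, "--overwrite", "0", "--threads", str(threads)]
    return [shlex.join(cmd) for cmd in raw + [final]]


if __name__ == "__main__":
    for command in divide_and_conquer_commands("genome.fa", threads=8):
        print(command)
```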

Protips and self-diagnosis

  1. It's never said enough. You should tidy up all your sequence names before ANY analysis. Keep them short, simple, and unique.
  2. Run it in a fast drive (i.e., SSD) because RepeatMasker/RepeatModeler is I/O intense.
  3. Check out the Wiki page for more information and frequently asked questions.

panEDTA usage

This is the serial version of panEDTA. Each genome will be annotated sequentially and then combined with the panEDTA functionality. Existing EDTA annotations of genomes (EDTA run with --anno 1) will be recognized and reused. One way to accelerate the pan-genome annotation is to run the EDTA annotation of each genome separately and in parallel, then run panEDTA to finish the remaining steps. You may want to save the GFF files and the sum file of the EDTA results for each genome because they will be overwritten by panEDTA. To help filter out gene-related sequences, at least one CDS file is required. Please read the wiki for the CDS requirements. You may want to check out the toy example in the ./test folder to familiarize yourself.

sh panEDTA.sh -g genome_list.txt -c cds.fasta -t 10
    -g	A list of genome files with paths accessible from the working directory.
                Required: You can provide only a list of genomes in this file (one column, one genome each row).
                Optional: You can also provide both genomes and CDS files in this file (two columns, one genome and one CDS each row).
                    Missing of CDS files (eg, for some or all genomes) is allowed.
    -c	Required. Coding sequence files in fasta format.
                The CDS file provided via this parameter will fill in the missing CDS files in the genome list.
                If no CDS files are provided in the genome list, then this CDS file will be used on all genomes.
    -l	Optional. A manually curated, non-redundant library following the RepeatMasker naming format.
    -f	Minimum number of full-length TE copies in individual genomes to be kept as candidate TEs for the pangenome.
                Lower is more inclusive, and will ↑ library size, ↑ sensitivity, and ↑ inconsistency.
                Higher is more stringent, and will ↓ library size, ↓ sensitivity, and ↓ inconsistency.
                Default: 3.
    -t	Number of CPUs to run panEDTA. Default: 10.
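The rule for the genome list and the `-c` fallback can be illustrated with a short Python sketch. It assumes the columns are separated by whitespace, which the help text does not state, and the function name is hypothetical; panEDTA itself may parse the file differently.

```python
def read_genome_list(lines, default_cds):
    """Pair each genome with a CDS file, using `default_cds` where none is listed.

    `lines` is an iterable of rows from the genome list (one or two columns).
    `default_cds` plays the role of the file given with -c.
    """
    pairs = []
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue  # skip blank rows
        genome = fields[0]
        cds = fields[1] if len(fields) > 1 else default_cds
        pairs.append((genome, cds))
    return pairs
```

For a list with rows `A.fa A.cds.fa` and `B.fa`, and `-c cds.fasta`, the second genome would be paired with `cds.fasta`.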

Benchmark

If you have developed a new TE method or obtained a TE library and want to compare its annotation performance to the methods we have tested, you can:

1. Annotate the rice genome with your test library:

RepeatMasker -e ncbi -pa 36 -q -no_is -norna -nolow -div 40 -lib custom.TE.lib.fasta -cutoff 225 rice_genome.fasta

2. Test the annotation performance of a particular TE category:

perl lib-test.pl -genome genome.fasta -std genome.stdlib.RM.out -tst genome.testlib.RM.out -cat [options]
    -genome	[file]	FASTA format genome sequence
    -std	[file]	RepeatMasker .out file of the standard library
    -tst	[file]	RepeatMasker .out file of the test library
    -cat	[string]	Testing TE category. Use one of LTR|nonLTR|LINE|SINE|TIR|MITE|Helitron|Total|Classified
    -N	[0|1]	Include Ns in total length of the genome. Default: 0 (not include Ns).
    -unknown	[0|1]	Include unknown annotations to the testing category. This should be used when
                    the test library has no classification and you assume they all belong to the
                    target category specified by -cat. Default: 0 (not include unknowns)

e.g.

perl lib-test.pl -genome rice_genome.fasta -std ./EDTA/database/Rice_MSU7.fasta.std7.0.0.out -tst rice_genome.fasta.test.out -cat LTR

Note: the -std and -tst files should be named differently even if they are placed in different folders.
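Both constraints on a benchmark run (a valid `-cat` value and differently named `-std` and `-tst` files) are easy to check up front. The Python sketch below assembles the `lib-test.pl` argument list only after both checks pass; the function name is hypothetical and the script itself is not invoked.

```python
import os

CATEGORIES = ("LTR", "nonLTR", "LINE", "SINE", "TIR", "MITE",
              "Helitron", "Total", "Classified")


def lib_test_command(genome, std_out, tst_out, category):
    """Return the lib-test.pl argument list, or raise ValueError on bad input."""
    if category not in CATEGORIES:
        raise ValueError(f"-cat must be one of {', '.join(CATEGORIES)}")
    # The file names must differ even when the folders differ.
    if os.path.basename(std_out) == os.path.basename(tst_out):
        raise ValueError("-std and -tst files must have different names")
    return ["perl", "lib-test.pl", "-genome", genome,
            "-std", std_out, "-tst", tst_out, "-cat", category]
```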

Citations

Please cite our paper if you find EDTA useful:

Ou S., Su W., Liao Y., Chougule K., Agda J. R. A., Hellinga A. J., Lugo C. S. B., Elliott T. A., Ware D., Peterson T., Jiang N.✉, Hirsch C. N.✉ and Hufford M. B.✉ (2019). Benchmarking Transposable Element Annotation Methods for Creation of a Streamlined, Comprehensive Pipeline. Genome Biol. 20(1): 275.

Please cite the panEDTA paper if you are using the pan-genome functionality:

Ou S., Collins T., Qiu Y., Seetharam A., Menard C., Manchanda N., Gent J., Schatz M., Anderson S., Hufford M.✉, Hirsch C.✉ (2022). Differences in activity and stability drive transposable element variation in tropical and temperate maize. bioRxiv

Please also cite the software packages that were used in EDTA, listed in the EDTA/bin directory.

Other resources

You may download the rice genome here (the "all.con" file).

Questions and Issues

You may want to check out this Q&A page for best practices and answers to common questions. If you have other issues with installation and usage, please check whether similar issues have been reported in Issues or open a new issue. If you are (looking for) happy users, please read or write successful cases here.

Acknowledgements

I want to thank Jacques Dainat for contributing the EDTA conda recipe and improving the code. I also want to thank Qiushi Li, Zhigui Bao, Philipp Bayer, Nick Carleson, @aderzelle, Sanzhen Liu, Zhougeng Xu, Shun Wang, Nancy Manchanda, Eric Burgueño, Sergei Ryazansky, and many others for testing, debugging, and improving the EDTA pipeline.

edta's People

Contributors

baozg, cwb14, eburgueno, eernst, hkchi, hyphaltip, juke34, lutianyu2001, oushujun, rossibarra

edta's Issues

Debug lines found in EDTA.pl

We tried to run the pipeline using our genome assembly fasta file, xxx.fa. Unfortunately,
the error message showed up "xxx.fa.masked does not contain any sequences!"
What's going on?
Apparently, at line 48 of the code of EDTA.pl , "if (0){", should be changed to "if (1){".

Double ID annotation?

Hello,
not sure if it's EDTA or RepeatMasker, but I ran on the EDTA output

RepeatMasker -pa 4 -no_is -norna -nolow -div 40 -lib genome.sixLongest.fa.EDTA.TElib.fa -cutoff 225 -gff genome.FLYE.sixLongest.fa
buildSummary.pl genome.FLYE.sixLongest.fa.out > summary.tbl

and some sequences in the output repeat tables have a "doubled" ID like
TE_00001277_INT-int while the sequence ID is
>TE_00001277_INT#LTR/Gypsy

Any idea of where the "int" after the "INT" comes from? I concede it seems absolutely benign but I want to be sure it doesn't clue to a bigger problem.

Thank you

Call_seq.pl script giving an empty output.

When running EDTA_raw.pl script the output for both TIR and Helitron raw fasta files are empty. I think the problem is at the call_seq.pl script because the TIR.ext30.list gives an output such as:

000000F:152380..154395 000000F:152350..154425
000000F:292101..295163 000000F:292071..295193
000000F:429115..433751 000000F:429085..433781
000000F:433252..438167 000000F:433222..438197

But then the TIR.ext30.fa is empty

I tried to call the script alone:
perl $call_seq $seq.ext$extlen.list -C $genome
but it doesn't give any output either.

Same situation applies for HelitronScanner.raw.ext.list and HelitronScanner.raw.ext.fa

The fasta header format is as follows:

000160F 000285294:B000285294:B000284495:B~000284495:B ctg_linear 11256 10841

but even with no spaces the problem persists:

000161F_000058666:E000058666:E000414072:B~000414072:B_ctg_linear_15599_15577

Any help or suggestion will be appreciated.
Thanks!
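For readers inspecting a similar list file, the lines quoted in the report above can be parsed with a few lines of Python. Reading the first column as the original interval and the second as the extended one is an assumption drawn from the numbers (each pair differs by 30 on both sides, matching the "ext30" in the file name); the function names are hypothetical and this is not the logic of call_seq.pl.

```python
import re

# "name:start..end"; the greedy name part tolerates colons inside the name.
INTERVAL = re.compile(r"(?P<seq>.+):(?P<start>\d+)\.\.(?P<end>\d+)")


def parse_interval(text):
    """Split 'name:start..end' into (name, start, end)."""
    match = INTERVAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a name:start..end interval: {text!r}")
    return match.group("seq"), int(match.group("start")), int(match.group("end"))


def extension_lengths(list_line):
    """Return (left, right) extension in bp between the two intervals on a line."""
    original, extended = list_line.split()
    seq_a, start_a, end_a = parse_interval(original)
    seq_b, start_b, end_b = parse_interval(extended)
    if seq_a != seq_b:
        raise ValueError("the two intervals name different sequences")
    return start_a - start_b, end_b - end_a
```

A line that fails to parse, or a sequence name in the list that does not match any FASTA header, would be a natural first thing to check when the extracted FASTA comes out empty.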

Fails on identification of TIRs

Shujun,

I've installed the Docker version on our HPC. EDTA progresses through the LTR finding, but then crashes when trying to identify the TIRs. I'm pasting the full Slurm log below. Any help would be greatly appreciated.
Thanks, Jeff

> ##### Shujun Ou ([email protected])             ####
> ########################################################
>
>
>
> Fri Jan 24 02:09:53 UTC 2020    Dependency checking:
>                                 All passed!
> Fri Jan 24 02:10:03 UTC 2020    Obtain raw TE libraries using various structure-based programs:
> Fri Jan 24 02:10:03 UTC 2020    EDTA_raw: Check files and dependencies, prepare working directories.
>
> Fri Jan 24 02:10:03 UTC 2020    Start to find LTR candidates.
>
> Fri Jan 24 02:10:03 UTC 2020    Identify LTR retrotransposon candidates from scratch.
>
> Warning: LOC list ordered_atriplex_hortensis_04Apr2019_hkF1T_namedcorrectly_clean_00.fasta.mod.ltrTE.veryfalse is empty.
> Fri Jan 24 02:27:35 UTC 2020    Finish finding LTR candidates.
>
> Fri Jan 24 02:27:35 UTC 2020    Start to find TIR candidates.
>
> Fri Jan 24 02:27:35 UTC 2020    Identify TIR candidates from scratch.
>
> Species: others
> 2020-01-24 02:55:41.202424: F tensorflow/python/lib/core/bfloat16.cc:675] Check failed: PyBfloat16_Type.tp_base != nullptr
> Aborted (core dumped)
> cat: '*-+-DTA.fa': No such file or directory
> cat: '*-+-DTC.fa': No such file or directory
> cat: '*-+-DTH.fa': No such file or directory
> cat: '*-+-DTM.fa': No such file or directory
> cat: '*-+-DTT.fa': No such file or directory
> cat: '*-+-NonTIR.fa': No such file or directory
> cat: '*-+-*-+-*.gff3': No such file or directory
> rm: cannot remove '*-+-*-+-*.gff3': No such file or directory
> Traceback (most recent call last):
>   File "/EDTA/bin/TIR-Learner2.4/Module3_New/CombineAll.py", line 75, in <module>
>     f_m3=removeDupinSingle("%s.gff3"%(genome_Name+spliter+"Module3"))
>   File "/EDTA/bin/TIR-Learner2.4/Module3_New/CombineAll.py", line 57, in removeDupinSingle
>     f=pd.read_csv(file,header=None,sep="\t") #shujun
>   File "/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
>     return _read(filepath_or_buffer, kwds)
>   File "/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py", line 457, in _read
>     parser = TextFileReader(fp_or_buf, **kwds)
>   File "/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
>     self._make_engine(self.engine)
>   File "/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
>     self._engine = CParserWrapper(self.f, **self.options)
>   File "/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py", line 1917, in __init__
>     self._reader = parsers.TextReader(src, **kwds)
>   File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
> pandas.errors.EmptyDataError: No columns to parse from file
> multiprocessing.pool.RemoteTraceback:
> """
> Traceback (most recent call last):
>   File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 119, in worker
>     result = (True, func(*args, **kwds))
>   File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
>     return list(map(*args))
>   File "/EDTA/bin/TIR-Learner2.4/Module3/GetAllSeq.py", line 32, in GetListFromFile
>     f=open(file,"r+")
> FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
> """
>
> The above exception was the direct cause of the following exception:
>
> Traceback (most recent call last):
>   File "/EDTA/bin/TIR-Learner2.4/Module3/GetAllSeq.py", line 63, in <module>
>     pool.map(GetListFromFile,fileList) #shujun
>   File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 266, in map
>     return self._map_async(func, iterable, mapstar, chunksize).get()
>   File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 644, in get
>     raise self._value
> FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
> mv: cannot stat 'TIR-Learner/*FinalAnn*.gff3': No such file or directory
> mv: cannot stat 'TIR-Learner/*FinalAnn*.fa': No such file or directory
> Can't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.fa: No such file or directory at /EDTA/util/rename_tirlearner.pl line 18.
> Warning: LOC list ordered_atriplex_hortensis_04Apr2019_hkF1T_namedcorrectly_clean_00.fasta.TIR.ext30.list is empty.
>
> Error: Error while loading sequenceCan't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.gff3: No such file or directory.
> Warning: The TIR result file has 0 bp!
>
> Fri Jan 24 02:55:51 UTC 2020    Start to find Helitron candidates.
>
> Fri Jan 24 02:55:51 UTC 2020    Identify Helitron candidates from scratch.
>
> Fri Jan 24 03:49:16 UTC 2020    Finish finding Helitron candidates.
>
> Fri Jan 24 03:49:16 UTC 2020    Execution of EDTA_raw.pl is finished!
>
> ERROR: Raw TIR results not found in ordered_atriplex_hortensis_04Apr2019_hkF1T_namedcorrectly_clean_00.fasta.EDTA.raw/ordered_atriplex_hortensis_04Apr2019_hkF1T_namedcorrectly_clean_00.fasta.TIR.raw.fa at /EDTA/EDTA.pl line 285.
> cleaning up image

Can't find label error

Dear Shujun,

Please see this error,

Dependency checking: All passed!
Can't find label ALL at /data/home/qiushi_volumn1/programs/EDTA/EDTA.pl line 118.

perl version info:
This is perl 5, version 26, subversion 2 (v5.26.2) built for x86_64-linux-thread-multi

Many thanks~
Your loyal fans

Crashing with TIR-Learner

Hi,

I copied and pasted the installation instructions from the README and am running the script in the active EDTA environment. It seems that the EDTA.pl script chokes trying to use TIR-Learner. Looking at my output, all the correct folders and such are there. After crashing, the Helitron, MITE, and TIR folders are empty but the LTR folder is not. The only file in the parent output folder is genome.fasta.LTR.raw.fa.

Is there a way to run the Perl pipeline script but just not use TIR-Learner, or even just not call TIRs? I'm still interested in the other features, and even if I could just use EDTA for Helitrons, LTRs, MITEs, filtering, consensus calling, and repeat classifying I would be happy.

The lines before the crash start with what's seen in #2 (comment). Then it's a traceback starting from ~/bin/EDTA/bin/TIR-Learner1.12/Module1/Fullcov.py, line 52, in <module> ProcessHomology(genome_Name). After that, there's some cryptic errors including
cat: '*DTA-+-select.fa': No such file or directory
cat: '*-+-*-+-*.gff3': No such file or directory
There's a few more error traces after that, with each Traceback followed by various errors from files not being found by rm, cp, mv, cat.

  • TIR-Learner1.12/Module1 (above)
  • TIR-Learner1.12/Module1/Lowcomp_M1.py
  • TIR-Learner1.12/Module2/Lowcomp_M2.py
  • TIR-Learner1.12/

Lastly, in the last few lines before the crash, I get these lines which tell me that it certainly is a problem with TIR-Learner
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
mv: cannot stat 'TIR-Learner/*FinalAnn.gff3': No such file or directory
mv: cannot stat 'TIR-Learner/*FinalAnn.fa': No such file or directory
cp: cannot stat 'TIR-Learner-Result/TIR-Learner_FinalAnn.fa': No such file or directory
Error: TIR results not found!

ERROR: Raw TIR results not found in genome.fasta.EDTA.raw/genome.fasta.TIR.raw.fa at ~bin/EDTA/EDTA.pl line 145.

While bug testing I've just been using the first two scaffolds of my genome. That file is attached.

Thanks!

PR-102_JGI_twoscafs.fasta.zip

EDTA ignoring the parameter -threads

Hello,
I noticed that blastx and TIR learner ignore the -threads settings, and take all the available threads.

EDIT: sorry, was my mistake, closing

Can't locate object method "end" via package "Thread::Queue"

Hello,
I used EDTA but I got an error when using the following script:

perl /PARA/pp811/anaconda3/bin/EDTA/EDTA.pl -genome ref.fasta -cds CDS.fasta -anno 1 -evaluate 1

And then I got the following error output:

Can't locate object method "end" via package "Thread::Queue" at /PARA/pp811/anaconda3/bin/EDTA/bin/LTR_FINDER_parallel/LTR_FINDER_parallel line 115, <List> line 1. cat: ref.fasta.finder.combine.scn: No such file or directory Can't locate object method "end" via package "Thread::Queue" at /PARA/pp811/anaconda3/bin/EDTA/bin/LTR_retriever/bin/LTR.identifier.pl line 125. cp: cannot stat ref.fasta.mod.retriever.scn.adj': No such file or directory
awk: cmd. line:1: fatal: cannot open file ref.fasta.pass.list' for reading (No such file or directory) Warning: LOC list - is empty. Error: Error while loading sequencecp: cannot stat ref.fasta.LTRlib.fa': No such file or directory
cp: cannot stat `VF36.GPM.fasta.LTRlib.fa': No such file or directory
Error: LTR results not found!

ERROR: Raw LTR results not found in ref.fasta.EDTA.raw/ref.fasta.LTR.raw.fa at /PARA/pp811/anaconda3/bin/EDTA/EDTA.pl line 284.

Of all the results I've gotten so far, only LTR raw file,both TIR and Helitron raw fasta files are empty.
Any help or suggestion will be appreciated.
Thanks!

RepeatModeler in conda

Hi, all

The EDTA pipeline relies on the conda version of RepeatModeler, which has a known issue: it seems unable to produce consensi.fa.
Dfam-consortium/RepeatModeler#38

If you want to find TEs in your genome with RepeatModeler, please install the software yourself, point -repeatmodeler and -repeatmasker to the install path, and then use consensi.fa.classified as your RepeatModeler raw FASTA.

Running EDTA efficiently on a cluster

Hi Shujun,

Just a quick question. I have completed some initial tests on a small fraction (~150Mb) of a ~5Gb genome and am ready to give the real thing a try! However, as I'm sure is the case for many users, I have to tactically dodge run-time limits whilst maximising the resources I can use on the various queues on my cluster. In my case for example I can run a job for 24 hours with a lot of resources, or a job for 10 days with limited resources. So one question I have is:

Can I independently and simultaneously run the TE library steps for tir, ltr and helitron (i.e. divide and conquer) into the same output folder and then use these for the final steps in a later job? Or is there something that would get confused if I did this?

Also if you have any other tips for maximising efficiency when constrained by cluster resources I'd be very happy to hear them. Specifically if you could give some guidance as to whether parallelism or memory are more important for each step that would already be very helpful!

Best wishes, and thanks again for an awesome tool and paper!

Dan

Fail on identification of TIRs (EDTA v 1.8) with the step-by-step installation

Hello,
I'm currently trying to run EDTA on the cluster of my laboratory, and I encounter an issue that looks similar to the one listed below in the EDTA issues, except I'm running version 1.8. I installed it through the step-by-step conda installation (for some reason, the one-line conda installation doesn't want to work on my devices).
I encounter this error :

Mon Feb 10 17:57:54 CET 2020	EDTA_raw: Check dependencies, prepare working directories.

Mon Feb 10 17:58:14 CET 2020	Start to find LTR candidates.

Mon Feb 10 17:58:14 CET 2020	Identify LTR retrotransposon candidates from scratch.

Mon Feb 10 18:39:20 CET 2020	Finish finding LTR candidates.

Mon Feb 10 18:39:20 CET 2020	Start to find TIR candidates.

Mon Feb 10 18:39:20 CET 2020	Identify TIR candidates from scratch.

Species: others
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
  File "/beegfs/data/tkastylevsky/programs/EDTA/bin/TIR-Learner2.4/Module3_New/getDataset.py", line 11, in <module>
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/sklearn/preprocessing/__init__.py", line 8, in <module>
    from .data import Binarizer
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 18, in <module>
    from scipy import stats
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/scipy/stats/__init__.py", line 348, in <module>
    from .stats import *
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/scipy/stats/stats.py", line 177, in <module>
    from . import distributions
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/scipy/stats/distributions.py", line 13, in <module>
    from . import _continuous_distns
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/scipy/stats/_continuous_distns.py", line 15, in <module>
    from scipy._lib._numpy_compat import broadcast_to
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/scipy/_lib/_numpy_compat.py", line 10, in <module>
    from numpy.testing.nosetester import import_nose
ModuleNotFoundError: No module named 'numpy.testing.nosetester'
cat: '*-+-DTA.fa': No such file or directory
cat: '*-+-DTC.fa': No such file or directory
cat: '*-+-DTH.fa': No such file or directory
cat: '*-+-DTM.fa': No such file or directory
cat: '*-+-DTT.fa': No such file or directory
cat: '*-+-NonTIR.fa': No such file or directory
cat: '*-+-*-+-*.gff3': No such file or directory
rm: cannot remove '*-+-*-+-*.gff3': No such file or directory
Traceback (most recent call last):
  File "/beegfs/data/tkastylevsky/programs/EDTA/bin/TIR-Learner2.4/Module3_New/CombineAll.py", line 75, in <module>
    f_m3=removeDupinSingle("%s.gff3"%(genome_Name+spliter+"Module3"))
  File "/beegfs/data/tkastylevsky/programs/EDTA/bin/TIR-Learner2.4/Module3_New/CombineAll.py", line 57, in removeDupinSingle
    f=pd.read_csv(file,header=None,sep="\t") #shujun
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 532, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/beegfs/data/tkastylevsky/programs/EDTA/bin/TIR-Learner2.4/Module3/GetAllSeq.py", line 32, in GetListFromFile
    f=open(file,"r+")
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/beegfs/data/tkastylevsky/programs/EDTA/bin/TIR-Learner2.4/Module3/GetAllSeq.py", line 63, in <module>
    pool.map(GetListFromFile,fileList) #shujun
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 288, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 670, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
mv: cannot stat 'TIR-Learner/*FinalAnn*.gff3': No such file or directory
mv: cannot stat 'TIR-Learner/*FinalAnn*.fa': No such file or directory
Can't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.fa: No such file or directory at /beegfs/data/tkastylevsky/programs/EDTA/util/rename_tirlearner.pl line 18.
Warning: LOC list galgal6_chr1.fa.mod.TIR.ext30.list is empty.

Error: Error while loading sequenceCan't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.gff3: No such file or directory.
Warning: The TIR result file has 0 bp!

Mon Feb 10 21:29:38 CET 2020	Start to find Helitron candidates.

Mon Feb 10 21:29:38 CET 2020	Identify Helitron candidates from scratch.

Tue Feb 11 01:27:12 CET 2020	Finish finding Helitron candidates.

Tue Feb 11 01:27:12 CET 2020	Execution of EDTA_raw.pl is finished!

ERROR: Raw TIR results not found in galgal6_chr1.fa.mod.EDTA.raw/galgal6_chr1.fa.mod.TIR.raw.fa at /beegfs/data/tkastylevsky/programs/EDTA/EDTA.pl line 368.

thanks in advance,
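
For anyone reading this later: the first traceback ends with SciPy trying to import `numpy.testing.nosetester`. As far as I know, recent NumPy releases (1.18 and later) no longer ship that module, so an older SciPy next to a newer NumPy in the same environment is one plausible cause. The sketch below only checks whether a module path can be located in the active environment; `has_module` is a hypothetical helper, not part of EDTA or TIR-Learner.

```python
import importlib.util


def has_module(dotted_name):
    """Return True if `dotted_name` can be located in the current environment."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. "numpy") is missing or not a package.
        return False


if __name__ == "__main__":
    for name in ("numpy", "scipy", "numpy.testing.nosetester"):
        print(name, "found" if has_module(name) else "missing")
```

If the last line reports "missing" while `numpy` and `scipy` are found, comparing the installed NumPy and SciPy versions inside the EDTA environment would be a reasonable next step.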

dirname: missing operand

I can successfully run these two commands:
perl $EDTA_raw --genome $TAIR10_mod -species others -type ltr --overwrite 0 --threads 8
perl $EDTA_raw --genome $TAIR10_mod -species others -type helitron --overwrite 0 --threads 8

But this one:
perl $EDTA_raw --genome $TAIR10_mod -species others -type tir --overwrite 0 --threads 8

Gives the following error:


EDTA_raw: Check dependencies, prepare working directories.
Start to find LTR candidates.

Existing result file Arabidopsis_thaliana.TAIR10.dna.toplevel_14lines.fa.mod.LTR.raw.fa found! Will keep this file without rerunning this module.
 Please specify -overwrite 1 if you want to rerun this module.
Finish finding LTR candidates.
Start to find TIR candidates.
 Identify TIR candidates from scratch.

Species: others
dirname: missing operand
Try 'dirname --help' for more information.
Can't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.fa: No such file or directory at EDTA/util/rename_tirlearner.pl line 18.
Warning: LOC list Arabidopsis_thaliana.TAIR10.dna.toplevel_14lines.fa.mod.TIR.ext30.list is empty.
Error: Error while loading sequenceCan't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.gff3: No such file or directory.
Warning: The TIR result file has 0 bp!

Any suggestions?
Thank you!

All transposons "unknown"

Hello,
In the file.fa.EDTA.TElib.fa, virtually all transposons are labelled as "unknown":
16712 are unknown
48 are Gypsy
Assuming my study organism does not have completely unusual transposons, are these the kind of statistics one should expect?

Thank you

Identifying flanking LTR pairs

Hello,

I've used EDTA to successfully annotate and mask my plant genome (via RepeatMasker). However, I am also interested in the actual flanking LTR pairs for each LTR retrotransposon.

I know that LTR_FINDER and LTRharvest report these on their own. By running them individually on a segment of my genome, I can regenerate only some of these pairs (maybe less than 10% of the total unique types found by EDTA). Furthermore, many of them do not match the positions reported by the full EDTA pipeline.

What would be the best way to find the corresponding LTR pairs for each LTR subfamily reported?

Much appreciated,

Bryan

MITE-Hunter produces no results

Shujun,

Thank you for updating EDTA. I am using v1.3 on a maize genome, and the MITE step took a long time (~11 days). The problem is that no raw MITE sequences were output after the TIR and MITE runs. The run is now at the Helitron step. I will post an update after the run is finished.

-Sanzhen

EDTA run on a large genome

Hello,

I'd like to try out the EDTA pipeline to construct a repeat library for a large (20 Gbp) genome assembly. Would you expect this to be scalable to a genome of this size? Would it be possible to partition the genome and run EDTA separately on each partition of the assembly?

Any tips or guidance would be much appreciated.
Thank you!
Lauren

Combining existing TE library with EDTA results

I think I closed this a bit too early; I do have a question that isn't discussed in #8. If we plan on including homology-based TEs from RepBase or Dfam as well as the structure-based TEs from EDTA, do you suggest including the RepBase/Dfam libraries in the -curatedlib option of the EDTA run? Or should we run EDTA and then concatenate with RepBase/Dfam results?

Originally posted by @Neato-Nick in #18 (comment)

Specifying RepeatMasker query species

Hi,

I'm annotating a genome pretty distant from Homo sapiens. Checking the RM_ output folder, it looks like the call to RepeatMasker just queries this as the default species ("The query species was assumed to be homo" in the RM_/.fasta.tbl output in the *.fasta.EDTA.final/ folder). Is there any way to change this in my call to EDTA so I can most effectively use a homology-based TE calling method? Alternatively, I could just run RepeatMasker myself at the end using *.fasta.EDTA.TElib.fa as a custom library.

Job is always killed

Hi, Shujun,

I am testing edta on our school's computer. However, the job is always killed. Here is my script:
module load edta/20190108
module load ltrretriever/1.6

EDTA.pl -genome Zm-I-REFERENCE-FL-1.0.fa -species Maize -threads 2

Here is the error message:
########################################################

Extensive de-novo TE Annotator (EDTA) v1.3
Shujun Ou ([email protected])

########################################################

Tue Aug 6 13:57:01 EDT 2019 Dependency checking:
All passed!
Tue Aug 6 13:57:13 EDT 2019 Obtain raw TE libraries using various structure-based programs:
sh: line 1: 32154 Killed /apps/edta/20190108/edta/bin/genometools-1.5.10/bin/gt suffixerator -db Zm-I-REFERENCE-FL-1.0.fa -indexname Zm-I-REFERENCE-FL-1.0.fa -
Can't locate object method "end" via package "Thread::Queue" at /apps/edta/20190108/edta/bin/LTR_FINDER_parallel/LTR_FINDER_parallel line 115, line 10732.
cat: Zm-I-REFERENCE-FL-1.0.fa.finder.combine.scn: No such file or directory
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Ignoring FASTA modifier(s) found because the input was not expected to have any.
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Ignoring FASTA modifier(s) found because the input was not expected to have any.
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Ignoring FASTA modifier(s) found because the input was not expected to have any.
grep: Zm-I-REFERENCE-FL-1.0.fa.retriever.scn: No such file or directory
Argument "" isn't numeric in numeric gt (>) at /apps/edta/20190108/edta/bin/LTR_retriever/LTR_retriever line 355.

ERROR: No candidate is found in the file(s) you specified.

cp: cannot stat ‘Zm-I-REFERENCE-FL-1.0.fa.LTRlib.fa’: No such file or directory
cp: cannot stat ‘Zm-I-REFERENCE-FL-1.0.fa.LTRlib.fa’: No such file or directory
Error: LTR results not found!

ERROR: Raw LTR results not found in Zm-I-REFERENCE-FL-1.0.fa.EDTA.raw/Zm-I-REFERENCE-FL-1.0.fa.LTR.raw.fa at /apps/edta/20190108/edta/EDTA.pl line 170.
slurmstepd: error: Detected 1 oom-kill event(s) in step 40042225.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

I found that Zm-I-REFERENCE-FL-1.0.fa.harvest.scn and Zm-I-REFERENCE-FL-1.0.fa.rawLTR.scn in the LTR folder are empty.

Looking forward to your reply!

Best,

Ying

ERROR: Stage 1 library not found in chr1.fa.mod.EDTA.combine/chr1.fa.mod.LTR.TIR.Helitron.fa.stg1 at EDTA.pl line 384.

Hello,

I am trying to use EDTA to annotate an avian genome (as a test, I am running it on a single chromosome), but I keep running into an error.
I installed it with conda, following the script you have on GitHub.
Here is a copy of what I execute to run your script:

PATH=$PATH:/home/tkastylevsky/EDTA
cd
cd /home/tkastylevsky/FASTA_files/EDTA/gallus_gallus/chr1/
EDTA.pl -genome chr1.fa -anno 1  -force 1

(I tried adding -force 1 based on a solved issue on this GitHub, but it didn't help.)

And this is what I get (part of the output was in French, sorry; feel free to ask me if anything is unclear, but at first glance everything seemed roughly understandable to me):

########################################################

Extensive de-novo TE Annotator (EDTA) v1.7.6
Shujun Ou ([email protected])

########################################################

Wednesday 29 January 2020, 18:08:10 (UTC+0100) Dependency checking:
All passed!
Wednesday 29 January 2020, 18:08:20 (UTC+0100) Obtain raw TE libraries using various structure-based programs:
At least 1 parameter mandatory:

  1. Input fasta file: --genome

Obtain raw TE libraries using various structure-based programs
perl EDTA_raw.pl [options]
--genome [File] The genome FASTA
--species [rice|maize|others] Specify the species for identification of TIR candidates. Default: others
--type [ltr|tir|helitron|all] Specify which type of raw TE candidates you want to get. Default: all
--overwrite [0|1] If previous results are found, decide to overwrite (1, rerun) or not (0, default).
--threads|-t [int] Number of theads to run this script. Default: 4
--help|-h Display this help info

cat: chr1.fa.mod.LTR.intact.fa: No such file or directory
cat: chr1.fa.mod.TIR.intact.fa: No such file or directory
cat: chr1.fa.mod.Helitron.intact.fa: No such file or directory
cat: chr1.fa.mod.LTR.intact.fa.gff3: No such file or directory
cat: chr1.fa.mod.TIR.intact.fa.gff: No such file or directory
cat: chr1.fa.mod.Helitron.intact.fa.gff: No such file or directory

perl bed2gff.pl EDTA.TE.combo.bed

mv: cannot stat 'chr1.fa.mod.EDTA.intact.bed.gff': No such file or directory
cp: cannot stat 'chr1.fa.mod.EDTA.intact.gff': No such file or directory
Wednesday 29 January 2020, 18:08:20 (UTC+0100) Obtain raw TE libraries finished.
All intact TEs found by EDTA:
chr1.fa.mod.EDTA.intact.fa
chr1.fa.mod.EDTA.intact.gff

Wednesday 29 January 2020, 18:08:20 (UTC+0100) Perform EDTA advcance filtering for raw TE candidates and generate the stage 1 library:

Genome file chr1.fa.mod not exists!

Perform EDTA basic and advcanced filterings for raw TE candidates and generate the stage 1 library
perl EDTA_processF.pl [options]
-genome [File] The genome FASTA
-ltr [File] The raw LTR library FASTA
-tir [File] The raw TIR library FASTA
-helitron [File] The raw Helitron library FASTA
-mindiff_ltr [float] The minimum fold difference in richness between LTRs and contaminants (default: 1)
-mindiff_tir [float] The minimum fold difference in richness between TIRs and contaminants (default: 1)
-mindiff_hel [float] The minimum fold difference in richness between Helitrons and contaminants (default: 4)
-repeatmasker [path] The directory containing RepeatMasker (default: read from ENV)
-blast [path] The directory containing Blastn (default: read from ENV)
-protlib [File] Protein-coding aa sequences to be removed from TE candidates. (default lib: alluniRefprexp082813 (plant))
You may use uniprot_sprot database available from here:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/
-threads|-t [int] Number of theads to run this script
-help|-h Display this help info

ERROR: Stage 1 library not found in chr1.fa.mod.EDTA.combine/chr1.fa.mod.LTR.TIR.Helitron.fa.stg1 at /home/tkastylevsky/EDTA/EDTA.pl line 384.

I know from other annotation methods (RepeatModeler) that there are LTRs, TIRs, and Helitrons on this chromosome.

Thank you in advance,

IndexError: list index out of range

Dear Shujun,

When I run EDTA with EDTA/EDTA.pl -genome non-redundant.shortname.fa -species others -step all -t 28, I get the following error while identifying TIR candidates:

EDTA/bin/TIR-Learner2.4/Module2/RunGRF.py", line 79, in <module>
    if (len(str(records[0].seq))>int(length)+500):
IndexError: list index out of range

How can I fix this error? Thank you!
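
A note on the traceback: `records[0]` raises `IndexError` when `records` is an empty list, which would be the case if the input read at that point contained no sequences. How RunGRF.py builds `records` is not shown here, so this is only a sketch of the general guard, using plain strings as stand-ins for sequence records; `first_exceeds` is a hypothetical name.

```python
def first_exceeds(sequences, min_length):
    """Return True only if `sequences` is non-empty and its first item is longer than `min_length`."""
    if not sequences:  # indexing [0] on an empty list would raise IndexError
        return False
    return len(sequences[0]) > min_length


# Hypothetical values standing in for `records` and `int(length) + 500`.
print(first_exceeds([], 5000 + 500))
print(first_exceeds(["ACGT" * 2000], 5000 + 500))
```

Checking whether the file being read at this step is empty would tell whether this is what is happening.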

Use of uninitialized value within @ARGV

Dear Shujun
I managed to generate the raw files for LTR, TIR, MITE (Copy of TIR) and Helitrons. I am getting the following error while running EDTA_processF.pl.

/usr/local_sbs/source_files/EDTA/EDTA_processF.pl -genome HaploidAssemblyPilonPolishedCleaned.fasta -ltr HaploidAssemblyPilonPolishedCleaned.fasta.EDTA.raw/HaploidAssemblyPilonPolishedCleaned.fasta.LTR.raw.fa -tir HaploidAssemblyPilonPolishedCleaned.fasta.EDTA.raw/HaploidAssemblyPilonPolishedCleaned.fasta.TIR.raw.fa -helitron HaploidAssemblyPilonPolishedCleaned.fasta.EDTA.raw/HaploidAssemblyPilonPolishedCleaned.fasta.Helitron.raw.fa -mite HaploidAssemblyPilonPolishedCleaned.fasta.EDTA.raw/HaploidAssemblyPilonPolishedCleaned.fasta.MITE.raw.fa

Use of uninitialized value within @ARGV in pattern match (m//) at /usr/local/local_sbs/source_files/EDTA/util/cleanup_nested.pl line 41.
Use of uninitialized value $blastplus in string eq at /usr/local/local_sbs/source_files/EDTA/util/cleanup_nested.pl line 49.

Could you help me with this?

The picture shows the final files that were generated.

[Image: Screenshot 2019-09-18 at 5 09 03 PM]

RepeatMasker Classification by EDTA lib

Hi Shujun,

I used the genome.fa.EDTA.TElib.fa produced by EDTA.pl as the library to run RepeatMasker, but the resulting classification only has LTR elements and DNA elements, without a more specific classification (such as LTR/Copia). How can I get a more detailed repeat classification from RepeatMasker? Do I need to run RepeatMasker in homo mode (set -species), then combine the two libraries as the final result?

Here is the command and result.

The first 10 lines of genome.fa.EDTA.TElib.fa

>TE_00000000#DNA/DTH
>TE_00000001#DNA/Helitron
>TE_00000002#DNA/DTC
>TE_00000003#DNA/Helitron
>TE_00000004#DNA/DTT
>TE_00000005#DNA/Helitron
>TE_00000006#DNA/Helitron
>TE_00000007#DNA/Helitron
>TE_00000008#DNA/DTT
>TE_00000009#DNA/Helitron
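
To see which classifications the library itself carries, independent of the RepeatMasker summary table, the headers can be tallied directly. This is a minimal sketch that assumes every header follows the `>name#Class/Superfamily` shape shown above; `tally_classifications` is a hypothetical helper, not an EDTA utility.

```python
from collections import Counter


def tally_classifications(fasta_lines):
    """Count entries per classification in headers shaped like '>name#Class/Superfamily'."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        parts = line[1:].split()
        header = parts[0] if parts else ""  # drop any trailing description
        _, _, classification = header.partition("#")
        counts[classification or "unlabelled"] += 1
    return counts


# The ten headers listed above.
headers = [
    ">TE_00000000#DNA/DTH", ">TE_00000001#DNA/Helitron", ">TE_00000002#DNA/DTC",
    ">TE_00000003#DNA/Helitron", ">TE_00000004#DNA/DTT", ">TE_00000005#DNA/Helitron",
    ">TE_00000006#DNA/Helitron", ">TE_00000007#DNA/Helitron", ">TE_00000008#DNA/DTT",
    ">TE_00000009#DNA/Helitron",
]
for classification, n in tally_classifications(headers).most_common():
    print(classification, n)
```

For the whole library, passing an open file handle on genome.fa.EDTA.TElib.fa instead of the list should work the same way, since the function only iterates over lines.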

RepeatMasker

RepeatMasker -pa 24 -lib genome.fa.EDTA.TElib.fa -dir ./ -xsmall -gff -e ncbi -q -no_is -norna -nolow -div 40 -cutoff 225 genome.fa

==================================================
file name: genome.fa
sequences:           125
total length:  336324563 bp  (336315300 bp excl N/X-runs)
GC level:         33.22 %
bases masked:  189970773 bp ( 56.48 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:                0            0 bp    0.00 %
      ALUs            0            0 bp    0.00 %
      MIRs            0            0 bp    0.00 %

LINEs:                0            0 bp    0.00 %
      LINE1           0            0 bp    0.00 %
      LINE2           0            0 bp    0.00 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:     97559     83216326 bp   24.74 %
      ERVL            0            0 bp    0.00 %
      ERVL-MaLRs      0            0 bp    0.00 %
      ERV_classI      0            0 bp    0.00 %
      ERV_classII     0            0 bp    0.00 %

DNA elements:    203839     83514886 bp   24.83 %
     hAT-Charlie      0            0 bp    0.00 %
     TcMar-Tigger     0            0 bp    0.00 %

Unclassified:    123082     29657324 bp    8.82 %

Total interspersed repeats:196388536 bp   58.39 %


Small RNA:            0            0 bp    0.00 %

Satellites:           0            0 bp    0.00 %
Simple repeats:       0            0 bp    0.00 %
Low complexity:       0            0 bp    0.00 %

==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


The query species was assumed to be homo
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127

run with rmblastn version 2.6.0+
The query was compared to classified sequences in "genome.fa.EDTA.TElib.fa"

Cheers,
Zhigui

EDTA v1.5 fail on a small dataset

Shujun, I tested v1.5 with a small data set. It showed the following errors:

########################################################

Extensive de-novo TE Annotator (EDTA) v1.5
Shujun Ou ([email protected])

########################################################

Mon Aug 26 12:33:52 CDT 2019 Dependency checking:
All passed!
Mon Aug 26 12:33:57 CDT 2019 Obtain raw TE libraries using various structure-based programs:
Mon Aug 26 12:33:57 CDT 2019 EDTA_raw: Check files and dependencies, prepare working directories.

Mon Aug 26 12:33:57 CDT 2019 Start to find LTR candidates.

Mon Aug 26 12:33:57 CDT 2019 Identify LTR retrotransposon candidates from scratch.

    Usage: perl cleanup.pl -f sample.fa [options] > sample.cln.fa 
Options:
	-misschar	n	Define the letter representing unknown sequences; case insensitive; default: n
	-Nscreen	[0|1]	Enable (1) or disable (0) the -nc parameter; default: 1
	-nc		[int]	Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
	-nr		[0-1]	Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
	-minlen		[int]	Minimum sequence length filter after clean up; default: 100 (bp)
	-cleanN		[0|1]	Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
	-trf		[0|1]	Enable (1) or disable (0) tandem repeat finder (trf); default: 1
	-trf_path	path	Path to the trf program

cp: cannot stat ‘TF05-1v012.fasta.mod.retriever.scn.adj’: No such file or directory
cp: cannot stat ‘TF05-1v012.fasta.LTRlib.fa’: No such file or directory
cp: cannot stat ‘TF05-1v012.fasta.LTRlib.fa’: No such file or directory
Error: LTR results not found!

ERROR: Raw LTR results not found in TF05-1v012.fasta.EDTA.raw/TF05-1v012.fasta.LTR.raw.fa at /homes/liu3zhen/.conda/envs/EDTA3/EDTA/EDTA.pl line 176.

Originally posted by @liu3zhenlab in #12 (comment)

Wildcard in `TIR-Learner2.4.sh` expands to too many elements and causes `cp` to fail

Hello again :-), I have run into the following error:

243: EDTA/bin/TIR-Learner2.4/TIR-Learner2.4.sh: cp: Argument list too long

The line in the script that raised the error is

cp -r $genomeName/$genomeName* temp/

where the variable genomeName is statically assigned the value TIR-Learner at the very beginning of the script.

The temp/ directory is now empty, and I am not sure whether that is a problem.
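
"Argument list too long" means the expanded wildcard exceeded the system limit on the size of a single command line, so one way around it is to copy the matches one at a time instead of handing them all to one `cp` call. The following is a rough sketch of that idea, not a patch for TIR-Learner2.4.sh; `copy_matching` is a hypothetical helper, and `dirs_exist_ok` needs Python 3.8 or newer.

```python
import glob
import os
import shutil


def copy_matching(pattern, dest_dir):
    """Copy every path matching `pattern` into `dest_dir`, one entry at a time."""
    os.makedirs(dest_dir, exist_ok=True)
    copied = 0
    # iglob yields matches lazily, so no long argument list is ever built.
    for path in glob.iglob(pattern):
        target = os.path.join(dest_dir, os.path.basename(path))
        if os.path.isdir(path):
            shutil.copytree(path, target, dirs_exist_ok=True)  # Python 3.8+
        else:
            shutil.copy2(path, target)
        copied += 1
    return copied
```

With genomeName set to TIR-Learner, the equivalent of the failing line would be `copy_matching("TIR-Learner/TIR-Learner*", "temp")`, which returns the number of entries copied.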

Issue with identifying TIR candidates

Hello,
I am trying to run EDTA in a conda environment. The setup completed without problems, but at the TIR identification step I get the following error:

Tue Jan 28 12:37:40 CET 2020    Finish finding LTR candidates.

Tue Jan 28 12:37:40 CET 2020    Start to find TIR candidates.

Tue Jan 28 12:37:40 CET 2020    Identify TIR candidates from scratch.

Species: others


/mnt/vol2/conda/miniconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])

terminate called after throwing an instance of 'std::system_error'
  what():  terminate called after throwing an instance of 'terminate called after throwing an instance of 'Resource temporarily unavailablestd::system_errorstd::system_error
'
'
  what():  Resource temporarily unavailable
terminate called after throwing an instance of 'std::system_error'

Any clues about the possible solution to this error?
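
One possibility worth ruling out (an assumption on my side, not something the log confirms): `std::system_error` with "Resource temporarily unavailable" is commonly seen when a program cannot start more threads because the per-user process/thread limit has been reached. A small sketch for printing that limit on a Unix system; the helper names are hypothetical.

```python
import os
import resource  # available on Unix-like systems only


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def process_limits():
    """Return the (soft, hard) limits on processes/threads for the current user."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = process_limits()
    print("CPU cores visible:", os.cpu_count())
    print("max user processes (soft/hard):", soft, "/", hard)
```

If the soft limit is low compared with the number of threads being requested, running with fewer threads or asking the administrator to raise the limit would be things to try.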

Suggest adding `-n EDTA` to `conda install`

My conda version is Miniconda2 4.4.10, whose base environment uses Python 2. Running conda install python=3.6 damaged the base environment, and conda then failed with the following import error:

$ conda -h
Traceback (most recent call last):
  File "~/conda/bin/conda", line 7, in <module>
    from conda.cli import main
ImportError: No module named conda.cli

So it is better not to install python=3.6 in conda's base environment.

Is MITE-Hunter still there?

Hello,
The bioRxiv paper describes the use of MITE-Hunter, but the new figure on GitHub suggests it's not there any more. If I understand correctly, it has been disabled for the moment, right?

Short contig issue of TIR-Learner.sh

Hello,
I am rerunning the latest push in the same folder and I get errors. This is a follow-up of issue #10. Here are the STDOUT and STDERR:

./EDTA/EDTA.pl -genome Avaga.Masurca.Graal.min5500.fa -species others -step all -t 48 2>&1 |tee EDTA.log

########################################################
##### Extensive de-novo TE Annotator (EDTA) v1.5    ####
##### Shujun Ou ([email protected])             ####
########################################################



Mo Aug 19 10:52:23 CEST 2019	Dependency checking:
				All passed!
Mo Aug 19 10:52:33 CEST 2019	Obtain raw TE libraries using various structure-based programs: 
Mo Aug 19 10:52:33 CEST 2019	EDTA_raw: Check files and dependencies, prepare working directories.

Mo Aug 19 10:52:33 CEST 2019	Start to find LTR candidates.

Mo Aug 19 10:52:33 CEST 2019	Existing result file Avaga.Masurca.Graal.min5500.fa.LTRlib.fa found! Will keep this file without rerunning this module.
	Please specify -overwrite 1 if you want to rerun this module.

Mo Aug 19 10:52:33 CEST 2019	Finish finding LTR candidates.

Mo Aug 19 10:52:33 CEST 2019	Start to find TIR candidates.

Mo Aug 19 10:52:33 CEST 2019	Identify TIR candidates from scratch.

Species: others
/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/TIR-Learner.sh: 95: [: others: unexpected operator
/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/TIR-Learner.sh: 95: [: others: unexpected operator
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/Module3_New/getDataset2.py", line 107, in Predict
    model = load_model(path+"/Module3_New/"+'CNN0724.h5')
  File "/home/urbe/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/saving.py", line 249, in load_model
    optimizer_config, custom_objects=custom_objects)
  File "/home/urbe/.local/lib/python3.6/site-packages/tensorflow/python/keras/optimizers.py", line 838, in deserialize
    printable_module_name='optimizer')
  File "/home/urbe/.local/lib/python3.6/site-packages/tensorflow/python/keras/utils/generic_utils.py", line 194, in deserialize_keras_object
    return cls.from_config(cls_config)
  File "/home/urbe/.local/lib/python3.6/site-packages/tensorflow/python/keras/optimizers.py", line 159, in from_config
    return cls(**config)
  File "/home/urbe/.local/lib/python3.6/site-packages/tensorflow/python/keras/optimizers.py", line 471, in __init__
    super(Adam, self).__init__(**kwargs)
  File "/home/urbe/.local/lib/python3.6/site-packages/tensorflow/python/keras/optimizers.py", line 68, in __init__
    'passed to optimizer: ' + str(k))
TypeError: Unexpected keyword argument passed to optimizer: name
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/Module3_New/getDataset2.py", line 131, in <module>
    d = pool.map(Predict,files)
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
TypeError: Unexpected keyword argument passed to optimizer: name
cat: '*-+-DTA.fa': No such file or directory
cat: '*-+-DTC.fa': No such file or directory
cat: '*-+-DTH.fa': No such file or directory
cat: '*-+-DTM.fa': No such file or directory
cat: '*-+-DTT.fa': No such file or directory
cat: '*-+-NonTIR.fa': No such file or directory
cat: '*-+-*-+-*.gff3': No such file or directory
rm: cannot remove '*-+-*-+-*.gff3': No such file or directory
Traceback (most recent call last):
  File "/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/Module3_New/CombineAll.py", line 90, in <module>
    keep=removeIRFhomo("%s.gff3"%(genome_Name+spliter+dataset),remove,"%sClean.gff3"%(genome_Name+spliter+dataset+spliter))
  File "/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/Module3_New/CombineAll.py", line 76, in removeIRFhomo
    f=pd.read_csv(file,header=None,sep="\t")
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Traceback (most recent call last):
  File "/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/Module3/GetAllSeq.py", line 62, in <module>
    file=open(f,"r+")
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
mv: cannot stat 'TIR-Learner/*FinalAnn.gff3': No such file or directory
mv: cannot stat 'TIR-Learner/*FinalAnn.fa': No such file or directory
Can't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.fa: No such file or directory at /media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/util/rename_tirlearner.pl line 18.
Warning: LOC list Avaga.Masurca.Graal.min5500.fa.TIR.ext30.list is empty.
Warning: The TIR result file has 0 bp!

Mo Aug 19 10:52:56 CEST 2019	Start to find MITE candidates.

Mo Aug 19 10:52:56 CEST 2019	Existing result file Avaga.Masurca.Graal.min5500.fa.MITE.raw.fa found! Will keep this file without rerunning this module.
	Please specify -overwrite 1 if you want to rerun this module.

Mo Aug 19 10:52:56 CEST 2019	Finish finding MITE candidates.

Mo Aug 19 10:52:56 CEST 2019	Start to find Helitron candidates.

Mo Aug 19 10:52:56 CEST 2019	Existing result file Avaga.Masurca.Graal.min5500.fa.Helitron.raw.fa found! Will keep this file without rerunning this module.
	Please specify -overwrite 1 if you want to rerun this module.

Mo Aug 19 10:52:56 CEST 2019	Finish finding Helitron candidates.

TIR-Learner fails due to the lack of intact elements in some sequences

Hello (it's me again, sorry),

Following up on issue #14:
I have a machine where I thought EDTA was running fine, but whether it works seems to depend on the genome FASTA provided. Below is what happens with a FASTA file that seems to cause an error.
I have removed all scaffolds shorter than 5500 bp. The RepeatMasker and RepeatModeler used are not the ones from conda.

Mon Oct  7 20:13:39 CEST 2019	Dependency checking:
				All passed!
Mon Oct  7 20:14:01 CEST 2019	Obtain raw TE libraries using various structure-based programs: 
Mon Oct  7 20:14:01 CEST 2019	EDTA_raw: Check files and dependencies, prepare working directories.

Mon Oct  7 20:14:01 CEST 2019	Start to find LTR candidates.

Mon Oct  7 20:14:01 CEST 2019	Identify LTR retrotransposon candidates from scratch.

Mon Oct  7 20:21:12 CEST 2019	Finish finding LTR candidates.

Mon Oct  7 20:21:12 CEST 2019	Start to find TIR candidates.

Mon Oct  7 20:21:12 CEST 2019	Identify TIR candidates from scratch.

Species: others
rm: cannot remove './TIR-Learner/*': No such file or directory
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/mnt/sdc1/Alessandro/TR_2013/EDTA/bin/TIR-Learner1.23/Module3_New/getDataset2.py", line 109, in Predict
    predicted_labels = model.predict(np.stack(prefeature))
  File "<__array_function__ internals>", line 6, in stack
  File "/home/lege/.local/lib/python3.6/site-packages/numpy/core/shape_base.py", line 421, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/sdc1/Alessandro/TR_2013/EDTA/bin/TIR-Learner1.23/Module3_New/getDataset2.py", line 130, in <module>
    d = pool.map(Predict,files)
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
ValueError: need at least one array to stack
cat: '*-+-DTA.fa': No such file or directory
cat: '*-+-DTC.fa': No such file or directory
cat: '*-+-DTH.fa': No such file or directory
cat: '*-+-DTM.fa': No such file or directory
cat: '*-+-DTT.fa': No such file or directory
cat: '*-+-NonTIR.fa': No such file or directory
cat: '*-+-*-+-*.gff3': No such file or directory
rm: cannot remove '*-+-*-+-*.gff3': No such file or directory
Traceback (most recent call last):
  File "/mnt/sdc1/Alessandro/TR_2013/EDTA/bin/TIR-Learner1.23/Module3_New/CombineAll.py", line 90, in <module>
    keep=removeIRFhomo("%s.gff3"%(genome_Name+spliter+dataset),remove,"%sClean.gff3"%(genome_Name+spliter+dataset+spliter))
  File "/mnt/sdc1/Alessandro/TR_2013/EDTA/bin/TIR-Learner1.23/Module3_New/CombineAll.py", line 76, in removeIRFhomo
    f=pd.read_csv(file,header=None,sep="\t")
  File "/home/lege/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/lege/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/home/lege/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/home/lege/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/lege/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1917, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/mnt/sdc1/Alessandro/TR_2013/EDTA/bin/TIR-Learner1.23/Module3/GetAllSeq.py", line 32, in GetListFromFile
    f=open(file,"r+")
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/sdc1/Alessandro/TR_2013/EDTA/bin/TIR-Learner1.23/Module3/GetAllSeq.py", line 63, in <module>
    pool.map(GetListFromFile,fileList) #shujun
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
mv: cannot stat 'TIR-Learner/*FinalAnn.gff3': No such file or directory
mv: cannot stat 'TIR-Learner/*FinalAnn.fa': No such file or directory
Can't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.fa: No such file or directory at /mnt/sdc1/Alessandro/TR_2013/EDTA/util/rename_tirlearner.pl line 18.
Warning: LOC list long_scaffolds.fa.TIR.ext30.list is empty.
Warning: The TIR result file has 0 bp!

Mon Oct  7 20:57:12 CEST 2019	Start to find MITE candidates.

Mon Oct  7 20:57:12 CEST 2019	Identify MITE candidates from scratch.

Mon Oct  7 20:57:12 CEST 2019	Warning: Because MITE-Hunter is too slow and only contribute limited new TIR candidates, it is taken down temporary until a better solution is found.

As a temporary fix, the TIR-Learner is used to mock the MITE-Hunter result. Please run -type tir first.

Error: MITE results not found!

ERROR: Raw TIR results not found in long_scaffolds.fa.EDTA.raw/long_scaffolds.fa.TIR.raw.fa at ./EDTA/EDTA.pl line 177.

The FASTA file can be sent to you if you would like to investigate.
Thanks a lot
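
For readers who want to reproduce the preprocessing mentioned in this report (dropping scaffolds shorter than 5500 bp), the filter can be sketched in plain Python. This is only an illustrative sketch and not part of EDTA; the function names `read_fasta` and `filter_by_length` are assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(lines: Iterable[str], min_len: int = 5500) -> List[Tuple[str, str]]:
    """Keep only sequences with at least min_len bases (hypothetical helper)."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_len]
```

With a real genome, the lines would come from an open file handle, e.g. `filter_by_length(open("genome.fa"), 5500)`.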

"The RMblast engine is not installed in RepeatMasker!" When specifying conda env installation --prefix

Hi Shujun

I have been trying to install EDTA on my server but I have an annoying situation of a storage quota on my home directory meaning that the default location for the conda env isn't big enough to complete the installation. I am trying to get around it using:

conda create --prefix /scratch/djeffrie/EDTAenv

The installation seems to work fine. However when I run the pipeline I get the error:

The RMblast engine is not installed in RepeatMasker!

I see some issues for LTR_retriever with the same error, but I can't figure out whether it's the same problem. I followed the suggestion here (oushujun/LTR_retriever#43) of running

RepeatMasker -e ncbi -q -pa 1 -no_is -norna -nolow dummy060817.fa.$rand -lib dummy060817.fa.$rand

but I didn't get the expected output relating to the taxonomy data file, I got the error

RepeatMasker::setspecies: Could not find user specified library dummy060817.fa..

Would you have any solutions for how to get around this? Perhaps it's a problem with using the --prefix argument? Or maybe just the server?

Best,

Dan

cannot stat 'ref.fa.LTR.intact.fa.gff3': No such file or directory

Hello,

I started the EDTA pipeline yesterday, and I am very excited. However, after 1 hour of run time I get an error saying that certain LTR files are not present. Do you know what is going on? I call the script as follows:
perl /data/modules/python/python-anaconda2-2019.10-EDTA/envs/EDTA/share/EDTA/EDTA.pl -genome ref.fa -species others -step all -curatedlib library7birds.fa -sensitive -repeatmasker /data/biosoftware/RepeatMasker/RepeatMasker/ 1 -anno 1 -evaluate 1 -t 15

The input library and genome are soft links (ln -s).

I then get the following error output:

perl rename_LTR.pl genome.fa target_sequence.fa LTR_retriever.defalse

cp: cannot stat 'ref.fa.LTR.intact.fa.gff3': No such file or directory
cp: cannot stat 'ref.fa.LTRlib.fa': No such file or directory
cp: cannot stat 'ref.fa.LTRlib.fa': No such file or directory
Error: LTR results not found!

ERROR: Raw LTR results not found in ref.fa.EDTA.raw/ref.fa.LTR.raw.fa at /data/modules/python/python-anaconda2-2019.10-EDTA/envs/EDTA/share/EDTA/EDTA.pl line 284.

libstdc++.so.6: version `GLIBCXX_3.4.20' not found

hi Shujun,

Unfortunately I'm still having trouble with this. Following on from my previous comment, I am now using my own installed version of RepeatMasker and everything seems to work until it gets to TIR-Learner, where I am now getting the error below.


Fri Aug 16 19:10:44 CEST 2019	Dependency checking:
		All passed!
Fri Aug 16 19:10:56 CEST 2019	Obtain raw TE libraries using various structure-based programs:
/stn4/djeffrie/EDTA/bin/TIR-Learner1.19/../GenericRepeatFinder/bin//grf-main: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /stn4/djeffrie/EDTA/bin/TIR-Learner1.19/../GenericRepeatFinder/bin//grf-main)
/stn4/djeffrie/EDTA/bin/TIR-Learner1.19/../GenericRepeatFinder/bin//grf-main: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /stn4/djeffrie/EDTA/bin/TIR-Learner1.19/../GenericRepeatFinder/bin//grf-main)

Apparently I don't have the correct GLIBCXX version?

Any ideas?

Best

Dan

Originally posted by @DanJeffries in #11 (comment)

How to speed up for big genome?

Hi dear Shujun,
I have set up the computing environment for EDTA.
However, I work on amphibian genomes, which are bigger than those of other animals. I have run EDTA_raw for TIR, LTR, and Helitron, for example: EDTA_raw.pl -genome frog1_genome.chromosome.fa -type tir -thrads 16. It has been running for 48 hours and has not finished yet.
Is there any way to speed up EDTA for big genomes?

Thank you for your attention and reply.

Zhangyi

Cannot install repeatmodeler or repeatmasker

Sorry this question may be irrelevant to EDTA. I am having problems installing repeatmodeler or repeatmasker. Could you please help me with this? Thanks. When I run "conda install -y -c bioconda repeatmodeler", the error messages look like this:

Collecting package metadata (current_repodata.json): done
Solving environment: failed with current_repodata.json, will retry with next repodata source.
Initial quick solve with frozen env failed. Unfreezing env and trying again.
Solving environment: failed with current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed
Initial quick solve with frozen env failed. Unfreezing env and trying again.
Solving environment: failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Package tk conflicts for:
python=3.6 -> tk[version='8.6.*|>=8.6.7,<8.7.0a0|>=8.6.8,<8.7.0a0']
Package libstdcxx-ng conflicts for:
python=3.6 -> libstdcxx-ng[version='>=7.2.0|>=7.3.0']
Package repeatscout conflicts for:
repeatmodeler -> repeatscout
Package perl-threaded conflicts for:
repeatmodeler -> perl-threaded
Package readline conflicts for:
python=3.6 -> readline[version='7.*|>=7.0,<8.0a0']
Package perl conflicts for:
repeatmodeler -> perl[version='5.22.0.*|>=5.26.2,<5.27.0a0']
Package pip conflicts for:
python=3.6 -> pip
Package recon conflicts for:
repeatmodeler -> recon
Package libffi conflicts for:
python=3.6 -> libffi[version='3.2.*|>=3.2.1,<4.0a0']
Package ncurses conflicts for:
python=3.6 -> ncurses[version='6.0.*|>=6.0,<7.0a0|>=6.1,<7.0a0']
Package zlib conflicts for:
python=3.6 -> zlib[version='>=1.2.11,<1.3.0a0']
Package xz conflicts for:
python=3.6 -> xz[version='>=5.2.3,<6.0a0|>=5.2.4,<6.0a0']
Package libgcc-ng conflicts for:
python=3.6 -> libgcc-ng[version='>=7.2.0|>=7.3.0']
Package trf conflicts for:
repeatmodeler -> trf
Package openssl conflicts for:
python=3.6 -> openssl[version='1.0.*|1.0.*,>=1.0.2l,<1.0.3a|>=1.0.2m,<1.0.3a|>=1.0.2n,<1.0.3a|>=1.0.2o,<1.0.3a|>=1.0.2p,<1.0.3a|>=1.1.1a,<1.1.2a|>=1.1.1c,<1.1.2a']
Package repeatmasker conflicts for:
repeatmodeler -> repeatmasker
Package perl-text-soundex conflicts for:
repeatmodeler -> perl-text-soundex
Package rmblast conflicts for:
repeatmodeler -> rmblast
Package sqlite conflicts for:
python=3.6 -> sqlite[version='>=3.20.1,<4.0a0|>=3.22.0,<4.0a0|>=3.23.1,<4.0a0|>=3.24.0,<4.0a0|>=3.25.2,<4.0a0|>=3.26.0,<4.0a0|>=3.29.0,<4.0a0']

ERROR: Raw LTR/TIR/Helitron results not found in *.EDTA.raw/

I ran the EDTA pipeline to identify TEs in about 100 fungal isolate genomes, all of them de novo assemblies. Around 70% of the isolates give good results with EDTA. The others fail with the following error, even though I have tried many times:

Mon Dec 9 17:13:26 EST 2019 EDTA_raw: Check files and dependencies, prepare working directories.

Mon Dec 9 17:13:26 EST 2019 Start to find LTR candidates.

Mon Dec 9 17:13:26 EST 2019 Identify LTR retrotransposon candidates from scratch.

awk: fatal: cannot open file `L009.fa.pass.list' for reading (No such file or directory)
Warning: LOC list - is empty.

Error: Error while loading sequencecp: cannot stat ‘L009.fa.LTRlib.fa’: No such file or directory
cp: cannot stat ‘L009.fa.LTRlib.fa’: No such file or directory
Error: LTR results not found!

ERROR: Raw LTR results not found in L009.fa.EDTA.raw/L009.fa.LTR.raw.fa at /home/AAFC-AAC/fuf/EDTA/EDTA.pl line 250.
Have you seen this error before?
Thanks,
Fuyou

number of used threads by LTR_FINDER

Hello, I am running the whole pipeline on a huge server. I specified 64 cores, but for the last 4 hours the program (LTR_FINDER) has been using only 6 threads at ~20% each, which amounts to roughly a single-core run. I wonder what might have gone wrong.

I installed EDTA using conda (following instructions from README) and run it as follows

perl EDTA/EDTA.pl -genome my_genome.fasta -species others -step all -anno 1 -threads 64

In htop the executed program looks like this:

.../LTR_FINDER_parallel -seq scaffolds.fasta -threads 64 -harvest_out -size 1000000 -time 300

Thanks for making EDTA, it was a good twist in a benchmarking paper :-). By the way, did you try to compare EDTA with PiRATE? It also seems like a quite comprehensive pipeline, but I couldn't find a comparison of the two.

TEsorter issue

Dear Shujun,

Thanks for developing EDTA. It's really helpful.
I am now running this pipeline for my genome but encounter an error:

2020-02-05 19:50:18,695 -INFO- generating gene anntations
Traceback (most recent call last):
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/bin/TEsorter", line 10, in <module>
    sys.exit(main())
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 976, in main
    pipeline(Args())
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 171, in pipeline
    for rc in Classifier(gff, db=args.hmm_database, fout=fc):
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 391, in classify
    for rc in self.parse():
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 380, in parse
    line = LTRgffLine(line)
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 609, in __init__
    super(LTRgffLine, self).__init__(line)
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 604, in __init__
    self.attributes = self.parse(self.attributes)
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 606, in parse
    return dict(kv.split('=') for kv in attributes.split(';'))
ValueError: dictionary update sequence element #0 has length 3; 2 is required
Warning...unknown stuff <

my command line is below: (EDTA v1.7.9)
EDTA.pl -genome $genome -species others -step all -overwrite 0 -cds $cds -sensitive 0 -anno 1 -evaluate 1 -threads $thread -repeatmasker $repeatMasker

I checked your code and guess this might be caused by cleanup CDS with TEsorter, but not sure. I already generate $genome.mod.MAKER.masked, $genome.mod.EDTA.TEanno.gff/sum results. Now evaluating the level of inconsistency is running.
Could you please help me figure it out? Thank you very much in advance.

Best regards,
Chengcheng
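
The `ValueError` in the traceback above is what Python raises when a `key=value` field contains a second `=`, so that `kv.split('=')` returns three items instead of two. The sketch below reproduces the message and shows one possible guard, splitting each pair only once. It is a minimal illustration, not the TEsorter fix; the attribute string and the helper name `parse_attributes` are hypothetical.

```python
def parse_attributes(attributes: str) -> dict:
    """Parse 'key=value;key=value' into a dict, splitting each pair only once.

    Hypothetical helper: empty fields are skipped; a field without '='
    would still raise ValueError.
    """
    pairs = (kv.split("=", 1) for kv in attributes.split(";") if kv)
    return dict(pairs)


# Hypothetical attribute string whose first value contains an extra '='.
example = "ID=chr1:100..200=LTR;Name=demo"

try:
    # Same expression as in the traceback: the first field splits into 3 parts.
    dict(kv.split("=") for kv in example.split(";"))
except ValueError as err:
    message = str(err)

parsed = parse_attributes(example)
print(message)
print(parsed)
```

Whether an extra `=` in the attribute column is really what triggered the error in this run cannot be confirmed from the log alone.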

Which lib should I use for Analysis about the animal genome?

Hi, dear professor Shujun,
I want to use EDTA for de novo TE prediction in some animal genomes. However, it looks like EDTA's library is for rice, and I can't find a parameter for specifying an animal library.
Can EDTA take an animal library? And how well does EDTA work on animal genomes?

ZhangYi.

About repeat elements in the results of EDTA!

Hello,
I tried a small dataset and got the following results:
Confusion matrix of BL06.R11.pilon.fasta.EDTA.TE.fa.stat for the all category

             DNA/DTC  DNA/DTH  DNA/DTM  LTR/Copia  LTR/unknown  MITE/DTM  Misclas_rate
DNA/DTC            7        0        0          0            0         0        0.0000
DNA/DTH            0     1163        1          0            0         0        0.0009
DNA/DTM            0        0     7936          0            3         1        0.0005
LTR/Copia          0        0        0        259            0         0        0.0000
LTR/unknown        1        1        4          0        25193         1        0.0003
MITE/DTM           0        0        2          0            0       168        0.0118

So my question is: can EDTA analyze repeat elements such as AT-rich regions, GC-rich regions, and short repeats like (AT)n?
Thanks,
Fuyou
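
The Misclas_rate column in the matrix above appears to equal the off-diagonal count of each row divided by the row total. That reading is inferred from the numbers in this post, not taken from the EDTA source, so treat the sketch below as an assumption-based check rather than the pipeline's actual calculation.

```python
labels = ["DNA/DTC", "DNA/DTH", "DNA/DTM", "LTR/Copia", "LTR/unknown", "MITE/DTM"]

# Counts copied from the confusion matrix in the post (rows follow `labels`).
matrix = [
    [7, 0, 0, 0, 0, 0],
    [0, 1163, 1, 0, 0, 0],
    [0, 0, 7936, 0, 3, 1],
    [0, 0, 0, 259, 0, 0],
    [1, 1, 4, 0, 25193, 1],
    [0, 0, 2, 0, 0, 168],
]


def misclassification_rates(rows):
    """Return (row total - diagonal) / row total for each row."""
    rates = []
    for i, row in enumerate(rows):
        total = sum(row)
        off_diagonal = total - row[i]
        rates.append(off_diagonal / total if total else 0.0)
    return rates


rates = misclassification_rates(matrix)
for label, rate in zip(labels, rates):
    print(f"{label}\t{rate:.4f}")
```

Rounded to four decimals, these values agree with the Misclas_rate column reported above.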

Keeping all full-length transposons?

Hello,
I might have a suggestion:
I was wondering whether it would be useful for the end user to be able to get a file with the coordinates of the transposons, for example if one is interested in looking at their positions in the genome.

Thanks for EDTA, it's a cool pipeline.

Pipeline failed at LTR_FINDER_parallel step

Dear Shujun,

I ran the EDTA pipeline v1.7.1 on a genome with the following command. It failed at the step that identifies LTR retrotransposon candidates from scratch.

perl /LabShares/Tools/EDTA/EDTA/EDTA.pl -genome DR_OL_ens90.fa -step all -cds DR_OL_cds_ens90.fa -sensitive 1 -anno 1
The STDERR showed an error:
Unsuccessful stat on filename containing newline at /LabShares/Tools/EDTA/EDTA/bin/LTR_FINDER_parallel/LTR_FINDER_parallel line 156, line 10314.

In the folder of LTR, a list of intermediate files have been generated:

alluniRefprexp082813.197723
alluniRefprexp082813.197723.phr
alluniRefprexp082813.197723.pin
alluniRefprexp082813.197723.psq
DR_OL_ens90.fa.finder.combine.scn
DR_OL_ens90.fa.harvest.scn
DR_OL_ens90.fa.list
DR_OL_ens90.fa.LTR.intact.fa
DR_OL_ens90.fa.LTR.intact.fa.ori
DR_OL_ens90.fa.LTR.intact.fa.ori.dusted
DR_OL_ens90.fa.LTR.intact.fa.ori.dusted.cleanup
DR_OL_ens90.fa.rawLTR.scn
Tpases020812DNA.197723
Tpases020812DNA.197723.phr
Tpases020812DNA.197723.pin
Tpases020812DNA.197723.psq
Tpases020812LINE.197723
Tpases020812LINE.197723.phr
Tpases020812LINE.197723.pin
Tpases020812LINE.197723.psq

All DR_OL_ens90.fa.LTR.intact* files are empty. Could you give me some suggestions to fix this?

Here I have the full STDERR attached for your reference.

Thank you so much.

Best,
Yixuan
AN_EDTA_DR_OL_ens90.txt

TIRlearner crashes Out of Memory

Hi !
I run into a memory issue trying to run TIR-Learner. Did you already run into it? And what can I do to solve this issue?

Here are the commands/outputs that I get:
$ nohup perl ../EDTA/EDTA_raw.pl -genome F2.genome.fasta -species Maize -type tir -threads 20 > essai_tir.out 2> essai_tir.err &
$ cat essai_tir.err
nohup: ignoring input
Wed Jan 15 19:24:43 CET 2020 EDTA_raw: Check files and dependencies, prepare working directories.

Wed Jan 15 19:24:43 CET 2020 Start to find TIR candidates.

ln: failed to create symbolic link 'F2.genome.fasta': Input/output error
Wed Jan 15 19:24:43 CET 2020 Identify TIR candidates from scratch.

Species: Maize
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!

ERROR: no LOC list!

Usage: perl call_seq_by_list.pl MSU_format_list -C genome.fasta -out file.fa [options]
	itself	Output sequence specified in the list (default).
	up_[int]	Output sequences [int] bp upstream of the region.
	down_[int]	Output sequences [int] bp downstream of the region.
	-C [fasta]	A fasta file you want to extract sequence from.
	-out	Output file name. Default: MSU_format_list.fa
	-header	[0|1]	Output sequence with (1, default) or without (0) sequence header.
	-rmvoid	[0|1]	Remove empty sequence (1, default) or retain empty sequence (0) in output.
	-ex		Exclude sequence specified by the list. Default: Output sequence specified by the list.
	-cov	[0-1]	Work with -ex. If excluding too much of the target (default 1), discard the entire sequence.
	-purge	[0|1]	Work with -ex. Switch on=1/off=0(default) to clean up aligned region and joint unaligned sequences.
Example: 
	Call sequence of upper 2000 bp region in the list and output to result.fa
		perl call_seq_by_list.pl array_list -C rice.fasta up_2000 -out result.fa

Out of memory!
Out of memory!
Out of memory!
Out of memory!

python2 was called when running cleanup_TE.pl

Dear Shujun,

When the -cds option is added, it seems like EDTA switches to use python2 for TEsorter. See cleanup_TE.pl line 36.

python2 $TEsorter $cds -p $threads;

This caused many incompatibility issues between python2 and python3, for example with Biopython.

I installed a separate python2 conda env for TEsorter, but still got an error in the test:

(py2) [qiushi.li@itbioyeaman03 test]$ python ../TEsorter.py rice6.9.5.liban
2019-11-27 07:08:19,201 -WARNING- No module named drmaa
grid computing is not available
2019-11-27 07:08:19,203 -INFO- VARS: {'seq_type': 'nucl', 'min_coverage': 20, 'disable_pass2': False, 'tmp_dir': './tmp', 'processors': 4, 'sequence': 'rice6.9.5.liban', 'no_library': False, 'p2_identity': 80.0, 'no_cleanup': False, 'force_write_hmmscan': False, 'p2_length': 80.0, 'prefix': 'rice6.9.5.liban.rexdb', 'max_evalue': 0.001, 'p2_coverage': 80.0, 'pass2_rule': '80-80-80', 'hmm_database': 'rexdb', 'no_reverse': False}
2019-11-27 07:08:19,203 -INFO- checking dependencies:
2019-11-27 07:08:19,213 -INFO- hmmer 3.2.1 OK
Traceback (most recent call last):
  File "../TEsorter.py", line 974, in <module>
    pipeline(Args())
  File "../TEsorter.py", line 116, in pipeline
    Dependency().check_blast()
  File "../TEsorter.py", line 920, in check_blast
    version = self.check_blast_version(program)
  File "../TEsorter.py", line 939, in check_blast_version
    version = re.compile(r'blast\S* ([\d.+]+)').search(out).groups()[0]
AttributeError: 'NoneType'

Best,
Qiushi
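
The last frame of the traceback above calls `.groups()` directly on the result of `re.search`. In Python, `search` returns `None` when the pattern does not match, which is consistent with the truncated `AttributeError: 'NoneType'` message, for instance if the BLAST program printed nothing recognizable. The sketch below reuses the regular expression from the traceback and adds a guard; the function name `parse_blast_version` and the sample strings are hypothetical, and this is not the TEsorter code.

```python
import re
from typing import Optional

# Same pattern as in the traceback above.
BLAST_VERSION = re.compile(r"blast\S* ([\d.+]+)")


def parse_blast_version(out: str) -> Optional[str]:
    """Return the version string, or None when the output does not match."""
    match = BLAST_VERSION.search(out)
    return match.group(1) if match else None


print(parse_blast_version("blastn: 2.9.0+"))             # a matching line
print(parse_blast_version("blastn: command not found"))  # no version present
```

Returning `None` lets the caller report a missing or unrecognized BLAST installation instead of crashing.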

Nomenclature discrepancy?

Hi,
Here are the counts from the TE library genome.FLYE.sixLongest.fa.EDTA.TElib.fa

DNA/DTA	52
DNA/DTC	50
DNA/DTH	476
DNA/DTM	654
DNA/DTT	2722
DNA/Helitron	15
LTR/Gypsy	38
LTR/unknown	20
MITE/DTA	75
MITE/DTC	10
MITE/DTH	88
MITE/DTM	104
MITE/DTT	570

Then I ran RepeatMasker
RepeatMasker genome.FLYE.sixLongest.fa -no_is -pa 8 -lib genome.FLYE.sixLongest.fa.EDTA.TElib.fa

Here is the summary

==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements         1333       187637 bp    0.16 %
   SINEs:               20         1160 bp    0.00 %
   Penelope             63         3689 bp    0.00 %
   LINEs:              487        62803 bp    0.05 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex         12          561 bp    0.00 %
     R1/LOA/Jockey      23         2819 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B          50        23094 bp    0.02 %
     L1/CIN4           177        20812 bp    0.02 %
   LTR elements:       826       123674 bp    0.11 %
     BEL/Pao           105         7431 bp    0.01 %
     Ty1/Copia           2          131 bp    0.00 %
     Gypsy/DIRS1       256        55114 bp    0.05 %
       Retroviral      179        10844 bp    0.01 %

DNA transposons       2314       176348 bp    0.15 %
   hobo-Activator      689        43072 bp    0.04 %
   Tc1-IS630-Pogo      167        54954 bp    0.05 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac             18         2279 bp    0.00 %
   Tourist/Harbinger   249        12509 bp    0.01 %
   Other (Mirage,       24         1231 bp    0.00 %
    P-element, Transib)

Rolling-circles         77         8371 bp    0.01 %

Unclassified:           51         3907 bp    0.00 %

Total interspersed repeats:      367892 bp    0.32 %


Small RNA:             431       137483 bp    0.12 %

Satellites:            130         7935 bp    0.01 %
Simple repeats:      48930      1869437 bp    1.61 %
Low complexity:       9266       432567 bp    0.37 %
==================================================

The numbers for the DNA transposons do not seem to match.
For example, the non-redundant EDTA library reports more DNA elements than RepeatMasker does, but I would expect the opposite, since RepeatMasker should count every occurrence of each element. Or am I missing something?
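
For reference, per-class counts like the ones listed at the top of this post can be obtained by tallying the classification part of each FASTA header in the library. The sketch below assumes headers of the form `>name#Class/Superfamily`; that header layout, the function name, and the sample records are assumptions made for illustration. Note that this counts library sequences, not genomic copies, which is a different quantity from the element counts in the RepeatMasker summary.

```python
from collections import Counter
from typing import Iterable


def count_classifications(lines: Iterable[str]) -> Counter:
    """Count library sequences per classification (text after '#')."""
    counts = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence lines carry no classification
        fields = line[1:].split()
        header = fields[0] if fields else ""
        _, sep, classification = header.partition("#")
        counts[classification if sep else "unclassified"] += 1
    return counts


# Hypothetical library records, for illustration only.
demo = [">TE_1#DNA/DTA", "ACGT", ">TE_2#DNA/DTA", ">TE_3#LTR/Gypsy desc", ">TE_4"]
for name, n in sorted(count_classifications(demo).items()):
    print(f"{name}\t{n}")
```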

Identifying TIR uses only one CPU

Hi Shujun,
I am running EDTA.pl in a conda environment using --threads 30. The 'Identify LTR' step finished in less than one day, but the 'Identify TIR' step has been running for six days now. I've also noticed that this process is using only one CPU. Is this normal?

##### Extensive de-novo TE Annotator (EDTA) v1.7.9  ####
##### Shujun Ou ([email protected])             ####
########################################################



Mon Feb  3 19:37:57 -02 2020    Dependency checking:
                                All passed!

Mon Feb  3 19:38:41 -02 2020    Obtain raw TE libraries using various structure-based programs:
Mon Feb  3 19:38:41 -02 2020    EDTA_raw: Check dependencies, prepare working directories.

Mon Feb  3 19:38:53 -02 2020    Start to find LTR candidates.

Mon Feb  3 19:38:53 -02 2020    Identify LTR retrotransposon candidates from scratch.

Use of uninitialized value $chr_pre in hash element at /home/augustold/miniconda3/envs/EDTA/share/LTR_retriever/bin/call_seq_by_list.pl line 86.
Tue Feb  4 13:07:26 -02 2020    Finish finding LTR candidates.

Tue Feb  4 13:07:26 -02 2020    Start to find TIR candidates.

Tue Feb  4 13:07:26 -02 2020    Identify TIR candidates from scratch.

Species: others

Best wishes and thank you for providing this tool.

Happy EDTA users with successful cases

Hi all,

Just an update on the testing results. It seems that the new TIR release can close this issue.

  1. Please install a new env for the EDTA 20190802 release.
  2. Follow the steps that Shujun provided:
  • EDTA_raw
  • EDTA_processF
  • EDTA -step final
  3. The time and resource usage for my plant genome (336 Mb genome, 58% repeats as estimated by GenomeScope, 24-core machine):
Step          maxvmem      time (h)     raw_fa size
Helitron      7.914GB      2.352222     1.3Mb
MITE          1.529GB      1.815278     4.9kb
TIR           42.127GB     4.895556     20Mb
LTR           19.049GB     1.417222     2.5Mb
EDTA_Final    19.388GB     19.42389     19Mb
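
To put the figures above in one place, here is a small Python sketch that adds up the per-step times and finds the step with the highest memory use. It assumes the steps ran one after another, so the summed time approximates total compute time. If some steps ran concurrently, the wall-clock time would be lower. The numbers are copied from the table and the variable names are only illustrative.

```python
# (step, maxvmem in GB, time in hours), copied from the table above
steps = [
    ("Helitron", 7.914, 2.352222),
    ("MITE", 1.529, 1.815278),
    ("TIR", 42.127, 4.895556),
    ("LTR", 19.049, 1.417222),
    ("EDTA_Final", 19.388, 19.42389),
]

# Summed time across steps (assumes sequential execution)
total_hours = sum(hours for _, _, hours in steps)

# Step with the largest peak memory, which sets the RAM needed on the machine
peak_step, peak_gb, _ = max(steps, key=lambda row: row[1])

print(f"Total time: {total_hours:.1f} h")
print(f"Peak memory: {peak_gb} GB ({peak_step})")
```

On these numbers, the final step accounts for most of the run time, while the TIR step determines the memory requirement.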

Thanks for developing this tool.

Best,
Zhigui

Originally posted by @baozg in #4 (comment)

RepeatModeler version used

Hi !
Thank you very much for this great tool! I was really pleased to discover it.
I have comments/questions related to RepeatModeler.
The version available in bioconda was wrong until recently (I fixed it before Christmas).

The RepeatModeler fix involved a small update to the RepeatMasker recipe, which now also includes trf by default.
So I guess you could update the installation procedure:
conda install -n EDTA -y cd-hit repeatmodeler muscle mdust blast-legacy java-jdk perl perl-text-soundex multiprocess regex tensorflow=1.14.0 keras=2.2.4 scikit-learn=0.19.0 biopython pandas glob2 python=3.6

RepeatModeler 2.0 now supports LTR structural search using a combination of LTR_harvest and LTR_retriever. How will this affect EDTA's results? Do you have a benchmark? Should we avoid using RepeatModeler's LTR detection?
