oushujun / ltr_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons. The LTR Assembly Index (LAI) is also included in this package.

Home Page: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/

License: GNU General Public License v3.0

Perl 99.43% Dockerfile 0.57%
ltr-retrotransposons genome-annotation genome-assembly ltr-retriever lai

ltr_retriever's Introduction

Introduction

LTR_retriever is a command-line program (in Perl) for accurate identification of LTR retrotransposons (LTR-RTs) from outputs of LTRharvest, LTR_FINDER, MGEScan 3.0.0, LTR_STRUC, and LtrDetector. It generates a non-redundant LTR-RT library for genome annotation.

By default, the program will generate whole-genome LTR-RT annotation and the LTR Assembly Index (LAI) for evaluations of the assembly continuity of the input genome. Users can also run LAI separately (see Usage).

Installation

LTR_retriever is installation-free but requires dependencies: TRF, BLAST+, BLAST or CD-HIT, HMMER, RepeatMasker, and TEsorter. You may specify the path to these programs in the command line (run LTR_retriever -h for details) or install them in the following ways:

Quick installation using conda

Direct installation using the yml file:

conda env create -f /your_path_to/LTR_retriever/LTR_retriever.yml

Alternatively, you may use the conda recipe. Because of the large number of dependencies, the conda solver may take hours. Unfortunately, the conda recipe currently cannot be installed properly with mamba.

conda install -c bioconda -c conda-forge ltr_retriever

Step by step using conda

You may use conda to quickly install all dependencies and LTR_retriever is then good to go:

conda create -n LTR_retriever
conda activate LTR_retriever
conda install -y -c conda-forge perl perl-text-soundex
conda install -y -c bioconda cd-hit repeatmasker tesorter
git clone https://github.com/oushujun/LTR_retriever.git
./LTR_retriever/LTR_retriever -h

Standard installation

You can also provide the fixed paths to the following dependent programs.

  1. makeblastdb, blastn, and blastx in the BLAST+ package,
  2. cd-hit-est in the CDHIT package OR blastclust in the BLAST package,
  3. hmmsearch in the HMMER package (v3.1b2 or higher),
  4. RepeatMasker, and
  5. TEsorter.

Simply modify the 'paths' file in the LTR_retriever directory

vi /your_path_to/LTR_retriever/paths

Inputs

Two types of inputs are required for LTR_retriever:

  1. Genomic sequence
  2. LTR-RT candidates

LTR_retriever accepts several LTR-RT candidate inputs, including the screen output of LTRharvest and the screen output of LTR_FINDER. For outputs of other LTR identification programs, you may convert them to LTRharvest-like format and feed them to LTR_retriever (with -inharvest). Users need to obtain the input file(s) from the aforementioned programs before running LTR_retriever. Either a single input source or a combination of multiple inputs is acceptable. For more details and examples, please see the manual.

It's sufficient and recommended to use LTRharvest and LTR_FINDER results for LTR_retriever. However, if you want to analyze results from LTR_STRUC, MGEScan 3.0.0, and LtrDetector, you can use the following scripts to convert their outputs to the LTRharvest format, then feed LTR_retriever with -inharvest. You may concatenate multiple LTRharvest format inputs into one file. For instructions, run:

perl /your_path_to/LTR_retriever/bin/convert_ltr_struc.pl
perl /your_path_to/LTR_retriever/bin/convert_MGEScan3.0.pl
perl /your_path_to/LTR_retriever/bin/convert_ltrdetector.pl

Click to download executables for LTR_FINDER_parallel and LTRharvest.

Outputs

The output of LTR_retriever includes:

  1. Intact LTR-RTs with coordinate and structural information
    • Summary tables (.pass.list)
    • GFF3 format output (.pass.list.gff3)
  2. LTR-RT library
    • All non-redundant LTR-RTs (.LTRlib.fa)
    • All non-TGCA LTR-RTs (.nmtf.LTRlib.fa)
    • All LTR-RTs with redundancy (.LTRlib.redundant.fa)
  3. Whole-genome LTR-RT annotation by the non-redundant library
    • GFF format output (.out.gff)
    • LTR family summary (.out.fam.size.list)
    • LTR superfamily summary (.out.superfam.size.list)
    • LTR distribution on each chromosome (.out.LTR.distribution.txt)
  4. LTR Assembly Index (.out.LAI)

Usage

Best practice: It's highly recommended to use short and simple sequence names. For example, use letters, numbers, and _ to generate unique names shorter than 15 characters. If there are long sequence names, LTR_retriever will try to convert them for you, but the conversion does not always succeed.
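A quick pre-flight check of sequence names can catch problems before a long run. The sketch below is not part of LTR_retriever: the function name audit_sequence_names and its max_len default are illustrative, with 15 taken from the limit that LTR_retriever prints in its log when it reformats sequence IDs. It treats the whole header line as the name, so headers that carry a description after the ID are flagged as well.

```python
import re
from collections import Counter

# Letters, digits, and underscores only, as the best-practice note recommends.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+$")

def audit_sequence_names(fasta_lines, max_len=15):
    """Map each problematic FASTA header to the list of reasons it was flagged."""
    names = [line[1:].strip() for line in fasta_lines if line.startswith(">")]
    counts = Counter(names)
    problems = {}
    for name in names:
        reasons = []
        if len(name) > max_len:          # max_len is an assumed, adjustable limit
            reasons.append("too long")
        if not SAFE_NAME.match(name):
            reasons.append("unsafe characters")
        if counts[name] > 1:
            reasons.append("duplicate")
        if reasons:
            problems[name] = reasons
    return problems

# Toy input; for a real genome pass an open file handle instead of a list.
example = [">chr1", "ACGT", ">scaffold_0000000000001 len=5", "AC", ">chr1", "GG"]
for name, reasons in audit_sequence_names(example).items():
    print(name, "->", ", ".join(reasons))
```

For a real genome, something like audit_sequence_names(open("genome.fa")) should work, since the function only iterates over lines.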

To obtain raw input files with LTRharvest and LTR_FINDER_parallel:

/your_path_to/gt suffixerator -db genome.fa -indexname genome.fa -tis -suf -lcp -des -ssp -sds -dna
/your_path_to/gt ltrharvest -index genome.fa -minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 1 -similar 85 -vic 10 -seed 20 -seqids yes > genome.fa.harvest.scn
/your_path_to/LTR_FINDER_parallel -seq genome.fa -threads 10 -harvest_out -size 1000000 -time 300
cat genome.fa.harvest.scn genome.fa.finder.combine.scn > genome.fa.rawLTR.scn

To run LTR_retriever:

/your_path_to/LTR_retriever -genome genome.fa -inharvest genome.fa.rawLTR.scn -threads 10 [options]

To run LAI:

/your_path_to/LAI -genome genome.fa -intact genome.fa.pass.list -all genome.fa.out [options]

For more details about the usage and parameter settings, please see the help pages by running:

/your_path_to/LTR_retriever -h

/your_path_to/LAI -h

Or refer to the manual document.

For questions and issues, please see: https://github.com/oushujun/LTR_retriever/issues

Citations

If you find LTR_retriever useful, please cite:

Ou S. and Jiang N. (2018). LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons. Plant Physiol. 176(2): 1410-1422. open access

If you find LAI useful, please cite:

Ou S., Chen J. and Jiang N. (2018). Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. gky730. open access

ltr_retriever's People

Contributors

aseetharam, baozg, eernst, ghepardo, jebrosen, juke34, oushujun, with9

ltr_retriever's Issues

very large reduction in LTRs when using LTR_retriever

Hello! I am using LTR_retriever in conjunction with results from LTRharvest, and I have a general question about the LTR_retriever results. I have found that LTR_retriever greatly reduces the number of elements discovered, going from 1220 in the LTRharvest output file to a total of 41 in the LTRlib.fa and nmtf.LTRlib.fa files. When I run RepeatMasker with the LTRharvest output I get around 48% of total bases in my genome masked, whereas when I RepeatMask with the LTR_retriever output, just 6% of the bases are masked. I know that LTR_retriever is designed specifically to remove unreliable candidate sequences from the library, but with the drastic difference in results when I implement the program, I just want to make sure I'm doing everything correctly. We expect a high repeat content for our genome, but I know that LTRharvest can produce a lot of false positives.

Here are the commands I used for running LTRharvest and LTR_retriever:

$GENOMETOOLS suffixerator -db genome.fasta -indexname genome -tis -suf -lcp -des -ssp -sds -dna -memlimit 200GB

$GENOMETOOLS ltrharvest -index genome -gff3 genome.ltrharvest.gff3 -seqids yes -minlenltr 100 -maxlenltr 5000 -mindistltr 1000 -maxdistltr 20000 -similar 85 -mintsd 4 -motif tgca -motifmis 1 -overlaps best -outinner outinner/genome.ltrharvest.outinner.fasta -out genome.ltrharvest.fasta > genome.ltrharvest.out

LTR_retriever -genome genome.fasta -inharvest genome.ltrharvest.out -threads 20 1>ltrretriever.log 2>ltrretriever.err

Do the parameter values seem okay to you? What do you think about the differences in the RepeatMasker results between the LTRharvest and LTR_retriever libraries? I would appreciate any input you have. I am also happy to continue this conversation over email, but thought I would post it here initially in case it might be helpful for anyone else running the program.

Thanks so much!

Kayla

Error messages when running on LTR_finder and MGEScan data

I ran the script on data from MGEScan (filename LTR_out) which contains a large number of sequences all with ID ">mobile_genetic_element". It then gave the following result:
##########################

LTR_retriever v1.5

##########################

Contributors: Shujun Ou, Ning Jiang

Parameters: -genome Harm.fasta -inmgescan LTR_out.txt

Thu Nov 23 16:15:53 CET 2017 The longest sequence ID in the genome contains 141 characters, which is longer than the limit (15)
Trying to reformat seq IDs...
Attempt 1...
Thu Nov 23 16:15:55 CET 2017 Seq ID conversion successful!

Thu Nov 23 16:15:55 CET 2017 Start to convert inputs...
Use of uninitialized value $seq_ID in exists at /home/wolf/Desktop/Programs/LTR_retriever-master/bin/get_range.pl line 118, line 3.
Argument "NA" isn't numeric in numeric ge (>=) at /home/wolf/Desktop/Programs/LTR_retriever-master/bin/get_range.pl line 127, line 3.
Argument "NA" isn't numeric in numeric ge (>=) at /home/wolf/Desktop/Programs/LTR_retriever-master/bin/get_range.pl line 127, line 3.
Argument ">mobile_genetic_element1" isn't numeric in subtraction (-) at /home/wolf/Desktop/Programs/LTR_retriever-master/bin/get_range.pl line 134, line 3.
Illegal division by zero at /home/wolf/Desktop/Programs/LTR_retriever-master/bin/get_range.pl line 146, line 3.
ERROR: LOC list is empty.
Total candidates: 214
Total uniq candidates: 0

Thu Nov 23 16:15:58 CET 2017 Start to clean up candidates...
Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
Sequences containing tandem repeats will be discarded.

Error: Error while loading sequenceThu Nov 23 16:15:58 CET 2017 0 clean candidates remained

cp: cannot stat 'Harm.fasta.mod.retriever.scn.adj': No such file or directory
Thu Nov 23 16:15:58 CET 2017 No LTR was found in your data.

Thu Nov 23 16:15:58 CET 2017 All analyses were finished!

Any solutions?
Much appreciated

Errors in running LTR.identifier.pl

Hi Shujun,

I encountered the following problem when running LTR_retriever. I don't know if these errors affect the results, or how I should fix them.

Tue Nov 13 22:48:20 CST 2018 Modules 2-5: Start to analyze the structure of candidates...
The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

substr outside of string at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 317.
Use of uninitialized value $seed in index at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 321.
substr outside of string at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 317.
Use of uninitialized value $seed in index at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 321.
Use of uninitialized value $probTSD in uc at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 325.
substr outside of string at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 317.
Use of uninitialized value $seed in index at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 321.
Use of uninitialized value $probTSD in uc at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 325.
Use of uninitialized value $TSDlen in numeric le (<=) at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 334.
Use of uninitialized value $probTSD in uc at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 334.
Use of uninitialized value $probTSD in index at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 358.
Use of uninitialized value $TSDlen in addition (+) at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 358.
Use of uninitialized value $probTSD in index at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 359.
Use of uninitialized value $TSDlen in numeric lt (<) at /disk02/caix/src/LTR_retriever/LTR_retriever-1.9/bin/LTR.identifier.pl line 360.
Wed Nov 14 00:25:57 CST 2018 Intact LTR-RT found: 5490

Wed Nov 14 00:35:14 CST 2018 Module 6: Start to analyze truncated LTR-RTs...
Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
Use -notrunc if you don't want to keep them.

best,
Xu Cai

Questions about running LTR_retriever

Hi Shujun,

I have some questions about running LTR_retriever and would really appreciate your help.

  1. I don't fully understand redundancy. Does redundancy mean that one LTR-RT has many copies? Does removing redundancy mean keeping one of these copies?
  2. I used the results of LTR_finder and LTR_harvest as input. The candidates number 7565 (LTR_finder), 5507 (LTR_harvest), and 8246 (non-TGCA candidates from LTR_harvest); in the end, I get 2203 non-redundant LTR-RTs. I learned from your article that these two programs have some problems. I am very interested in your pipeline, but I don't know which of the final result files contains all the reliable intact LTR-RTs of the whole genome.
  3. In your paper, you describe 5 categories. Can I consider the non-TGCA LTR-RTs to be a type of intact LTR-RT as well?
  4. The non-redundant LTR library is very important. However, I would still like to know whether LTR_retriever can report how many copies exist of each LTR-RT in the non-redundant LTR library.

Best,

Jiahe Liu

Low LAI for a plant assembly with 100x PacBio data

Hi Shujun,

I assembled a 335 Mb diploid plant genome with Falcon-Unzip from 100x PacBio Sequel data, with a contig N50 of 9 Mb. After scaffolding with Hi-C data, I obtained a chromosome-scale assembly with a scaffold N50 of 32 Mb.
I used LTR_retriever to find the LTRs in this genome. 41.19% of the genome is LTR sequence, but the LAI is just 5.22 for this assembly.

It is very strange that such a contiguous genome assembly has an LAI of only 5.22. What causes the low LAI? Could it be assembly errors?

Here are the full commands I used.

# LTR_FINDER_parallel (it's super fast, thanks for sharing; I used the default parameters in the Perl script)
perl LTR_FINDER_parallel -seq genome.fa -threads 24 

# LTRharvest
gt suffixerator -db genome.fa -indexname genome.fa -tis -suf -lcp -des -ssp -sds -dna
gt ltrharvest -index genome.fa -similar 85 -vic 10 -seed 20 -seqids yes -minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -motifTGCA -motifmis 1 > ltr.harvest.scn

# LTR_retriever 
LTR_retriever -genome genome.fa -infinder genome.fa.finder.combine.scn -inharvest ltr.harvest.scn -threads 24

Here is the full log of the LTR_retriever

Thu Jul 11 14:21:09 CST 2019    Dependency checking: All passed!
Thu Jul 11 14:21:37 CST 2019    LTR_retriever is starting from the Init step.
Thu Jul 11 14:21:40 CST 2019    Start to convert inputs...
                                Total candidates: 4637
                                Total uniq candidates: 4637

Thu Jul 11 14:21:45 CST 2019    Module 1: Start to clean up candidates...
                                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                                Sequences containing tandem repeats will be discarded.

Thu Jul 11 14:25:34 CST 2019    3862 clean candidates remained

Thu Jul 11 14:25:34 CST 2019    Modules 2-5: Start to analyze the structure of candidates...
                                The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thu Jul 11 14:31:17 CST 2019    Intact LTR-RT found: 867

Thu Jul 11 14:31:32 CST 2019    Module 6: Start to analyze truncated LTR-RTs...
                                Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
                                Use -notrunc if you don't want to keep them.

Thu Jul 11 14:31:32 CST 2019    489 truncated LTR-RTs found
Thu Jul 11 14:32:26 CST 2019    189 truncated LTR sequences have added to the library

Thu Jul 11 14:32:26 CST 2019    Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
                                Total library sequences: 908
Thu Jul 11 14:35:24 CST 2019    Retained clean sequence: 907

Thu Jul 11 14:35:24 CST 2019    Sequence clustering for genome.fa.ltrTE ...
Thu Jul 11 14:35:24 CST 2019    Unique lib sequence: 880

Thu Jul 11 14:35:42 CST 2019    Module 6: Start to remove nested insertions in internal regions...
Thu Jul 11 14:37:29 CST 2019    Raw internal region size (bit): 2410618
                                Clean internal region size (bit): 1600078

Thu Jul 11 14:37:29 CST 2019    Sequence number of the redundant LTR-RT library: 2790
                                The redundant LTR-RT library size (bit): 6059706

Thu Jul 11 14:37:29 CST 2019    Module 8: Start to make non-redundant library...

Thu Jul 11 14:37:41 CST 2019    Final LTR-RT library entries: 796
                                Final LTR-RT library size (bit): 1876515

Thu Jul 11 14:37:41 CST 2019    Total intact LTR-RTs found: 867
                                Total intact non-TGCA LTR-RTs found: 87

Thu Jul 11 14:37:43 CST 2019    Start to annotate whole-genome LTR-RTs...
                                Use -noanno if you don't want whole-genome LTR-RT annotation.
######################################
### LTR Assembly Index (LAI) beta3.1 ###
######################################

Developer: Shujun Ou

Please cite:

Ou S., Chen J. and Jiang N. (2018). Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. gky730: https://doi.org/10.1093/nar/gky730

Parameters: -genome genome.fa -intact genome.fa.pass.list -all genome.fa.out -t 24 -q -blast /data/software/Anaconda3/envs/maker/bin/


Thu Jul 11 15:11:12 CST 2019    Dependency checking: Passed!
Thu Jul 11 15:11:12 CST 2019    Calculation of LAI will be based on the whole genome.
                                Please use the -mono parameter if your genome is a recent ployploid, for high identity between homeologues will overcorrect raw LAI scores.
Thu Jul 11 15:11:12 CST 2019    Estimate the identity of LTR sequences in the genome: quick mode
Thu Jul 11 20:47:46 CST 2019    The identity of LTR sequences: 93.6033346761206%
Thu Jul 11 20:47:46 CST 2019    Calculate LAI:

                                                Done!

Thu Jul 11 20:47:54 CST 2019    Result file: genome.fa.out.LAI

                                You may use either raw_LAI or LAI for intraspecific comparison

But the final LAI is very low. Here is the result:

Chr     From    To      Intact  Total   raw_LAI LAI
whole_genome    1       336252733       0.0169  0.4119  4.10    5.22
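
The values in this table can be checked by hand. The sketch below assumes the formulas given in the LAI documentation: raw LAI is the intact LTR-RT fraction divided by the total LTR fraction, times 100, and LAI adds 2.8138 × (94 − LTR identity in percent) to the raw value. The function names are illustrative, and the inputs are the rounded numbers printed in the log and table above, so small rounding differences from LAI's own output are possible.

```python
def raw_lai(intact_fraction, total_fraction):
    """Raw LAI: intact LTR-RT length as a percentage of total LTR sequence length."""
    return 100.0 * intact_fraction / total_fraction

def adjusted_lai(raw, ltr_identity_pct, slope=2.8138, reference_identity=94.0):
    """LAI after the identity adjustment (constants assumed from the LAI documentation)."""
    return raw + slope * (reference_identity - ltr_identity_pct)

# Inputs copied from the report: Intact, Total, and the logged LTR identity.
raw = raw_lai(0.0169, 0.4119)
lai = adjusted_lai(raw, 93.6033346761206)
print(f"raw_LAI = {raw:.2f}, LAI = {lai:.2f}")
```

Under these assumptions the reported numbers are internally consistent: the low LAI follows directly from intact LTR-RTs covering only about 1.7% of the genome while all LTR sequence covers about 41%.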

Output gff parameters

Hi Shujun,
I have a few questions about running LAI. I am new to LTR analysis, so I hope you can help me clarify a few things:

GFF output file:

  1. What is the meaning of Diversity(%) and SW_score in the GFF output?
  2. If the sequence region has 'INT' as a suffix, does that mean "intact LTR"?
    It would be helpful to have an explanation of the output files in the manual.

LAI output:

  1. When you calculate the whole-genome raw LAI and LAI, do you use the mean value of all the scaffolds? If my input file is a contig file and many short contigs have no LTR (thus raw LAI and LAI = 0), will that affect the whole-genome raw LAI and LAI?
  2. I do not fully understand the correction of raw LAI that you mention in the paper ("LAI score is correlated with the activities of LTR-RT ..."). Is this correction biased between autopolyploid and allopolyploid plants?
  3. What do the decimal numbers for Intact and Total mean in the LAI output? Percentage of this LTR in this sequence?

Repeatability:

If I run the analysis on the same genome more than once, I may get different LAI scores. Is that normal? Is it because a random seed is used in the script?

I get both ".mod.out" and ".out" as output. What is the difference?

About finder and harvest

Is there a way to estimate the False positive rate in these two outputs based on your retriever output?

Length ratio between LTR and the internal region (-max_ratio)

Greetings!

I have my LTRharvest and LTR_finder results ready to feed to LTR_retriever. I have one observation and a few questions based on it, before I use your software. Please read on:

Observation [from your manuscript]:
In your LTR_retriever paper, it says "Empirically, more than 95% of LTR-RTs range from 1 to 15 kb"

In your user manual, here on Github, default parameters are:
-minlen [INT] = specify the minimum length (bp) of the LTR region (100)
-max_ratio [FLOAT] = specify the maximum length ratio of the internal region over the LTR region (default 50)

Questions:

1. Would it not be better to use --max_ratio = 15KB / 100bp = 150?

2. Do you think / know if there will be any drawbacks in using --max_ratio 150, rather than your default --max_ratio 50?

3. In the remaining 5% of LTR-RTs, how much shorter and longer can they be, based on your empirical determination?

4. Can the internal length be specified independently of the flanking LTR length? I mean, is there a --max-length parameter that can override --max_ratio if there are LTR-RT outliers in terms of internal sequence length?

Thanks, in advance, Shujun!
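
The arithmetic behind question 1 can be spelled out. The sketch below assumes that the ratio compares the internal region with a single LTR and that both LTRs of an element have the same length; the help text quoted above does not settle either point, and the function names are illustrative. Under these assumptions a 15 kb element with 100 bp LTRs has a ratio of 148 rather than 150, because the two LTRs are part of the 15 kb.

```python
def internal_to_ltr_ratio(element_len, ltr_len):
    """Ratio of internal-region length to the length of one LTR (assumed definition)."""
    internal = element_len - 2 * ltr_len   # element = 5' LTR + internal + 3' LTR
    return internal / ltr_len

def max_element_length(ltr_len, max_ratio):
    """Longest element that still satisfies internal / LTR <= max_ratio."""
    return 2 * ltr_len + max_ratio * ltr_len

print(internal_to_ltr_ratio(15000, 100))   # 15 kb element, 100 bp LTRs
print(max_element_length(100, 50))         # default ratio with the shortest allowed LTR
print(max_element_length(1000, 50))        # default ratio with 1 kb LTRs
```

Read this way, the ratio limit only becomes restrictive for elements with very short LTRs; with longer LTRs the default of 50 already allows elements far beyond 15 kb.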

INT flag

Does the "_INT" flag in *.LTRlib.fa mark the internal sequence of a TE?

Use of uninitialized value $ac/$bd

I am getting the following error while running v2.0. I don't know what caused it. Please help me.
Error: Argument "num_threads". Illegal value, expected (>=1 and =<32): `40'
Use of uninitialized value $bd in substitution (s///) at /data2/zhaoJing/bin/LTR_retriever-2.0/bin/LTR.identifier.pl line 301.
Use of uninitialized value $ac in substitution (s///) at /data2/zhaoJing/bin/LTR_retriever-2.0/bin/LTR.identifier.pl line 300.
Use of uninitialized value $ac in pattern match (m//) at /data2/zhaoJing/bin/LTR_retriever-2.0/bin/LTR.identifier.pl line 303.
Use of uninitialized value $bd in pattern match (m//) at /data2/zhaoJing/bin/LTR_retriever-2.0/bin/LTR.identifier.pl line 303.

LTR_retriever reports redundant/duplicated intact LTR-RT from inputs of both LTR_finder and LTR_harvest

Hi Shujun,

When results from both LTR_finder and LTR_harvest were given to LTR_retriever, I found a few likely duplicated intact LTR-RT entries in the *pass.list and *pass.list.gff3 files, which is interesting:

like this:

tig00000022_1:257836..262566 pass motif:TGCA TSD:ACTAC 257831..257835 262567..262571 IN:258538..261865 0.9672 - unknown NA 1289956
tig00000022_1:257836..262566 pass motif:TGCA TSD:GTAGT 257831..257835 262567..262571 IN:258538..261865 0.9673 ? unknown NA 1285934

and this:

tig00000209:486380..491904 pass motif:TGCA TSD:TTG 486377..486379 491905..491907 IN:486636..491647 0.9804 - unknown LTR 763871
tig00000209:486385..491904 pass motif:TGCA TSD:NA .. .. IN:486636..491652 0.9802 - unknown LTR 771771

tig00000241:1060515..1070650 pass motif:TGCA TSD:TTTGT 1060510..1060514 1070651..1070655 IN:1060827..1070338 0.9904 - unknown LTR 371614
tig00000241:1061545..1066430 pass motif:TGCA TSD:AAAAC 1061540..1061544 1066431..1066435 IN:1061687..1066288 0.993 - unknown LTR 270495

It seems that only some features (e.g. the TSD) differ between the two redundant entries, while their locations in the genome are almost the same.

Although the number of likely duplicated intact LTR-RTs is low (5 of 497 candidates), I think it is still good to ensure the results are reliable. How do I know which prediction is the better or proper one, so that I can remove the duplicate?

Many thanks,

Hongbo
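
Entries like these can be listed automatically. The sketch below is not an LTR_retriever feature: it parses the first column of *.pass.list lines (sequence:start..end, as in the examples above) and reports pairs on the same sequence whose overlap covers at least 90% of the longer element. The function names and the 0.9 threshold are illustrative assumptions; deciding which entry of a pair to keep is left to the user.

```python
import re
from collections import defaultdict

# First column of a pass.list line, e.g. "tig00000209:486380..491904"
LOCUS = re.compile(r"^(\S+):(\d+)\.\.(\d+)\s")

def parse_loci(lines):
    """Return (sequence, start, end) for every line that starts with a locus."""
    loci = []
    for line in lines:
        m = LOCUS.match(line)
        if m:
            loci.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return loci

def near_duplicates(loci, min_overlap=0.9):
    """Pairs on the same sequence whose overlap is >= min_overlap of the longer one."""
    by_seq = defaultdict(list)
    for seq, start, end in loci:
        by_seq[seq].append((start, end))
    pairs = []
    for seq, spans in by_seq.items():
        spans.sort()
        for i, (s1, e1) in enumerate(spans):
            for s2, e2 in spans[i + 1:]:
                if s2 > e1:
                    break  # sorted by start, so no later span overlaps this one
                overlap = min(e1, e2) - s2 + 1
                longer = max(e1 - s1, e2 - s2) + 1
                if overlap / longer >= min_overlap:
                    pairs.append((seq, (s1, e1), (s2, e2)))
    return pairs

lines = [
    "tig00000209:486380..491904 pass motif:TGCA TSD:TTG",
    "tig00000209:486385..491904 pass motif:TGCA TSD:NA",
    "tig00000241:1060515..1070650 pass motif:TGCA TSD:TTTGT",
    "tig00000241:1061545..1066430 pass motif:TGCA TSD:AAAAC",
]
for pair in near_duplicates(parse_loci(lines)):
    print(pair)
```

With this threshold the tig00000209 pair is reported, while the tig00000241 pair is not: there the shorter element sits inside the longer one and covers less than half of it, which looks more like a nested insertion than a duplicate.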

Problem with the RMblast engine dependency

Hi, Shujun,

I am facing the same problem with the RMblast engine dependency. I tried to follow Nancy's suggestions in #43. I don't get the Taxononmy::new() error, and I successfully got the following files:

dummy060817.fa.926962 dummy060817.fa.926962.masked dummy060817.fa.926962.nin dummy060817.fa.926962.ori.out dummy060817.fa.926962.tbl
dummy060817.fa.926962.cat dummy060817.fa.926962.nhr dummy060817.fa.926962.nsq dummy060817.fa.926962.out

However, when I run ltrretriever 2.5 and ltrretriever 2.1, I still get the error:
Dependency checking: The RMblast engine is not installed in RepeatMasker!

I don't have this error when I use ltrretriever 1.6 and 20170514.

Best,

Ying

No such file or directory output_by_list.pl line 35

Hi Shujun,
I am getting the following error:
ERROR: No such file or directory at /usr/local/apps/eb/LTR_retriever/1.6-foss-2016b/bin/output_by_list.pl line 35.

The simplified command that I am using is this one:
LTR_retriever -genome $genome_file -inharvest $genome.ltrharvest.scn -infinder $genome.ltrfinder.scn -nonTGCA $genome.ltrharvest.nonTGCA.scn > $genome_file.LTR_retriever.out

It seems that $genome_file.nmtf.ltrTE is missing, but while LTR_retriever is running, the following files with names matching "$genome_file.nmtf.ltrTE*" are generated:

$genome_file.nmtf.ltrTE.LTR Sa_scaffold.fa.nmtf.ltrTE.LTR.masked $genome_file.nmtf.ltrTE.LTR.ori.out Sa_scaffold.fa.nmtf.ltrTE.LTR.tbl $genome_file.nmtf.ltrTE.fa.cleanup Sa_scaffold.fa.nmtf.ltrTE.stg1
$genome_file.nmtf.ltrTE.LTR.cat Sa_scaffold.fa.nmtf.ltrTE.LTR.masked.cleanup $genome_file.nmtf.ltrTE.LTR.out Sa_scaffold.fa.nmtf.ltrTE.fa Sa_scaffold.fa.nmtf.ltrTE.pass.list

Thank you for your help!
Cristina

too many LTR/unknown and most of LTR/unknown are classified as LTR/Copia

Thousands of LTRs in a plant genome are classified as unknown by LTR_retriever. However, most of them are classified as Copia on the basis of GyDB, as shown below:

# *.retriever.scn.extend.fa.aa
  Count LTR_retriever   GyDB
    927 LTR     Copia   LTR     Copia
      2 LTR     Copia   LTR     Gypsy
     41 LTR     Gypsy   -       -
      1 LTR     Gypsy   LTR     Caulimoviridae
      5 LTR     Gypsy   LTR     Copia
   2266 LTR     Gypsy   LTR     Gypsy
      9 LTR     Gypsy   LTR     unknown
      5 LTR     unknown -       -
   1248 LTR     unknown LTR     Copia
     21 LTR     unknown LTR     Gypsy
      5 mixture Copia   -       -
     27 mixture Copia   LTR     Copia
      1 mixture Copia   LTR     Gypsy
      1 mixture Copia   Unknown unknown
     85 mixture Gypsy   LTR     Gypsy
      1 mixture unknown -       -
     14 mixture unknown LTR     Copia
      2 mixture unknown LTR     Gypsy
    352 notLTR  unknown -       -
      1 notLTR  unknown LTR     Caulimoviridae
      8 notLTR  unknown LTR     Copia
     17 notLTR  unknown LTR     Gypsy
     43 -       -       LTR     Copia   
    150 -       -       LTR     Gypsy 
      1 -       -       LTR     unknown 
      2 -       -       Unknown unknown 

I think there is an issue in annotate_TE.pl:

	$family="Gypsy" if ($gypsy>$copia and $copia/$gypsy<0.3);
	$family="Copia" if ($copia>$gypsy and $gypsy/$copia<0.3);

Copia has the same weight (0.3) as Gypsy, but Copia only has 8 PFAMs, ~1/3 of the 28 PFAMs of Gypsy.
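
To make the concern concrete, here is a Python rendering of the two quoted Perl lines next to a hypothetical variant that scales each count by the number of profiles available for that superfamily (8 for Copia and 28 for Gypsy, as stated above). It assumes that $gypsy and $copia are hit counts and that the family defaults to "unknown"; neither is confirmed by the quoted snippet, and the normalized rule is only one possible adjustment, not a proposed patch.

```python
def call_family(gypsy, copia, cutoff=0.3):
    """Mirror of the two quoted Perl lines; 'unknown' default is an assumption."""
    family = "unknown"
    if gypsy > copia and copia / gypsy < cutoff:
        family = "Gypsy"
    if copia > gypsy and gypsy / copia < cutoff:
        family = "Copia"
    return family

def call_family_normalized(gypsy, copia, n_gypsy=28, n_copia=8, cutoff=0.3):
    """Hypothetical variant: compare hits per available profile instead of raw hits."""
    return call_family(gypsy / n_gypsy, copia / n_copia, cutoff)

# An element with 5 Copia hits and 3 Gypsy hits:
print(call_family(3, 5))             # raw ratio 3/5 = 0.6 is not below the cutoff
print(call_family_normalized(3, 5))  # per-profile ratio is about 0.17
```

In this toy case the raw rule leaves the element unclassified while the per-profile rule calls it Copia, which is the kind of shift the table above hints at.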

get flanking regions of all LTRs

Hello,

First, thanks for the really useful program :)

I would like to generate a library of all the upstream and downstream flanking regions adjacent to each LTR insertion in my reference genome. So, the downstream flanking region from the 5' LTR and the upstream flanking region from the 3' LTR (if that is clear?). I can then look for reads that span these genome-LTR boundaries in other sequencing datasets, to test for presence/absence of each LTR insertion.

What would be your recommended approach, based on the output of LTR_retriever? For example, the file ".pass.list.gff3" has very clear structural components for intact LTR-RTs, including both 5' and 3' LTRs themselves, but this file only has intact LTRs I think. The whole-genome annotation file "*.out.gff" has many more candidates, but it is a bit unclear what is exactly in this file: e.g., just LTRs themselves, or possibly other parts of the LTR-RT, including internal CDS and/or the whole element? Also, this file might contain internal and/or nested LTR-RTs, which might confuse things. Maybe the file "*.LTRlib.nonredundant.fa" is a better way to start, using BLAST to get the genomic coordinates for each LTR entry in this file. In this case, am I correct to say that the "*.LTRlib.fa" files contain only the LTR regions themselves?

Any advice would be much appreciated, and please let me know if I've misinterpreted some of the output files discussed above.

Many thanks for your time,
reubwn
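
One way to approach this for intact elements is to take the element coordinates from *.pass.list and compute a window across each genome-LTR junction. The sketch below is only an illustration under stated assumptions: coordinates are 1-based and inclusive with start < end, the element is longer than the flank, and the name junction_windows, the flank default, and the optional seq_len argument are all hypothetical. Each window spans flank bases of genomic sequence (including the TSD) plus flank bases of the adjacent LTR.

```python
def junction_windows(start, end, flank=100, seq_len=None):
    """Return 1-based inclusive windows across the 5' and 3' element boundaries."""
    # Left junction: `flank` genomic bases, then the first `flank` bases of the 5' LTR.
    left = (max(1, start - flank), start + flank - 1)
    # Right junction: the last `flank` bases of the 3' LTR, then `flank` genomic bases.
    right_end = end + flank if seq_len is None else min(seq_len, end + flank)
    right = (end - flank + 1, right_end)
    return left, right

# Element coordinates taken from a pass.list entry such as tig00000022_1:257836..262566
left, right = junction_windows(257836, 262566, flank=100)
print("left junction:", left)
print("right junction:", right)
```

The resulting intervals could then be extracted from the genome and used as targets for reads that span the boundary. This covers only intact elements; it says nothing about the contents of the *.out.gff annotation.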

Extract intact Ltrs to fasta

Hi, I'm trying to extract my results from the GFF3 file to FASTA and I get this:

$ bedtools getfasta -fi Muschr4.fsa -bed Muschr4.fsa.mod.pass.list.gff3 -fo ERVCom.fa

WARNING. chromosome (seq0) was not found in the FASTA file. Skipping.
WARNING. chromosome (seq0) was not found in the FASTA file. Skipping.
WARNING. chromosome (seq0) was not found in the FASTA file. Skipping.
WARNING. chromosome (seq0) was not found in the FASTA file. Skipping.
WARNING. chromosome (seq0) was not found in the FASTA file. Skipping.
WARNING. chromosome (seq0) was not found in the FASTA file. Skipping.
WARNING. chromosome (seq0) was not found in the FASTA file. Skipping.
WARNING. chromosome (seq0) was not found in the FASTA file. Skipping.
WARNING. chromosome (seq0) was not found in the FASTA file. Skipping.

Is there another way to get the intact LTRs into FASTA?
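
The warnings say that the sequence name used in the GFF3 (seq0) does not occur in the FASTA file, so any extraction needs coordinates whose names match the FASTA headers. One alternative is to work from the *.pass.list coordinates directly. The sketch below is a minimal pure-Python version, assuming the first column has the form sequence:start..end with 1-based inclusive coordinates and that the names match the first word of the FASTA headers. The function names are illustrative, and strand is ignored, so sequences are returned as they appear in the genome.

```python
import re

# First column of a pass.list line, e.g. "chr1:3..8"
LOCUS = re.compile(r"^(\S+):(\d+)\.\.(\d+)\s")

def read_fasta(lines):
    """Read FASTA lines into a dict keyed by the first word of each header."""
    chunks, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            chunks[name] = []
        elif name is not None and line:
            chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}

def extract_intact(fasta_lines, pass_list_lines):
    """Return (label, sequence) for every locus found in the pass.list lines."""
    genome = read_fasta(fasta_lines)
    records = []
    for line in pass_list_lines:
        m = LOCUS.match(line)
        if not m:
            continue  # header or comment line
        seq_id, start, end = m.group(1), int(m.group(2)), int(m.group(3))
        if seq_id not in genome:
            raise KeyError(f"{seq_id} is not a sequence name in the FASTA file")
        records.append((f"{seq_id}:{start}..{end}", genome[seq_id][start - 1:end]))
    return records

fasta = [">chr1 toy example", "ACGTACGTAC", "GGGTTT"]
pass_list = ["#LTR_loc Category", "chr1:3..8 pass motif:TGCA"]
for label, seq in extract_intact(fasta, pass_list):
    print(f">{label}\n{seq}")
```

A name mismatch raises an error here instead of being skipped, which makes the kind of problem shown in the warnings visible immediately.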

Change my GFF3 legacy

Hi, I'm already running LTR_retriever and I'm happy with the results. Now I'm also trying to run LTRdigest on my [result.mod.pass.list.gff3], but it doesn't accept the legacy-style GFF3 from LTR_retriever.
$ gt ltrdigest: error: no description matched sequence ID 'seq0'
When using just LTRharvest, a solution (see "genometools/genometools#882") was to force LTRharvest to output its candidates in the current format, using the -tabout no and -seqids options. I'm wondering whether there is an option to do this in LTR_retriever.
I want to use LTRdigest so I can take advantage of the pHMM files this program uses.
Thx

Questions about *.pass.list

Hi Shujun,

From the log file, I understand that *.pass.list contains the whole-genome LTR-RTs (redundant). I found that some LTR-RTs have the same coordinates but different TSDs.
C01:2584769..2588663 pass motif:TGCA TSD:ATTAC 2584764..2584768 2588664..2588668 IN:2585668..2587763 0.9911 ? unknown NA 344355
C01:2584769..2588663 pass motif:TGCA TSD:GTAAT 2584764..2584768 2588664..2588668 IN:2585668..2587763 0.9911 - unknown NA 344355

I don't know why this happens. Are both results reliable LTR-RTs?

Best,
Jiahe

Blastn warning: subject sequence contains no data, and final retained clean sequence: 0

I downloaded the latest master version of LTR_retriever today and am trying to run it on my plant genome sequence with LTR_retriever -genome genome.fasta -inharvest genome.ltrharvest.scn -infinder genome.finder.scn -notrunc -threads 4 -v

It runs with several warnings and finally reports retained clean sequence: 0

Here are all the files it created:

-rw------- 1 rimjhim rimjhim 336732337 Jun  7 16:39 genome.fasta
-rw------- 1 rimjhim rimjhim   1221771 Jun  7 19:42 genome.fasta.retriever.scn
-rw------- 1 rimjhim rimjhim   1100599 Jun  7 19:42 genome.fasta.retriever.scn.list
-rw------- 1 rimjhim rimjhim    531646 Jun  7 19:42 genome.fasta.retriever.scn.full
-rw------- 1 rimjhim rimjhim  90860431 Jun  7 19:42 genome.fasta.ltrTE.fa
-rw------- 1 rimjhim rimjhim     88066 Jun  7 19:48 genome.fasta.ltrTE.fa.cleanup
-rw------- 1 rimjhim rimjhim  77634137 Jun  7 19:48 genome.fasta.ltrTE.stg1
-rw------- 1 rimjhim rimjhim    441393 Jun  7 19:48 genome.fasta.retriever.scn.extend
-rw------- 1 rimjhim rimjhim  78564160 Jun  7 19:48 genome.fasta.retriever.scn.extend.fa
-rw------- 1 rimjhim rimjhim 159268484 Jun  7 19:50 genome.fasta.retriever.scn.extend.fa.aa
-rw------- 1 rimjhim rimjhim   4871577 Jun  7 19:51 genome.fasta.retriever.scn.extend.fa.aa.tbl
-rw------- 1 rimjhim rimjhim  12960211 Jun  7 19:51 genome.fasta.retriever.scn.extend.fa.aa.scn
-rw------- 1 rimjhim rimjhim    761839 Jun  7 19:51 genome.fasta.retriever.scn.extend.fa.aa.anno
-rw------- 1 rimjhim rimjhim   2850208 Jun  7 20:14 genome.fasta.defalse
-rw------- 1 rimjhim rimjhim   1775029 Jun  7 20:14 genome.fasta.retriever.scn.adj
-rw------- 1 rimjhim rimjhim    354229 Jun  7 20:14 genome.fasta.ltrTE.pass.list
-rw------- 1 rimjhim rimjhim  18232267 Jun  7 20:14 genome.fasta.ltrTE.pass
-rw------- 1 rimjhim rimjhim    399189 Jun  7 20:16 genome.fasta.ltrTE.pass.clust.clstr
-rw------- 1 rimjhim rimjhim   9189982 Jun  7 20:16 genome.fasta.ltrTE.stg2
-rw------- 1 rimjhim rimjhim     41920 Jun  7 20:16 genome.fasta.ltrTE.trunc.list
-rw------- 1 rimjhim rimjhim     94880 Jun  7 20:16 genome.fasta.retriever.scn.adj.list
-rw------- 1 rimjhim rimjhim   7174609 Jun  7 20:16 genome.fasta.ltrTE.trunc
-rw------- 1 rimjhim rimjhim     74082 Jun  7 20:16 genome.fasta.ltrTE.veryfalse.list
-rw------- 1 rimjhim rimjhim     25416 Jun  7 20:16 genome.fasta.ltrTE.veryfalse
-rw------- 1 rimjhim rimjhim   4941076 Jun  7 20:16 genome.fasta.ltrTE.veryfalse.fa
-rw------- 1 rimjhim rimjhim  14131058 Jun  7 20:16 genome.fasta.ltrTE.mask.lib
-rw------- 1 rimjhim rimjhim         0 Jun  7 20:16 genome.fasta.ltrTE.trunc.cln
-rw------- 1 rimjhim rimjhim         0 Jun  7 20:16 genome.fasta.ltrTE.trunc.masked.cleanup
-rw------- 1 rimjhim rimjhim   9189982 Jun  7 20:16 genome.fasta.ltrTE.stg3.cln
-rw------- 1 rimjhim rimjhim    927563 Jun  7 20:19 genome.fasta.ltrTE.stg3.line.out
-rw------- 1 rimjhim rimjhim    865686 Jun  7 20:22 genome.fasta.ltrTE.stg3.dna.out
-rw------- 1 rimjhim rimjhim   1793249 Jun  7 20:22 genome.fasta.ltrTE.stg3.otherTE.out
-rw------- 1 rimjhim rimjhim         0 Jun  7 20:22 genome.fasta.ltrTE.stg3.cln.exclude.list
-rw------- 1 rimjhim rimjhim         0 Jun  7 20:22 genome.fasta.ltrTE.stg3.cln.clean
-rw------- 1 rimjhim rimjhim         0 Jun  7 20:22 genome.fasta.ltrTE.stg3.plantP.out
-rw------- 1 rimjhim rimjhim        47 Jun  7 20:22 genome.fasta.ltrTE.stg3.cln.clean.exclude.list
-rw------- 1 rimjhim rimjhim         0 Jun  7 20:22 genome.fasta.ltrTE
-rw------- 1 rimjhim rimjhim      7822 Jun  7 20:22 genome.fasta.ltrTE.pass.nmtf.list
-rw------- 1 rimjhim rimjhim      7822 Jun  7 20:22 genome.fasta.nmtf.pass.list
-rw------- 1 rimjhim rimjhim    354229 Jun  7 20:22 genome.fasta.pass.list

And this is the output I got.

##########################
### LTR_retriever v1.2 ###
##########################

Contributors: Shujun Ou, Ning Jiang

Parameters: -genome genome.fasta -inharvest genome.ltrharvest.scn -infinder genome.finder.scn -threads 4 -v


Previous LTR_retriever results found, backed up to LTRretriever-pre06-07-17_1942

Wed Jun  7 19:42:12 CEST 2017	Start to convert inputs...
				Total candidates: 12077
				Total uniq candidates: 11267

Wed Jun  7 19:42:16 CEST 2017	Start to clean up candidates...
				Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
				Sequences containing tandem repeats will be discarded.

Wed Jun  7 19:48:24 CEST 2017	9370 clean candidates remained

Wed Jun  7 19:48:24 CEST 2017	Start to analyze the structure of candidates...
				The terminal motif, TSD, boundary, orientation, age, and family will be identified in this step.

BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr5:12326794..12333995|chr5:12326844..12333945: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr6:21651306..21669448|chr6:21651356..21669398: Subject sequence contians no data
Warning: [blastn] Subject_1 chr6:22457268..22468094|chr6:22457318..22468044: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr6:33073266..33087676|chr6:33073316..33087626: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr7:20294844..20300297|chr7:20294894..20300247: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr7:25108611..25136833[1]: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr7:36106845..36124957|chr7:36106895..36124907: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr8:4633806..4645912|chr8:4633856..4645862: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 scaffold286:18245..31457|scaffold286:18295..31407: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
Wed Jun  7 20:14:11 CEST 2017	Intact LTR found: 2670

Wed Jun  7 20:16:48 CEST 2017	Start to analyze truncated LTRs...
				Truncated LTRs without the intact version will be retained in the LTR library.
				Use -notrunc if you don't want to keep them.

Wed Jun  7 20:16:48 CEST 2017	884 truncated LTRs found
ERROR: No such file or directory at /home/rimjhim/Softwares/LTR_retriever-master/bin/cleanup.pl line 50.
Wed Jun  7 20:16:54 CEST 2017	0 truncated LTR sequences have added to the library

Wed Jun  7 20:16:54 CEST 2017	Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
				Total library sequences: 1865
sort: multi-character tab ‘$\t’
ERROR: LOC list is empty.
Warning: [blastx] Query is Empty!
Wed Jun  7 20:22:07 CEST 2017	Retained clean sequence: 0

ERROR: 2670 intact LTRs have found, but the pre-library file genome.fasta.ltrTE is empty.
Something is wrong at this point. Please report the bug to https://github.com/oushujun/LTR_retriever/issues
Program halt!

I am not sure why it is complaining that subject sequence contains no data and why the file genome.fasta.ltrTE is empty. Any suggestions?

Sequence ID length problem

Hi. LTR_retriever stops with a RepeatMasker error resulting from a sequence ID longer than 50 characters in the file xxx.fa.mod.ltrTE.trunc. The ID in question is >LSRX01000097.1:1074794..1082843|LSRX01000097.1:1074380..1083242[IN]

Note that the original sequence IDs are not particularly long, but the large coordinate numbers make the derived IDs long. Is there a fix for this problem? Below is the whole output, including the RepeatMasker test run.

Thanks. Claudio

##########################

LTR_retriever v1.8.0

##########################

Contributors: Shujun Ou, Ning Jiang

Please cite: Ou S, Jiang N: LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiology 2018, 176:1410-1422
Parameters: -genome /home/manager/BigShare/dinos/11-200.fa -infinder /home/manager/LTR_Finder/source/11-200.finder.scn

Thu May 31 19:51:58 CST 2018 Dependency checking: All passed!
Thu May 31 19:52:52 CST 2018 The longest sequence ID in the genome contains 109 characters, which is longer than the limit (15)
Trying to reformat seq IDs...
Attempt 1...
Thu May 31 19:53:12 CST 2018 Seq ID conversion successful!

Thu May 31 19:53:12 CST 2018 Start to convert inputs...
Total candidates: 173
Total uniq candidates: 173

Thu May 31 19:53:25 CST 2018 Module 1: Start to clean up candidates...
Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
Sequences containing tandem repeats will be discarded.

Thu May 31 19:53:31 CST 2018 145 clean candidates remained

Thu May 31 19:53:31 CST 2018 Modules 2-5: Start to analyze the structure of candidates...
The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thu May 31 19:54:13 CST 2018 Intact LTR-RT found: 118

Can't remove /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.pass.clust: Text file busy, skipping file.
Thu May 31 19:54:30 CST 2018 Module 6: Start to analyze truncated LTR-RTs...
Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
Use -notrunc if you don't want to keep them.

Thu May 31 19:54:30 CST 2018 4 truncated LTR-RTs found
Can't remove /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.trunc: Text file busy, skipping file.
Warning: LOC list /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.veryfalse is empty.
ERROR: RepeatMasker is not running properly!
Please check the file /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.mask.lib and /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.trunc and test run:
RepeatMasker -e ncbi -q -pa 4 -no_is -norna -nolow -div 40 -lib /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.mask.lib -cutoff 225 /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.trunc
Please report errors to https://github.com/oushujun/LTR_retriever/issues
Program halt!

manager@sb:~/RepeatMasker$ ./RepeatMasker -e ncbi -q -pa 4 -no_is -norna -nolow -div 40 -lib /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.mask.lib -cutoff 225 /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.trunc
RepeatMasker version open-4.0.7
Search Engine: NCBI/RMBLAST [ 2.2.27+ ]
Master RepeatMasker Database: /home/manager/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127-rb20170127 )
Custom Repeat Library: /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.mask.lib

analyzing file /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.trunc
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
at ./RepeatMasker line 718.
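To see which records trip the limit named in the RepeatMasker message (max id length = 50), the headers of the *.ltrTE.trunc file can be scanned before the test run. This is only a sketch: it assumes that the identifier RepeatMasker measures is the first whitespace-delimited word after ">", and long_ids is a hypothetical helper name.

```python
def long_ids(lines, limit=50):
    """Return (ID, length) for every FASTA header whose ID exceeds `limit` characters."""
    hits = []
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            seq_id = words[0] if words else ""
            if len(seq_id) > limit:
                hits.append((seq_id, len(seq_id)))
    return hits


# Demo with the ID quoted in the post plus a made-up short record.
demo = [
    ">LSRX01000097.1:1074794..1082843|LSRX01000097.1:1074380..1083242[IN]",
    "ACGT",
    ">short_id",
    "ACGT",
]
for seq_id, length in long_ids(demo):
    print(f"{length} characters: {seq_id}")

# For a real file, pass the open handle, e.g.:
#   with open("xxx.fa.mod.ltrTE.trunc") as handle:
#       print(long_ids(handle))
```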

ERROR: Fail to convert seq IDs to less than 15 character

I want to run my LTRharvest results through LTR_retriever, and 1 of 22 genomes gives me this problem:

$$$ ERROR: Fail to convert seq IDs to less than 15 characters! Please provide a genome with shorter seq IDs.
In LTRharvest I used:

gt ltrharvest -index 1.fna -seqids Yes tabout no -seed 30 -xdrop 5 -mat 2 -mis -2 -ins -3 -del -3 -minlenltr 100 -maxlenltr 7000 -mindistltr 1000 -maxdistltr 15000 -similar 80.0 -overlaps no -mintsd 4 -maxtsd 20 -motif TGCA -motifmis 1 -vic 60 > 1.harvest.scn

In LTR_retriever:
perl LTR_retriever -genome /medicina/wocana/Tesis/Secuencias/Retriever/1/1.fna -inharvest /medicina/wocana/Tesis/Secuencias/Harvest/1/1.harvest.scn

Any idea how I can fix it?
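The error message asks for a genome with shorter sequence IDs, so one option is to rename the FASTA records before running any of the tools. The sketch below is a hypothetical helper, not an LTR_retriever feature. It replaces each header with a short serial ID and keeps a table for mapping the results back. The prefix "s" and the function name shorten_ids are arbitrary choices, and the 15-character limit is taken from the log message.

```python
def shorten_ids(lines, prefix="s", limit=15):
    """Return (renamed FASTA lines, {new ID: original header}) using serial IDs."""
    renamed, mapping = [], {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            new_id = f"{prefix}{len(mapping) + 1}"
            if len(new_id) > limit:
                raise ValueError(f"{new_id} is longer than {limit} characters")
            mapping[new_id] = line[1:]  # keep the full original header
            renamed.append(">" + new_id)
        else:
            renamed.append(line)
    return renamed, mapping


# Made-up headers for illustration only.
demo = [
    ">NW_0123456789.1 Genus species scaffold 1",
    "ACGT",
    ">NW_0123456790.1 Genus species scaffold 2",
    "TTTT",
]
renamed, mapping = shorten_ids(demo)
print("\n".join(renamed))
for new_id, old_header in mapping.items():
    print(new_id, old_header, sep="\t")
```

Because the candidate file refers to sequences by ID, LTRharvest would presumably have to be re-run on the renamed genome so that both inputs agree.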

Unable to run LTR_retriever

I am trying to run LTR_retriever on LTR_FINDER results using the following commands:

ltr_finder -D 15000 -d 1000 -L 7000 -l 100 -p 20 -C -M 0.9 /home/icar/Jute_genomes/ltr_finder/LLWS01.1.fa > /home/icar/Jute_genomes/ltr_finder/LTR_retriever_out/jro_524_genome.finder.scn

and then

perl /home/icar/Programs/LTR_retriever-master/LTR_retriever -genome /home/icar/Jute_genomes/ltr_finder/LLWS01.1.fa -infinder /home/icar/Jute_genomes/ltr_finder/LTR_retriever_out/jro_524_genome.finder.scn

but received the following error:

##########################

LTR_retriever v1.8.0

##########################

Contributors: Shujun Ou, Ning Jiang

Please cite: Ou S, Jiang N: LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiology 2018, 176:1410-1422
Parameters: -genome /home/icar/Jute_genomes/ltr_finder/LLWS01.1.fa -infinder /home/icar/Jute_genomes/ltr_finder/LTR_retriever_out/jro_524_genome.finder.scn

Thu May 17 12:56:14 IST 2018 Dependency checking: The RMblast engine is not installed in RepeatMasker!

Although I have configured RepeatMasker and received the following message, I still get the above error when running LTR_retriever:
Add a Search Engine:

  1. CrossMatch: [ Un-configured ]

  2. RMBlast - NCBI Blast with RepeatMasker extensions: [ Configured ]

  3. WUBlast/ABBlast (required by DupMasker): [ Un-configured ]

  4. HMMER3.1 & DFAM: [ Configured, Default ]

  5. Done

Enter Selection: 5
-- Setting perl interpreter...

Congratulations! RepeatMasker is now ready to use.
The program is installed with a minimal repeat library
by default. This library only contains simple, low-complexity,
and common artefact ( contaminate ) sequences. These are
adequate for use with your own custom repeat library. If you
plan to search using common species specific repeats you will
need to obtain the complete RepeatMasker repeat library from
GIRI ( www.giriinst.org ) and install it in /home/icar/Programs/RepeatMasker.

Further documentation on the program may be found here:
/home/icar/Programs/RepeatMasker/repeatmasker.help

Please help me resolve this problem with running LTR_retriever. I am using Ubuntu 14.04 LTS (64-bit).
Thanks

Difference in LAI between quick run and full run

Hi
I am running LAI on large plant genomes (~20 Gbp) with very fragmented assemblies.
I get the following results for the -q run and the standard-mode run.
Complete (standard mode)

Chr	From	To	Intact	Total	raw_LAI	LAI
whole_genome	1	24626904232	0.0037	0.5263	0.70	1.55

Quick

Chr	From	To	Intact	Total	raw_LAI	LAI
whole_genome	1	24626904232	0.0037	0.5263	0.70	6.33

The difference in the final LAI is quite significant, while the raw LAI is the same. Is that normal behavior?

For another run I have the following Warning:

Warning: [blastn] lcl|Query_272220 s0165314:245..410|s0165314:245..410: Warning: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options 
Warning: [blastn] lcl|Query_1541219 s1185854:917..1100|s1185854:917..1100: Warning: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options

Is there something wrong in the sequence?

Thank you in advance for the reply!
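The two tables can be checked with a little arithmetic. The raw_LAI column is consistent with Intact / Total * 100. The published LAI correction, as far as I recall it, is LAI = raw LAI + 2.8138 * (94 - whole-genome LTR identity). Treat that formula and its constants as an assumption to verify against the LAI paper. Under that assumption, the sketch below back-calculates the LTR identity each run must have used, which suggests that the two modes differ only in the identity estimate.

```python
def raw_lai(intact, total):
    """Raw LAI from the Intact and Total columns (both as fractions of the genome)."""
    return intact / total * 100


def implied_identity(lai, raw, slope=2.8138, baseline=94.0):
    """LTR identity (%) implied by LAI = raw + slope * (baseline - identity).

    The slope and baseline are assumed from the LAI publication, not from this thread.
    """
    return baseline - (lai - raw) / slope


# Values copied from the two tables in the post.
raw = raw_lai(0.0037, 0.5263)
print(f"raw_LAI = {raw:.2f}")
for label, lai in (("standard", 1.55), ("quick", 6.33)):
    print(f"{label}: implied LTR identity = {implied_identity(lai, raw):.1f}%")
```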

call_seq_by_list.pl error: substr outside of string

Hello,

I am running LTR retriever, and I am coming across these errors:

Thu Jun 27 16:18:20 PDT 2019    Dependency checking: All passed!
Thu Jun 27 16:19:07 PDT 2019    Start to convert inputs...
substr outside of string at /projects/btl/lcoombe/git/LTR_retriever/bin/call_seq_by_list.pl line 127.
Use of uninitialized value $seq in string eq at /projects/btl/lcoombe/git/LTR_retriever/bin/call_seq_by_list.pl line 128.
substr outside of string at /projects/btl/lcoombe/git/LTR_retriever/bin/call_seq_by_list.pl line 127.
Use of uninitialized value $seq in string eq at /projects/btl/lcoombe/git/LTR_retriever/bin/call_seq_by_list.pl line 128.
substr outside of string at /projects/btl/lcoombe/git/LTR_retriever/bin/call_seq_by_list.pl line 127.

I partitioned my fasta file into 25 parts, and ran my pipeline on each of those parts independently, and I only see this error in 2 partitions.

Here are example commands that lead to the error:

/projects/btl/lcoombe/git/LTR_Finder/source/ltr_finder -D 15000 -d 1000 -L 7000 -l 100 -p 20 -M 0.9 pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa > pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.finder.scn 
gt suffixerator -db pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa -indexname pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.index -tis -suf -lcp -des -ssp -sds -dna
gt ltrharvest -index pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.index -similar 90 -vic 10 -seed 20 -seqids yes \
-minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -motif TGCA \
-motifmis 1 -gff3 pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.harvest.gff > pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.harvest.scn
gt ltrharvest -index pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.index -similar 90 -vic 10 -seed 20 -seqids yes \
-minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -gff3 pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.harvest.nonTGCA.gff > pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.harvest.nonTGCA.scn 
/usr/bin/time -pv LTR_retriever -genome /projects/spruceup_scratch/pengelmannii/Se404-851/annotation/repeat-annotation/custom-library-construction/post-ntEdit/tmp/pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa -infinder /projects/spruceup_scratch/pengelmannii/Se404-851/annotation/repeat-annotation/custom-library-construction/post-ntEdit/tmp/pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.finder.scn -inharvest /projects/spruceup_scratch/pengelmannii/Se404-851/annotation/repeat-annotation/custom-library-construction/post-ntEdit/tmp/pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.harvest.scn -nonTGCA /projects/spruceup_scratch/pengelmannii/Se404-851/annotation/repeat-annotation/custom-library-construction/post-ntEdit/tmp/pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.harvest.nonTGCA.scn -threads 4 -noanno

I also ran the command with LTR finder input only:

Parameters: -genome /projects/spruceup_scratch/pengelmannii/Se404-851/annotation/repeat-annotation/custom-library-construction/post-ntEdit/tmp/pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa -infinder /projects/spruceup_scratch/pengelmannii/Se404-851/annotation/repeat-annotation/custom-library-construction/post-ntEdit/tmp/pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.finder.scn -threads 4 -noanno

but I still get the same errors.

As a little hack, I added some print statements to see what sequence IDs that line was failing at, and one example where the start coordinate was larger than the length was 1124.
Some ranges in pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.retriever.scn.list:

1124:3434244..3441138[1]        1124:3434244..3435259
1124:3434244..3441138[2]        1124:3440131..3441138
1124:3511226..3513429[1]        1124:3511226..3511696
1124:3511226..3513429[2]        1124:3512959..3513429

I took a look at the LTR_FINDER results, and the ranges look fine to me:

[lcoombe@hpce706 tmp]$ grep -A1 "] 1124 " pengelmannii_sealed_ntEdit-scaffolds.1000plus.renamed.seqtk_2.fa.finder.scn 
[1] 1124 Len:2974043
Location : 662584 - 666076 Len: 3493 Strand:+
--
[2] 1124 Len:2974043
Location : 859961 - 865173 Len: 5213 Strand:-
--
[3] 1124 Len:2974043
Location : 952814 - 962235 Len: 9422 Strand:-
--
[4] 1124 Len:2974043
Location : 995557 - 1009639 Len: 14083 Strand:-
--
[5] 1124 Len:2974043
Location : 1061270 - 1066964 Len: 5695 Strand:+
--
[6] 1124 Len:2974043
Location : 2075056 - 2080445 Len: 5390 Strand:+
--
[7] 1124 Len:2974043
Location : 2194945 - 2211924 Len: 16980 Strand:-
--
[8] 1124 Len:2974043
Location : 2235274 - 2241093 Len: 5820 Strand:+
--
[9] 1124 Len:2974043
Location : 2430346 - 2439961 Len: 9616 Strand:-
--
[10] 1124 Len:2974043
Location : 2638949 - 2648557 Len: 9609 Strand:+

Any ideas as to what the issue is? Thank you!
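The numbers quoted above already show the mismatch: the *.retriever.scn.list rows for sequence 1124 reach past 3.4 Mb, while LTR_FINDER reports Len:2974043 for that ID. A minimal sketch for finding all such rows is given below. It assumes that the first column of the list has the form seqid:start..end with an optional suffix such as [1], and the helper names are hypothetical. One possible cause, offered only as a guess, is that the same ID refers to different sequences in different input files.

```python
import re

# First column of a *.scn.list-style row, e.g. "1124:3434244..3441138[1]" (assumed layout).
RANGE = re.compile(r"^(\S+):(\d+)\.\.(\d+)")


def fasta_lengths(lines):
    """Return {sequence ID: length}; the ID is the first word of each header."""
    lengths, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def out_of_bounds(lengths, list_lines):
    """Return (locus, sequence length) for rows that do not fit inside their sequence."""
    bad = []
    for row in list_lines:
        fields = row.split()
        if not fields:
            continue
        match = RANGE.match(fields[0])
        if match is None:
            continue
        seq_id, start, end = match.group(1), int(match.group(2)), int(match.group(3))
        length = lengths.get(seq_id)  # None means the ID is missing from the genome
        if length is None or start < 1 or end > length:
            bad.append((fields[0], length))
    return bad


# Length as reported by LTR_FINDER in the post; the second row is a made-up in-range example.
lengths = {"1124": 2974043}
rows = [
    "1124:3434244..3441138[1]        1124:3434244..3435259",
    "1124:2638949..2648557[1]        1124:2638949..2640000",
]
for locus, length in out_of_bounds(lengths, rows):
    print(f"{locus} does not fit in a sequence of length {length}")
```

For real data, `lengths` would come from `fasta_lengths(open(genome_path))` on the exact file passed to -genome.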

Fatal Error: Failed to open the database file

@mcscimenc I moved your last bug report to this new thread.
Start forwarding:

OK, the program runs for a little while and now gives this error:

ERROR: No such file or directory at /home/joshd/software/LTR_retriever/bin/cleanup.pl line 50.
ERROR: No such file or directory at /home/joshd/software/LTR_retriever/bin/cleanup.pl line 50.

Fatal Error:
Failed to open the database file
Program halted !!

Can't open Salvinia_cucullata_v1.1.fa.LTRlib.clust: No such file or directory.
ERROR: This script is written to convert fasta files into a prettier format.
Usage: fasta-reformat.pl input-fasta-file number-of-positions-per-line
ERROR: No such file or directory at /home/joshd/software/LTR_retriever/bin/annotate_gff.pl line 12.

Salvinia_cucullata_v1.1.fa.LTRlib.clust doesn't exist, and I ran LTR_retriever with -v

Uninitialized value error from get_range.pl

Running with harvest and finder inputs. The errors are produced when the program is executed with just harvest outputs, just finder outputs and when both are run together. The command utilized to run is as follows: LTR_retriever -genome genome.fasta -inharvest harvScreen.out -infinder findScreen.out -threads 16. The command is run using a batch scheduler as per infrastructure rules.

The errors appear as follows:
Use of uninitialized value $seq_ID in exists at ./bin/get_range.pl line 120, line 3.
Use of uninitialized value $seq_ID in exists at ./bin/get_range.pl line 120, line 4.
Use of uninitialized value $seq_ID in exists at ./bin/get_range.pl line 120, line 5.
Use of uninitialized value $seq_ID in exists at ./bin/get_range.pl line 120, line 6.
.........
Use of uninitialized value $seq_ID in exists at ./bin/get_range.pl line 120, line 135241.
Use of uninitialized value $seq_ID in exists at ./bin/get_range.pl line 120, line 135242.
Use of uninitialized value $seq_ID in exists at ./bin/get_range.pl line 120, line 135243.
Use of uninitialized value $seq_ID in exists at ./bin/get_range.pl line 120, line 135244.
Warning: LOC list genome.fasta.retriever.scn.full is empty.
Usage: perl cleanup.pl -f sample.fa [options] > sample.cln.fa
Options:
-misschar n Define the letter representing unknown sequences...
-Nscreen [0|1] Enable (1) or disable (0) the -nc parameter; default: 1
-nc [int] Ambuguous sequence len cutoff; discard the entire....
-nr [0-1] Ambuguous sequence percentage cutoff; discard the entire sequence....
-cleanN [0|1] Retain (0) or remove (1) the -misschar taget in output sequence
-trf [0|1] Enable (1) or disable (0) tandem repeat finder (trf); default: 1
-trf_path path Path to the trf program
cp: cannot stat ‘genome.fasta.retriever.scn.adj’: No such file or directory

I hope that this is sufficient information to assist with this error. All dependent software is up to date, to my knowledge. Also, LTR_retriever is configured to use CDHIT.

Error with -nonTGCA option

Hello,

I'm trying to run LTR_retriever with an LTRharvest output file, where I ran LTRharvest without the -motif TGCA option.

Here's my code:

LTR_retriever -genome ./PEP_scaffolder_300bp.fasta -nonTGCA ./PEP_scaffolder_300bp.ltrharvest.out -threads 25

and here's the error I'm getting:

##########################

LTR_retriever v1.6

##########################

Contributors: Shujun Ou, Ning Jiang

Please cite: S. Ou and N. Jiang (2017) LTR_retriever: a highly accurate and sensitive program for identification of long terminal-repeat retrotransposons. Plant Physiology, pp.01310.2017; DOI: 10.1104/pp.17.01310
Parameters: -genome ./PEP_scaffolder_300bp.fasta -nonTGCA ./PEP_scaffolder_300bp.ltrharvest.out -threads 25

Mon Jan 29 11:26:10 PST 2018 Dependency checking: All passed!
Mon Jan 29 11:26:40 PST 2018 The longest sequence ID in the genome contains 23 characters, which is longer than the limit (15)
Trying to reformat seq IDs...
Attempt 1...
Mon Jan 29 11:26:46 PST 2018 Seq ID conversion successful!

Mon Jan 29 11:26:46 PST 2018 Start to convert inputs...
grep: ./PEP_scaffolder_300bp.fasta.mod.retriever.scn: No such file or directory
Argument "" isn't numeric in numeric gt (>) at /mnt/lfs2/schaack/src/LTR_retriever/LTR_retriever line 327.

ERROR: No candidate is found in the file(s) you specified.

From a cursory glance at your Perl code, it seems that this error only comes up when trying to process the -inharvest file. Can you run LTR_retriever with just a -nonTGCA file for the LTR candidates, or is the -inharvest file required?

Thanks!

Zombie processes are generated when running LTR.identifier.pl

Hi Shujun,

I found that some processes become zombie processes after the program has been running for a long time.

Here is part of the output of the top command:

SN Jul11 0:00 \_ sh -c perl /bin/LTR.identifier.pl final.fasta ...
Sl Jul11 393:51 \_ perl /bin/LTR.identifier.pl final.fasta ...
Z Jul12 0:00 \_ [perl] <defunct>
Z Jul12 0:01 \_ [perl] <defunct>
Z Jul12 0:01 \_ [perl] <defunct>
Z Jul12 0:01 \_ [sh] <defunct>
Z Jul12 0:01 \_ [perl] <defunct>
Z Jul12 0:01 \_ [perl] <defunct>
Z Jul12 0:01 \_ [perl] <defunct>

I ran two different datasets, and both got stuck in this state (Z).

Can you give some advice on how to avoid such problems? Should I use a single thread? Thanks!

Support for MGEScan

Hi Dr. Ou,

According to the manual of LTR_retriever, the input from MGEScan was generated by "a modified version of DAWGPAWS" -- find_ltr_DAWGPAWS.pl. And bin/run_MGEScan.pl also calls for the script:

my $DAWGPAWS_path="/mnt/home/oushujun/git_bin/MGEScan_LTR";

However, it seems that the script is not included in the LTR_retriever package.

I have tried to use the original version of DAWGPAWS myself, but it is quite hard to use. I would be most grateful if you could send me the script.

Thank you very much
Xiaofei

BLAST engine error and uninitialized values

I've been trying to run LTR_retriever on my data with the following command:
LTR_retriever -genome c.fa -inharvest c.harvest.scn
I'm using a freshly cloned version of LTR_retriever (as of 6/4/2017), and ensured that all files and folders have read/write permissions.

While the program runs to completion, several errors show up and no LTRs are found.

The following errors occur multiple times, in an inconsistent manner across separate runs. Running the program multiple times results in these errors popping up a different number of times in each run.

BLAST engine error: NCBI C++ Exception:
    "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_491_130.14.22.10_9052_1363121432/c++/src/algo/blast/api/blast_setup.hpp", line 189: Error: Sequence contains no data

BLAST engine error: Warning: Sequence contains no data

Use of uninitialized value $rLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Use of uninitialized value $lLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Argument "" isn't numeric in addition (+) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Use of uninitialized value $rLTR_start in subtraction (-) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.

As for the program not finding any LTRs in the dataset, that's strange since the data is a genome wherein a significant portion is composed of transposons. I'm not sure if this is because something went wrong or if it's because there really wasn't anything there. I've also run LTR_retriever with additional inputs but the result is the same. Are there any sample data sets to confirm if the program is working properly?

Here's the entire output that the program gives:

##########################
### LTR_retriever v1.1 ###
##########################

Contributors: Shujun Ou, Ning Jiang

Parameters: -genome c.fa -inharvest c.harvest.scn


Tue Jun  6 12:45:12 EDT 2017    The longest sequence ID in the genome contains 117 characters, which is longer than the limit (15)
                                Trying to reformat seq IDs...
                                Attempt 1...
Tue Jun  6 12:45:15 EDT 2017    Seq ID conversion successful!

Tue Jun  6 12:45:15 EDT 2017    Start to convert inputs...
                                Total candidates: 5597
                                Total uniq candidates: 5597

Tue Jun  6 12:45:19 EDT 2017    Start to clean up candidates...
                                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                                Sequences containing tandem repeats will be discarded.

Tue Jun  6 12:49:30 EDT 2017    4654 clean candidates remained

Tue Jun  6 12:49:30 EDT 2017    Start to analyze the structure of candidates...
                                The terminal motif, TSD, boundary, orientation, age, and family will be identified in this step.

BLAST engine error: NCBI C++ Exception:
    "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_491_130.14.22.10_9052_1363121432/c++/src/algo/blast/api/blast_setup.hpp", line 189: Error: Sequence contains no data

BLAST engine error: Warning: Sequence contains no data
BLAST engine error: Warning: Sequence contains no data
BLAST engine error: Warning: Sequence contains no data
Tue Jun  6 13:00:13 EDT 2017    Intact LTR found: 1400

Use of uninitialized value $rLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Use of uninitialized value $lLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Argument "" isn't numeric in addition (+) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Use of uninitialized value $rLTR_start in subtraction (-) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Tue Jun  6 13:00:58 EDT 2017    Start to analyze truncated LTRs...
                                Truncated LTRs without the intact version will be retained in the LTR library.
                                Use -notrunc if you don't want to keep them.

Tue Jun  6 13:00:58 EDT 2017    710 truncated LTRs found
Use of uninitialized value $rLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Use of uninitialized value $lLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Argument "" isn't numeric in addition (+) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Use of uninitialized value $rLTR_start in subtraction (-) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Tue Jun  6 13:04:47 EDT 2017    123 truncated LTR sequences have added to the library

Tue Jun  6 13:04:47 EDT 2017    Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
                                Total library sequences: 1092
sort: multi-character tab ‘$\t’
ERROR: LOC list is empty.
Tue Jun  6 13:10:39 EDT 2017    Retained clean sequence: 0

Tue Jun  6 13:10:39 EDT 2017    Sequence clustering for c.fa.mod.ltrTE ...
ERROR: c.fa.mod.ltrTE is empty, please check the last file
Tue Jun  6 13:10:39 EDT 2017    Unique lib sequence: 0

Use of uninitialized value $rLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Use of uninitialized value $lLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Argument "" isn't numeric in addition (+) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Use of uninitialized value $rLTR_start in subtraction (-) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Tue Jun  6 13:10:40 EDT 2017    No LTR was found in your data.

rm: cannot remove ‘c.fa.mod.cat.gz’: No such file or directory
rm: cannot remove ‘c.fa.mod.LTRlib.clust.clstr’: No such file or directory
rm: cannot remove ‘c.fa.mod.LTRlib’: No such file or directory
rm: cannot remove ‘c.fa.mod.LTRlib.fa.n*’: No such file or directory
rm: cannot remove ‘c.fa.mod.nmtf’: No such file or directory
Tue Jun  6 13:10:40 EDT 2017    All analyses were finished!

##############################
####### Result files #########
##############################

Table output for intact LTRs (detailed info)
        c.fa.mod.pass.list (All LTRs)
        c.fa.mod.nmtf.pass.list (Non-TGCA LTRs)

LTR library
        c.fa.mod.LTRlib.fa (All non-redundant LTRs)
        c.fa.mod.nmtf.LTRlib.fa (Non-TGCA LTRs)

GFF3 file for intact LTRs
        c.fa.mod.pass.list.gff3

Whole-genome LTR annotation (GFF)
        c.fa.mod.LTRanno.gff

soloLTR discovery and reporting

Hi Shujun,

I've completed my LTRharvest run, and I'm close to completing the LTR_FINDER run as well.

In terms of solo LTR discovery and reporting, of all the Perl scripts listed under LTR_retriever/bin/, I can recognize only two Perl scripts that may be relevant: solo_finder.pl, and solo_intact_ratio.pl

From header information inside this script, it looks like solo_finder.pl requires RepeatMasker results as input. Am I correct? Or can either or both of my LTRharvest / LTR_FINDER results be used by solo_finder.pl?

And a slightly different but related question:
Does LTR_retriever report partial-length (i.e. truncated) LTR-RTs that are neither solo LTRs nor full-length LTRs, but something in between?

Thanks!

How to resolve the ambiguity between the whole-genome annotation and the all-LTR-RT files?

Hi Shujun,

When I compare the whole-genome annotation with the list of all LTR-RTs, the two results are inconsistent, which confuses me. I think this is a result of using the unique (non-redundant) library for annotation. Is that right? Is it necessary to remove the entries in the intact LTR-RT file that differ from the annotation file? Examples follow; only 4 of 11 intact LTR-RTs are annotated precisely.

[8 images]

LTR superfamily summary and distribution output

Hi,

My LTR_retriever run produces all of the outputs except the following three.
How can I get these as well? Do I need to specify something in my run?

The output of LTR_retriever includes:
LTR family summary (.out.fam.size.list)
LTR superfamily summary (.out.superfam.size.list)
LTR distribution on each chromosome (.out.LTR.distribution.txt)

cmd used:
/LTR_retriever -genome Bnigra_ref2_genome.fasta -inharvest Bnigra_ref2_genome.harvest.nonTGCA.scn -infinder Bnigra_ref2_genome.LTRfinder.scn -blastclust -L .9 -b T -S 80 -threads 10

Thanks a lot!
Sam

no LAI report

Hi Shujun,
I have a few runs with normal LTR-retriever.scn outputs, but no LAI report.
I noticed there is no 'out' file from RepeatMasker in the folder.
No error message was captured on stderr. Any idea what the problem is?

Below are two examples:

-rw-r--r--  1 root  root                          46 Sep 17 20:04 genomeA__ver100.applied_reference_genome.genome.fasta.des
-rw-r--r--  1 root  root                        492M Sep 17 20:04 genomeA__ver100.applied_reference_genome.genome.fasta.esq
-rw-r--r--  1 root  root                         38M Sep 19 22:56 genomeA__ver100.applied_reference_genome.genome.fasta.finder.scn
-rw-r--r--  1 root  root                        4.4M Sep 17 21:19 genomeA__ver100.applied_reference_genome.genome.fasta.harvest.scn
-rw-r--r--  1 root  root                        2.0G Sep 17 20:43 genomeA__ver100.applied_reference_genome.genome.fasta.lcp
-rw-r--r--  1 root  root                        2.3G Sep 17 20:43 genomeA__ver100.applied_reference_genome.genome.fasta.llv
-rw-r--r--  1 root  root                        450M Sep 19 01:19 genomeA__ver100.applied_reference_genome.genome.fasta.ltrTE.fa
-rw-r--r--  1 root  root                        208K Sep 19 01:36 genomeA__ver100.applied_reference_genome.genome.fasta.ltrTE.fa.cleanup
-rw-r--r--  1 root  root                        415M Sep 19 01:36 genomeA__ver100.applied_reference_genome.genome.fasta.ltrTE.stg1
-rw-r--r--  1 root  root                         363 Sep 17 20:04 genomeA__ver100.applied_reference_genome.genome.fasta.md5
-rw-r--r--  1 root  root                         503 Sep 17 20:43 genomeA__ver100.applied_reference_genome.genome.fasta.prj
-rw-r--r--  1 root  root                        4.4M Sep 17 21:20 genomeA__ver100.applied_reference_genome.genome.fasta.retriever.scn
-rw-r--r--  1 root  root                        2.1M Sep 19 01:36 genomeA__ver100.applied_reference_genome.genome.fasta.retriever.scn.extend
-rw-r--r--  1 root  root                        345M Sep 20 00:21 genomeA__ver100.applied_reference_genome.genome.fasta.retriever.scn.extend.fa
-rw-r--r--  1 root  root                        2.2M Sep 17 21:20 genomeA__ver100.applied_reference_genome.genome.fasta.retriever.scn.full
-rw-r--r--  1 root  root                        4.5M Sep 17 21:20 genomeA__ver100.applied_reference_genome.genome.fasta.retriever.scn.list
-rw-r--r--  1 root  root                          80 Sep 17 20:04 genomeA__ver100.applied_reference_genome.genome.fasta.sds
-rw-r--r--  1 root  root                          48 Sep 17 20:04 genomeA__ver100.applied_reference_genome.genome.fasta.ssp
-rw-r--r--  1 root  root                         16G Sep 17 20:43 genomeA__ver100.applied_reference_genome.genome.fasta.suf
drwxr-xr-x  2 evrpa RstudioUsersGenomeAnalytics 6.0K Sep 17 17:01 1/
-rw-r--r--  1 root  root                         44M Sep 17 21:19 alluniRefprexp082813.15942
-rw-r--r--  1 root  root                         15M Sep 17 21:20 alluniRefprexp082813.15942.phr
-rw-r--r--  1 root  root                        801K Sep 17 21:20 alluniRefprexp082813.15942.pin
-rw-r--r--  1 root  root                         36M Sep 17 21:20 alluniRefprexp082813.15942.psq
-rw-r--r--  1 root  root                           0 Sep 17 19:34 std.err.ltr_finder
-rw-r--r--  1 root  root                           0 Sep 17 20:43 stderr.ltr_harvest
-rw-r--r--  1 root  root                           0 Sep 17 21:19 std.err.ltr_retriever
-rw-r--r--  1 root  root                        1.6M Sep 17 21:19 Tpases020812DNA.15942
-rw-r--r--  1 root  root                        340K Sep 17 21:19 Tpases020812DNA.15942.phr
-rw-r--r--  1 root  root                         19K Sep 17 21:19 Tpases020812DNA.15942.pin
-rw-r--r--  1 root  root                        1.4M Sep 17 21:19 Tpases020812DNA.15942.psq
-rw-r--r--  1 root  root                        2.0M Sep 17 21:19 Tpases020812LINE.15942
-rw-r--r--  1 root  root                        306K Sep 17 21:19 Tpases020812LINE.15942.phr
-rw-r--r--  1 root  root                         19K Sep 17 21:19 Tpases020812LINE.15942.pin
-rw-r--r--  1 root  root                        1.8M Sep 17 21:19 Tpases020812LINE.15942.psq
-rw-r--r--  1 root  root                         44M Sep 18 05:28 alluniRefprexp082813.876134
-rw-r--r--  1 root  root                         15M Sep 18 05:28 alluniRefprexp082813.876134.phr
-rw-r--r--  1 root  root                        801K Sep 18 05:28 alluniRefprexp082813.876134.pin
-rw-r--r--  1 root  root                         36M Sep 18 05:28 alluniRefprexp082813.876134.psq
-rw-r--r--  1 root  root                           0 Sep 17 19:30 std.err.ltr_finder
-rw-r--r--  1 root  root                           0 Sep 18 04:58 stderr.ltr_harvest
-rw-r--r--  1 root  root                           0 Sep 18 05:28 std.err.ltr_retriever
-rw-r--r--  1 root  root                        1.6M Sep 18 05:28 Tpases020812DNA.876134
-rw-r--r--  1 root  root                        340K Sep 18 05:28 Tpases020812DNA.876134.phr
-rw-r--r--  1 root  root                         19K Sep 18 05:28 Tpases020812DNA.876134.pin
-rw-r--r--  1 root  root                        1.4M Sep 18 05:28 Tpases020812DNA.876134.psq
-rw-r--r--  1 root  root                        2.0M Sep 18 05:28 Tpases020812LINE.876134
-rw-r--r--  1 root  root                        306K Sep 18 05:28 Tpases020812LINE.876134.phr
-rw-r--r--  1 root  root                         19K Sep 18 05:28 Tpases020812LINE.876134.pin
-rw-r--r--  1 root  root                        1.8M Sep 18 05:28 Tpases020812LINE.876134.psq
lrwxrwxrwx  1 evrpa RstudioUsersGenomeAnalytics   77 Sep 11 14:59 Zea_mays.AGPv4.dna_sm.toplevel.fa 
-rw-r--r--  1 root  root                         18K Sep 18 04:20 Zea_mays.AGPv4.dna_sm.toplevel.fa.des
-rw-r--r--  1 root  root                        509M Sep 18 04:21 Zea_mays.AGPv4.dna_sm.toplevel.fa.esq
-rw-r--r--  1 root  root                         29M Sep 19 15:23 Zea_mays.AGPv4.dna_sm.toplevel.fa.finder.scn
-rw-r--r--  1 root  root                        4.6M Sep 18 05:28 Zea_mays.AGPv4.dna_sm.toplevel.fa.harvest.scn
-rw-r--r--  1 root  root                        2.0G Sep 18 04:58 Zea_mays.AGPv4.dna_sm.toplevel.fa.lcp
-rw-r--r--  1 root  root                        2.0G Sep 18 04:58 Zea_mays.AGPv4.dna_sm.toplevel.fa.llv
-rw-r--r--  1 root  root                        8.6K Sep 18 04:20 Zea_mays.AGPv4.dna_sm.toplevel.fa.md5
-rw-r--r--  1 root  root                        2.1G Sep 18 05:29 Zea_mays.AGPv4.dna_sm.toplevel.fa.mod
-rw-r--r--  1 root  root                        477M Sep 19 11:57 Zea_mays.AGPv4.dna_sm.toplevel.fa.mod.ltrTE.fa
-rw-r--r--  1 root  root                        146K Sep 19 12:15 Zea_mays.AGPv4.dna_sm.toplevel.fa.mod.ltrTE.fa.cleanup
-rw-r--r--  1 root  root                        451M Sep 19 12:15 Zea_mays.AGPv4.dna_sm.toplevel.fa.mod.ltrTE.stg1
-rw-r--r--  1 root  root                        4.6M Sep 18 05:29 Zea_mays.AGPv4.dna_sm.toplevel.fa.mod.retriever.scn
-rw-r--r--  1 root  root                        2.2M Sep 19 12:15 Zea_mays.AGPv4.dna_sm.toplevel.fa.mod.retriever.scn.extend
-rw-r--r--  1 root  root                        154M Sep 20 00:21 Zea_mays.AGPv4.dna_sm.toplevel.fa.mod.retriever.scn.extend.fa
-rw-r--r--  1 root  root                        2.3M Sep 18 05:29 Zea_mays.AGPv4.dna_sm.toplevel.fa.mod.retriever.scn.full
-rw-r--r--  1 root  root                        4.7M Sep 18 05:29 Zea_mays.AGPv4.dna_sm.toplevel.fa.mod.retriever.scn.list
-rw-r--r--  1 root  root                         502 Sep 18 04:58 Zea_mays.AGPv4.dna_sm.toplevel.fa.prj
-rw-r--r--  1 root  root                        2.1K Sep 18 04:20 Zea_mays.AGPv4.dna_sm.toplevel.fa.sds
-rw-r--r--  1 root  root                        1.1K Sep 18 04:21 Zea_mays.AGPv4.dna_sm.toplevel.fa.ssp
-rw-r--r--  1 root  root                         16G Sep 18 04:58 Zea_mays.AGPv4.dna_sm.toplevel.fa.suf

upload jbrowse with jekyll

Hi
I know how to use Jekyll, but I don't know how to add the JBrowse file to Jekyll and then upload it to GitHub.
Any ideas?
Thanks

Short sequences in LTRlib.fa

Hi Shujun,

I was wondering if it's expected that there are many short sequences in ~.LTRlib.fa? Of the 8424 entries, 81 are 11 bp long, and 542 are less than 100 bp long.

Also, is ~.LTRlib.fa the file that I can use as a custom RepeatMasker library for other related genome assemblies?

Thanks,
Austin
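
For anyone wanting to reproduce this kind of count, the sketch below tabulates which entries of a FASTA library fall under a length cutoff. It is a minimal, hypothetical helper: the function names `read_fasta` and `short_entries` and the 100 bp default are illustrative assumptions, not part of LTR_retriever, and the output says nothing about whether short entries are expected.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def short_entries(lines: Iterable[str], min_len: int = 100) -> Dict[str, int]:
    """Return {header: length} for records shorter than min_len bases."""
    return {h: len(s) for h, s in read_fasta(lines) if len(s) < min_len}


# Usage on a real file (the file name here is a placeholder):
#   with open("genome.LTRlib.fa") as handle:
#       for name, length in short_entries(handle).items():
#           print(length, name, sep="\t")
```

Both functions accept any iterable of lines, so an open file handle can be passed directly without loading the whole library into memory first.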

substr outside of string

Hi,
I've been having some trouble running LTR_retriever on my genome. I get the error noted below hundreds of thousands of times in my log, and no LTRs end up being identified.

Thu Jul 13 17:37:31 PDT 2017 Start to convert inputs...
substr outside of string at /projects/btl/shammond/git/LTR_retriever/bin/call_seq_by_list.pl line 120.
Use of uninitialized value $seq in string eq at /projects/btl/shammond/git/LTR_retriever/bin/call_seq_by_list.pl line 121.

I supplied predictions from LTRharvest, both TGCA and non-TGCA, and LTRfinder.

Also, I don't know if it's related to the above, but I see some unexpected sequences in my nmtf.ltrTE.fa file, such as blank or all-N sequences.

Could there be a problem with my input?
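
One thing that may be worth checking: Perl emits "substr outside of string" when the requested offset lies beyond the end of the string, so a possible (unconfirmed) cause is candidate coordinates that do not fit the sequences in the genome file. The sketch below is a hypothetical sanity check, not part of LTR_retriever. It assumes candidates have already been parsed into `(seq_id, start, end)` tuples with 1-based inclusive coordinates, and that the sequence ID is the first word of each FASTA header.

```python
from typing import Dict, Iterable, List, Tuple

Candidate = Tuple[str, int, int]  # (seq_id, start, end), assumed 1-based inclusive


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA lines into {seq_id: sequence}; seq_id is the first header word."""
    parts: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            parts[current] = []
        elif line and current is not None:
            parts[current].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


def uninformative(seqs: Dict[str, str]) -> List[str]:
    """IDs of records that are empty or consist only of N/n."""
    return [name for name, seq in seqs.items() if not seq.strip("Nn")]


def out_of_range(seqs: Dict[str, str], candidates: Iterable[Candidate]) -> List[Candidate]:
    """Candidates whose interval does not fit inside the named sequence."""
    bad = []
    for seq_id, start, end in candidates:
        length = len(seqs.get(seq_id, ""))  # unknown IDs count as length 0
        if start < 1 or end > length or start > end:
            bad.append((seq_id, start, end))
    return bad
```

If `out_of_range` flags many candidates, the predictions may have been generated from a different version of the genome file, or the sequence names may not match between the two.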

How to use a TE library from a different accession as input to extract TEs?

Hi,
I found this tool really interesting and I would like to use it for my research.
I have a TE library generated from the canola_A genotype in FASTA and GFF format, and I would like to use the same library to retrieve TEs from the Canola_B accession (a different genotype).
How can I use this canola_A TE library as input to extract all the TE members from Canola_B (I expect more copies in Canola_B)?

I believe that your suggestions will be really helpful.

Thanks & regards
sam

LTR_retriever aborts, Illegal character error.

Hi! I'm excited to use LTR_retriever. I ran LTR_retriever using the following call:

LTR_retriever \
	-genome Salvinia_cucullata_v1.1.fa \
	-inharvest LTRHarvest.out \
	-linelib LINEs.viridiplantae.fa \
	-dnalib DNA_TEs.viridiplantae.fa \
	-TEhmm Dfam.hmm \
	-threads 40 \
	1>LTR_retriever.out \
	2>LTR_retriever.err

It ran for a little while, then aborted and reported the errors below. There is no character E in the input FASTA given to -genome, and the file *.scn.extend.fa.aa seems to have been removed by LTR_retriever. What could the problem be?

Parse failed (sequence file Salvinia_cucullata_v1.1.fa.retriever.scn.extend.fa.aa):
Line 2: illegal character E

Attempt to free unreferenced scalar: SV 0x6e2310, Perl interpreter: 0x6de7e0.
Use of uninitialized value $list[0] in pattern match (m//) at /home/joshd/software/LTR_retriever/bin/call_seq_by_list.pl line 76.
Use of uninitialized value in split at /home/joshd/software/LTR_retriever/bin/call_seq_by_list.pl line 79.
Use of uninitialized value in pattern match (m//) at /home/joshd/software/LTR_retriever/bin/call_seq_by_list.pl line 79.
Use of uninitialized value $chr_pre in hash element at /home/joshd/software/LTR_retriever/bin/call_seq_by_list.pl line 81.
Use of uninitialized value within %genome in length at /home/joshd/software/LTR_retriever/bin/call_seq_by_list.pl line 81.
Use of uninitialized value $list[0] in pattern match (m//) at /home/joshd/software/LTR_retriever/bin/call_seq_by_list.pl line 76.
Use of uninitialized value in split at /home/joshd/software/LTR_retriever/bin/call_seq_by_list.pl line 79.
Use of uninitialized value in pattern match (m//) at /home/joshd/software/LTR_retriever/bin/call_seq_by_list.pl line 79.
Use of uninitialized value $chr_pre in hash element at /home/joshd/software/LTR_retriever/bin/call_seq_by_list.pl line 81.
Use of uninitialized value within %genome in length at /home/joshd/software/LTR_retriever/bin/call_seq_by_list.pl line 81.
No such file or directory at /home/joshd/software/LTR_retriever/bin/cleanup.pl line 50.
LOC list is empty.
Warning: [blastx] Query is Empty!
LOC list is empty.
No such file or directory at /home/joshd/software/LTR_retriever/bin/cleanup.pl line 50.

#usage: $ perl output_by_list.pl DB_index_pos database LS_index_pos LIST [Exclusive]* [MSU_format] [FASTA_format] [version]> outfile
        * [] parameters are optional. 
                [Exclusive] -ex means exclude the entries in list, default is output the entries in list. 
                [MSU_format] -MSU0 means MSU_LOC occurs in the list file, while -MSU1 means MSU_LOC occurs in the database file
                eg. perl output_by_list.pl 1 Chr1.ltrTE.RMlist 1 Chr1.ltrTE.true.list -MSU0 -FA > Chr1.ltrTE.true.RMlist
rm: cannot remove `Salvinia_cucullata_v1.1.fa.cat.gz': No such file or directory
rm: cannot remove `Salvinia_cucullata_v1.1.fa.LTRlib.fa.n*': No such file or directory
rm: cannot remove `Salvinia_cucullata_v1.1.fa.nmtf': No such file or directory
perl annotate_gff.pl lib.fa gff > anno.gff

LTRharvest as input

Hi Shujun,
I notice the updated README file now suggests using "LTRharvest" output as the input file.
Is there any reason LTR_finder is no longer preferred?
Are results based on LTRharvest alone robust enough?

Fail to locate the paths file

Greetings,

I am interested in trying the LTR_retriever strategy to validate hits I obtained from an LTRharvest dataset. I downloaded the files associated with LTR_retriever and modified the "paths" file appropriately, as below:

BLAST+=/Users/abcd/ncbi-blast-2.2.30+/bin/
RepeatMasker=/Users/abcd/RepeatMasker/
HMMER=/Users/abcd/hmmer-3.1b2-macosx-intel/binaries/
CDHIT=/Users/abcd/cd-hit-v4.6.8-2017-1208/

I saved the modified paths file with the vi text editor. However, when I tried the command:
perl LTR_retriever -genome ToxoDB35ME49Genome.fa -inharvest ToxoDB35.harvest.scn

I obtained the error:
Fail to locate the paths file!

The paths file was present in my working directory (LTR_retriever), and is not located in a subdirectory. Any idea what the problem may be? Any suggestions would be greatly appreciated!
Josh

call_seq_by_list.pl uninitialized $chr_pre? It will cause a negative start or end.

Hi oushujun,
I used LTR_retriever to find LTRs based on LTR_finder and LTRharvest results, but I got the error:
call_seq_by_list.pl uninitialized $chr_pre
So I looked at line 84 and found the following:
my $chr_pre=$1 if (split /\s+/, $list[0])[1]=~/(.*):-?[0-9]+\.\.[0-9]+$/;
I analyzed the code. The error may be caused by a scaffold that is too short: when the program extends the sequence forward or backward, a negative start or end can occur.
I tried changing the code to
my $chr_pre=$1 if (split /\s+/, $list[0])[1]=~/(.*):-?[0-9]+\.\.-?[0-9]+$/;
to make it match a negative end, but that failed and I encountered other errors.
Can you help me?
How can I fix this error?
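
To make the difference between the two patterns concrete, here is a small Python transcription of both regular expressions (the regex syntax used is the same in Perl and Python). The location strings in the usage note are made-up examples in the `name:start..end` form that the patterns imply, and `clamp` is a hypothetical helper showing one alternative idea, clipping extended coordinates to the scaffold instead of accepting negative values. This is only a sketch of the matching behaviour, not a tested patch for call_seq_by_list.pl.

```python
import re

# The two patterns quoted above, transcribed unchanged.
ORIGINAL = re.compile(r"(.*):-?[0-9]+\.\.[0-9]+$")    # allows a negative start only
PATCHED = re.compile(r"(.*):-?[0-9]+\.\.-?[0-9]+$")   # also allows a negative end


def chr_prefix(pattern, location):
    """Return the captured sequence name, or None when the pattern does not match.

    None here corresponds to $chr_pre staying undefined in the Perl script.
    """
    match = pattern.search(location)
    return match.group(1) if match else None


def clamp(start, end, seq_len):
    """Clip a 1-based inclusive interval to [1, seq_len] (hypothetical helper)."""
    return max(1, start), min(seq_len, end)
```

With these definitions, `chr_prefix(ORIGINAL, "scaffold12:120..-7")` returns `None`, while `chr_prefix(PATCHED, "scaffold12:120..-7")` returns `"scaffold12"`. Making the pattern match does not by itself make a negative coordinate valid for sequence extraction, which may be why other errors appeared after the change.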

LTR insertion time

Hi, this is very good software. I noticed that it reports the insertion time. Could you please explain what algorithm you use to estimate the insertion time, and what the unit of the insertion time column is? Is it years?
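
For background, a widely used way to date LTR-RT insertions is T = K / (2μ), where K is the divergence between the two LTRs of one element after Jukes-Cantor correction and μ is the substitution rate per site per year, which gives T in years. The sketch below illustrates that general formula only. Whether LTR_retriever computes its column exactly this way, and which rate it uses, should be checked against the README. The default `mu` here is an assumption (a rate often cited for rice) and the function names are hypothetical.

```python
import math


def jukes_cantor_distance(identity: float) -> float:
    """Jukes-Cantor corrected divergence K from the LTR-LTR sequence identity (0..1)."""
    d = 1.0 - identity  # raw proportion of differing sites
    if d >= 0.75:
        raise ValueError("Jukes-Cantor correction is undefined for d >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


def insertion_time_years(identity: float, mu: float = 1.3e-8) -> float:
    """Estimate T = K / (2 * mu) in years.

    mu is the substitution rate per site per year; 1.3e-8 is an assumed default.
    """
    return jukes_cantor_distance(identity) / (2.0 * mu)
```

Under these assumptions, two LTRs that are 98% identical give an estimate of roughly 0.78 million years. The result scales inversely with `mu`, so a species-appropriate rate matters more than the correction term.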

invertebrate support

Hi,

I was wondering if LTR_retriever supports invertebrate genomes. We have an amphioxus genome derived from 60X Pacbio sequencing, however, it shows the LAI score is only 7.07. Moreover, all of the 206 LTRs in LTRlib.fa were classified as 'Unknown'. Does this look normal to you?

Thank you!

Luohao
