tobybaril / earlgrey

Earl Grey: A fully automated TE curation and annotation pipeline

License: Other

Shell 19.90% Perl 3.66% Python 51.60% R 21.95% Dockerfile 2.88%

bioinformatics genomics transposable-elements genome-annotation genome-analysis te-annotations

earlgrey's Issues

Problem with cdhit

I have been trying to run earlGrey with the clustering option, but CD-HIT fails and does not generate an output file. Below is the error, which shows it is unable to open a file (the output file, as far as I can tell).

Program: CD-HIT, V4.8.1 (+OpenMP), Feb 22 2022, 21:26:56
Command: cd-hit-est -d 0 -aS 0.8 -c 0.8 -G 0 -g 1 -b 500 -r 1
-i
/projects/earlgrey/cgun_EarlGrey/cgun_strainer/TS_cgun-families.fa_5528/cgun-families.fa.strained
-o
/projects/earlgrey/cgun_EarlGrey/cgun_strainer/TS*/cgun-families.fa.strained.clstrd.fa

================================================================
Output

total seq: 53
longest and shortest : 11046 and 48
Total letters: 98615
Sequences have been sorted

Approximated minimal memory consumption:
Sequence : 0M
Buffer : 1 X 56M = 56M
Table : 1 X 16M = 16M
Miscellaneous : 4M
Total : 77M

Table limit with the given memory limit:
Max number of representatives: 230040
Max number of word counting entries: 90263642

comparing sequences from 0 to 53

   53  finished         37  clusters

Approximated maximum memory consumption: 78M
writing new database

Fatal Error:
file opening failed
Program halted !!

If I try the same command with the full path for TS_cgun-families.fa_5528 instead of TS*, it works fine and creates the output file.

cd-hit-est -d 0 -aS 0.8 -c 0.8 -G 0 -g 1 -b 500 -r 1
-i
/projects/earlgrey/cgun_EarlGrey/cgun_strainer/TS_cgun-families.fa_5528/cgun-families.fa.strained
-o
/projects/earlgrey/cgun_EarlGrey/cgun_strainer/TS_cgun-families.fa_5528/cgun-families.fa.strained.clstrd.fa

Please let me know how it can be fixed so it doesn't fail while running the whole earlGrey pipeline.

Thanks
Bushra
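
One possible explanation, offered as an assumption rather than a confirmed diagnosis: a shell glob such as TS* only expands to paths that already exist. The output file has not been created yet, so the pattern in the -o argument matches nothing and is passed to cd-hit-est literally, which would be consistent with "file opening failed". A minimal Python sketch of resolving the directory first and only then building the output path (the helper name resolve_output_path is hypothetical and not part of earlGrey):

```python
import glob
import os


def resolve_output_path(parent_dir, dir_pattern, filename):
    """Expand dir_pattern inside parent_dir and build a path in the single matching directory.

    Hypothetical helper: raises if the pattern matches zero or several directories,
    instead of silently passing an unexpanded wildcard to another program.
    """
    candidates = glob.glob(os.path.join(parent_dir, dir_pattern))
    dirs = sorted(p for p in candidates if os.path.isdir(p))
    if len(dirs) != 1:
        raise ValueError(
            f"expected exactly one directory matching {dir_pattern!r}, found {len(dirs)}"
        )
    return os.path.join(dirs[0], filename)
```

The resolved path can then be passed as the -o argument, so the output location no longer depends on the wildcard matching a file that does not exist yet.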

No RC / Helitron motifs?

I've been using EarlGrey on several plant genomes and noticed an abundance of RC / Helitron annotations around a gene class of interest. However, on closer inspection of these annotations, I'm not seeing the expected 5' TC or 3' CTRR motifs associated with this class. I looked at other RC / Helitron annotations in the genomes and see the same thing.

Could this be due to how TE borders are defined in EarlGrey? Alternatively, is this due to how TEs are classified in general? I'd imagine that if they are classified solely by similarity to curated sequences in, e.g., Dfam, they'll frequently be misannotated. I'm not sure how this would compare with methods such as HelitronScanner, which use the 5' and 3' motif sequences for annotation.

Happy to share data if needed!
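
For anyone wanting to check this systematically, here is a minimal sketch of testing an extracted annotation sequence for the two termini mentioned above (5' TC and 3' CTRR, where R is A or G in IUPAC notation). The function name is hypothetical, and the sketch assumes the sequence has already been extracted in the orientation of the element:

```python
import re

# 5' end: the sequence starts with TC.
FIVE_PRIME = re.compile(r"^TC")
# 3' end: the sequence ends with CTRR, with R = A or G.
THREE_PRIME = re.compile(r"CT[AG][AG]$")


def helitron_termini(seq):
    """Return (has_5prime_TC, has_3prime_CTRR) for a sequence given 5'->3'."""
    s = seq.upper()
    return bool(FIVE_PRIME.search(s)), bool(THREE_PRIME.search(s))
```

Annotations on the minus strand would need to be reverse-complemented before the check, and imprecise annotation borders of even a few bases would make both tests fail, so a windowed search near each end may be more informative than an exact match.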

error with test file after mamba Installation

Hello, I am having a problem with testing the tool installation with the NC_045808_EarlWorkshop.fasta test file.

I am attaching the log file. Could you please help me solve it?

testEarlGrey.log

Thank you in advance

conda found conflicts

Dear @TobyBaril

I'm trying to install EarlGrey, which will be super useful for my thesis, but conda reports an error during the ./configure step.

Checking RepeatMasker and RepeatModeler configuration
Success! RepeatMasker is installed and in PATH
Success! RepeatModeler is installed and in PATH
Collecting package metadata (repodata.json): done
Solving environment: \ 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.

I've installed RepeatMasker and RepeatModeler following the "with sudo" instructions. My OS is Ubuntu 18.08, with miniconda3 v23.3.1.

I tried a few times, and it gets stuck at this step every time.

My condarc:

auto_activate_base: false
report_errors: false
channels:
  - defaults
channel_priority: strict

I have a reasonable internet connection, a lot of free space, and my conda is working properly for other environments. Do you have any idea how to solve this problem?

Thank you in advance!

EarlGrey fails on large genome

I have been trying to run EarlGrey on my plant genome. It's a chromosome-level assembly of 332 Mb. The FASTA file consists of 42 sequences, the largest chromosome being 65 Mb and the smallest sequence 49,280 bp. All sequences have simple headers, for example: >DRSE_pseudomolecule_1.

When I run this I get an error (full log below). It seems to be an issue with the faswap.py script or with makeblastdb: as soon as it gets to the second sequence in the FASTA file it gives a KeyError. The makeblastdb error might be an issue with RepeatModeler (see discussion here). As I said in issue #62, it works fine with the demo file from gitpod, so I don't know if the issue is in my FASTA file, the size of the genome, or something else. The .prep and .prep.bak files are empty, so I think it's an issue with creating these inputs. I hope you can help solve this.


    
              )  (
	     (   ) )
	     ) ( (
	   _______)_
	.-'---------|  
       ( C|/\/\/\/\/|
	'-./\/\/\/\/|
	 '_________'
	  '-------'
	<<< Cleaning Genome >>>
grep: /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_9CHR_PGA_assembly_renamed_49k.fasta: binary file matches
Traceback (most recent call last):
  File "/shared/biology/bioldata1/bl-bl1067/Cobus/python/envs/earlgrey_env/share/earlgrey-3.2-0/scripts//faswap.py", line 13, in <module>
    lines=map(lambda x: ">"+a[x[1:]] if x and x[0]==">" else x,lines);print("\n".join(lines))
  File "/shared/biology/bioldata1/bl-bl1067/Cobus/python/envs/earlgrey_env/share/earlgrey-3.2-0/scripts//faswap.py", line 13, in <lambda>
    lines=map(lambda x: ">"+a[x[1:]] if x and x[0]==">" else x,lines);print("\n".join(lines))
KeyError: 'DRSE_pseudomolecule_2'
    
              )  (
	     (   ) )
	     ) ( (
	   _______)_
	.-'---------|  
       ( C|/\/\/\/\/|
	'-./\/\/\/\/|
	 '_________'
	  '-------'
	<<< Detecting Novel Repeats >>>
Building database DRSE_TEanno:
  Reading /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_9CHR_PGA_assembly_renamed_49k.fasta.prep...
Died at /shared/biology/bioldata1/bl-bl1067/Cobus/python/envs/earlgrey_env/bin/BuildDatabase line 331.
The makeblastdb program exited with code 1.  Please check your input file(s) for potential formating errors.
/shared/biology/bioldata1/bl-bl1067/Cobus/python/envs/earlgrey_env/bin/makeblastdb returned: 

Building a new DB, current time: 11/07/2023 10:00:28
New DB name:   /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno
New DB title:  ./ef6ns6B8Mh
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
BLAST options error: File ./ef6ns6B8Mh is empty

The command used was: /shared/biology/bioldata1/bl-bl1067/Cobus/python/envs/earlgrey_env/bin/makeblastdb -blastdb_version 4 -out DRSE_TEanno -parse_seqids -dbtype nucl -in ./ef6ns6B8Mh 2>&1
BLAST Database error: No alias or index file found for nucleotide database [/shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno] in search path [/shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_RepeatModeler::]
RepeatModeler Version 2.0.4
===========================
Using output directory = /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_RepeatModeler/RM_185498.TueNov71000292023
Search Engine = rmblast 2.14.1+
Threads = 20
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1699351229
Database = /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno 
Database /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno does not contain any sequences!
ERROR: RepeatModeler Failed, Retrying with limit set as Round 5
BLAST Database error: No alias or index file found for nucleotide database [/shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno] in search path [/shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_RepeatModeler::]
RepeatModeler Version 2.0.4
===========================
Using output directory = /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_RepeatModeler/RM_185506.TueNov71000292023
Search Engine = rmblast 2.14.1+
Threads = 20
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1699351229
Database = /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno 
Database /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno does not contain any sequences!
ERROR: RepeatModeler Failed, Retrying with limit set as Round 4
BLAST Database error: No alias or index file found for nucleotide database [/shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno] in search path [/shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_RepeatModeler::]
RepeatModeler Version 2.0.4
===========================
Using output directory = /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_RepeatModeler/RM_185514.TueNov71000302023
Search Engine = rmblast 2.14.1+
Threads = 20
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1699351230
Database = /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno 
Database /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno does not contain any sequences!
ERROR: RepeatModeler Failed
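
The "grep: ... binary file matches" line at the start of this log suggests that grep treated the FASTA as binary, which typically happens when a file contains NUL bytes or bytes that are not valid in the current locale's encoding. If header extraction relies on that grep output, the header lookup would be incomplete, which would be consistent with the KeyError in faswap.py. This is an assumption about the cause, not a confirmed diagnosis. A small sketch for locating such bytes (the function name is hypothetical):

```python
def find_non_ascii_lines(path, limit=10):
    """Return up to `limit` (line_number, offending_bytes) pairs.

    A line is reported if it contains a NUL byte or any byte above 127,
    either of which can make grep classify a file as binary.
    """
    hits = []
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            bad = bytes(b for b in line if b == 0 or b > 127)
            if bad:
                hits.append((number, bad))
                if len(hits) >= limit:
                    break
    return hits
```

An empty result would rule this explanation out; any hits give the line numbers to inspect in the FASTA file.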

Soft masking option

Hi Toby,

Is it possible to also get a soft-masked version of the masked file after the final RepeatMasker step, without needing to run it again or losing the N-masked version?
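
As a possible workaround that avoids re-running the masking, a soft-masked sequence can be derived from the original sequence and its N-masked counterpart. The sketch below is an illustration under the assumption that both sequences have identical length and coordinates; the function name is hypothetical and this is not an earlGrey feature:

```python
def soft_mask(original, hard_masked):
    """Lower-case every base of `original` that was replaced by N in `hard_masked`.

    Positions that are already N in the original (assembly gaps) are left as they are.
    """
    if len(original) != len(hard_masked):
        raise ValueError("sequences must have the same length")
    out = []
    for base, masked in zip(original, hard_masked):
        if masked in "Nn" and base not in "Nn":
            out.append(base.lower())  # masked repeat: keep the base, lower-cased
        else:
            out.append(base)          # unmasked base or original gap: unchanged
    return "".join(out)
```

Applied per sequence record, this keeps the hard-masked file untouched and writes the soft-masked one alongside it. RepeatMasker itself can also emit lower-case masking directly via its -xsmall option, but that requires another run.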

Error in repeat merging, missing '[...].rmerge.gff.filtered'?

Hello @TobyBaril,

I (excitedly!) tried running the earlGrey version from bioconda.
(Specifically, I'm using charliecloud to run the Docker biocontainer hosted here, built automatically from bioconda; I assume this is identical to running inside a conda env.)

I ran EarlGrey as earlGrey -s otipulae -o . -g genome.fasta -t 24; the genome.fasta is available here.

And I got the error log below, upon several attempts:

(Mem limit: 16GB, on slurm cluster, submitted from a nextflow workflow)

Error in fread(file) :
  File '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.rmerge.gff.filtered' does not exist or is non-readable. getwd()=='/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats'
Execution halted
cp: can't stat '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.filteredRepeats.bed': No such file or directory
Traceback (most recent call last):
  File "/usr/local/share/earlgrey-3.1-0/scripts//backSwap.py", line 14, in <module>
    table = pd.read_csv(input, names = ['scaf', 'start', 'end', 'repeat', 'score', 'strand'], delim_whitespace = True, header = None)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
    self.handles = get_handle(
  File "/usr/local/lib/python3.8/site-packages/pandas/io/common.py", line 859, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.filteredRepeats.bed'
mv: can't rename '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.filteredRepeats.bed.2': No such file or directory
Traceback (most recent call last):
  File "/usr/local/share/earlgrey-3.1-0/scripts//backSwapGFF.py", line 14, in <module>
    table = pd.read_csv(input, names = ['scaf', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'], delim_whitespace = True, header = None)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
    self.handles = get_handle(
  File "/usr/local/lib/python3.8/site-packages/pandas/io/common.py", line 859, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.rmerge.gff.filtered'
mv: can't rename '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.rmerge.gff.filtered.2': No such file or directory
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning message:
Failed to locate timezone database
[1] "/usr/local/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/usr/local/share/earlgrey-3.1-0/scripts//makeGff.R"
[5] "--args"
[6] "/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.filteredRepeats.bed"
[7] "/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.rmerge.gff.filtered"
[8] "/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.filteredRepeats.gff"
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.filteredRepeats.bed': No such file or directory
Execution halted

              )  (
             (   ) )
             ) ( (
           _______)_
        .-'---------|
       ( C|/\/\/\/\/|
        '-./\/\/\/\/|
         '_________'
          '-------'
        <<< Done! >>>
ERROR: strict merge also failed, check /scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_RepeatMasker_Against_Custom_Library/genome.fasta.prep.out looks as expected

Do you know what this is due to?

Here's a link to the full output log file, if useful: https://drive.google.com/file/d/1drbTh1HU2Nz793zklukcfBVHxrl0abae/view?usp=sharing

Thanks and best

R Package Incompatibilities

Hello!

I see in earlGrey.yml that the version of R used by earlGrey is 4.1.
However, install_r_packages.R installs version 3.15 of Bioconductor in my setup, which requires version 4.2 of R.
This leads to the last line of install_r_packages.R failing with the message:

Error: Bioconductor version '3.15' requires R version '4.2'; use `version = '3.14'`

The following errors/warnings also show up down the line when running earlGrey:

Trimming and sorting based on mreps, TRF, SA-SSR
Warning messages:
1: package ‘plyranges’ was built under R version 4.2.1 
2: package ‘BiocGenerics’ was built under R version 4.2.1 
3: package ‘IRanges’ was built under R version 4.2.1 
4: package ‘S4Vectors’ was built under R version 4.2.1 
5: package ‘GenomicRanges’ was built under R version 4.2.1 
6: package ‘GenomeInfoDb’ was built under R version 4.2.1 
Warning messages:
1: package ‘BSgenome’ was built under R version 4.2.1 
2: package ‘Biostrings’ was built under R version 4.2.1 
3: package ‘XVector’ was built under R version 4.2.1 
4: package ‘rtracklayer’ was built under R version 4.2.1 
5: no function found corresponding to methods exports from ‘BSgenome’ for: ‘releaseName’ 
Error in ..Internal(is.unsorted(x, FALSE, FALSE)) : 
  3 arguments passed to .Internal(is.unsorted) which requires 2
Calls: %>% ... <Anonymous> -> .splitAsList_by_integer_Rle -> <Anonymous>
Execution halted
Removing temporary files
Reclassifying repeats

I could downgrade Bioconductor to 3.14, but R warns me that I would be downgrading 828 packages.
What are the intended versions earlGrey should be working with here? Not sure if I am doing something wrong or if there is some issue with package repositories having been updated.

How to appropriately label or mention the figure of "summaryFiles/Repeat Landscape"?

Hello, everyone,

Thank you for your assistance in resolving my previous issue (#45)!

I would now like to incorporate this figure into both my project process PowerPoint and my upcoming scientific paper. However, I'm uncertain about how to appropriately label it. Could you please suggest a suitable name for the figure?

To be completely honest, I'm not fully comprehending the content of this figure as it falls outside my area of expertise. Could it possibly be a TE distribution? I've come across similar terms like "Transposable element (TE) sequence distribution based on Kimura distance." However, I believe it would be best to seek your guidance and expertise on this matter.

I would greatly appreciate it if you could provide your input as soon as possible. Thank you!

Best regards,
Zoe
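
For context on the term "Kimura distance": repeat landscapes of this kind usually plot the amount of genome occupied by TE copies against their divergence from the family consensus, with divergence expressed as a Kimura 2-parameter (K2P) distance. Whether Earl Grey's figure uses exactly this measure is an assumption here. The K2P distance is d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions among compared sites. A minimal sketch for two aligned sequences (function name hypothetical):

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Low distances correspond to copies that are nearly identical to their consensus (typically recent insertions), and high distances to older, more degraded copies. The formula is undefined for very divergent pairs, where the logarithm's argument is no longer positive.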

How to generate a more detailed summary?

Hello, EarlGrey's single command has made genome annotation much more convenient for us. I previously struggled to run RepeatMasker and RepeatModeler myself, and they produced a summary table. How can I get a similar table from EarlGrey's results for comparison?

Thanks, and best wishes!
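
One way to approximate such a table is to aggregate an annotation BED file per repeat label. The sketch below assumes a whitespace-delimited file with the six columns named in the backSwap.py traceback elsewhere on this page (scaf, start, end, repeat, score, strand), and assumes the repeat column holds a classification label; the function name and the genome_size argument are hypothetical:

```python
from collections import defaultdict


def summarise_bed(path, genome_size):
    """Return {label: (copy_count, total_bp, percent_of_genome)} from a BED-like file.

    Overlapping intervals are counted once each, so totals can exceed the true
    covered length if the annotations overlap.
    """
    counts = defaultdict(int)
    bases = defaultdict(int)
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 4 or line.startswith("#"):
                continue  # skip blank, short, or comment lines
            label = fields[3]
            counts[label] += 1
            bases[label] += int(fields[2]) - int(fields[1])
    return {
        label: (counts[label], bases[label], 100.0 * bases[label] / genome_size)
        for label in sorted(bases)
    }
```

The genome size has to be supplied separately (for example, the total length of the input FASTA), since it is not recorded in the BED file.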

Adding EarlGrey to bioconda

Hi,
Thanks again for the interesting workshop at BGA23! As promised, I am looking into writing a bioconda recipe for this: https://github.com/dirkjanvw/bioconda-recipes/blob/add_earlgrey_v3.0/recipes/earlgrey It is certainly not finished yet, but I am wondering whether you could point me to a reasonably small FASTA file (on ENA or NCBI) that I can use to test the bioconda recipe and installation locally. That way I can make sure everything in the recipe is correct before I submit it as a PR to bioconda-recipes.

Also, should you have any feedback or questions about the recipe, please let me know!

TE strainer gets stuck?

I've managed to run ~20 genomes with earlGrey and everything has been running pretty smoothly (either newly assembled or downloaded from NCBI). So the tool has been awesome.

But I've noticed something strange happening due to a combination of RepeatModeler, TE strainer, and potentially Darwin Tree of Life (DToL) genomes ...

It has so far happened with 2 genomes (currently running more).

Bombus terrestris FASTA from the DToL project (GCA_910591885.2), downloaded from here: https://portal.darwintreeoflife.org/data/root/details/Bombus%20terrestris

and
Anoplius nigerrimus, downloaded from here:
https://portal.darwintreeoflife.org/data/root/details/Anoplius%20nigerrimus

Basically, TE strainer gets stuck blasting a particular sequence. I've re-run RepeatModeler, re-run the pipeline using a freshly downloaded FASTA, and checked the genome FASTA headers to see if they were problematic. Every run gets stuck and TE strainer goes on indefinitely (the longest TE strainer run so far has been ~14 h with 44 threads, before I killed it).

The earlGrey logs so far (on my 3rd attempt), genome fasta, the repeatmodeler sequence that it is getting stuck on are all available here: https://www.dropbox.com/sh/co44hhg0l3mh890/AACEpm6uMkwd0OmLvM7QVfMPa?dl=0

I did a quick scan of the FASTA, and for Bombus the only thing I gleaned is that the TE might be at the end of a contig?

Just curious if anything pops out to you as a problem, or if you've run into this issue before ...

cheers.

RepeatModeler error

Hi,

I am running earlGrey on Drosophila simulans using the reference genome from NCBI: https://ftp.ncbi.nih.gov/genomes/refseq/invertebrate/Drosophila_simulans/representative/GCF_016746395.2_Prin_Dsim_3.1/GCF_016746395.2_Prin_Dsim_3.1_genomic.fna.gz

But I got the following error:

.......
Processing RECON family: 723
  - Saving 16 elements
  - Refining family-723 model...
Family Refinement: 00:22:04 (hh:mm:ss) Elapsed Time
Round Time: 02:52:44 (hh:mm:ss) Elapsed Time
   - Increasing sample size to include end piece now = 250032285


RepeatModeler Round # 6
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 243000000 bp
FastaDB::compact - Error could not locate file /data/home/xxx/dsim/repeatAnnotation/DsimEarlGrey/Dsim_RepeatModeler/RM_131689.ThuNov110910292021/round-6/sampleDB-6.fa!
 at /data/home/xxx/RepeatModeler-2.0.2a/RepeatModeler line 862.
ERROR: RepeatModeler Failed

The error is likely caused by a small/fragmented reference genome, according to Dfam-consortium/RepeatModeler#111. I tried some other reference genomes and did not see this error.

Could this be fixed in earlGrey, or could you suggest a solution?

Thanks

Yiguan
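
To check whether an assembly is small or fragmented before running the pipeline, basic statistics can be computed directly from the FASTA file. This is a minimal sketch with a hypothetical function name; it says nothing about what thresholds RepeatModeler actually requires:

```python
def assembly_stats(path):
    """Return the number of sequences, total length, and N50 of a FASTA file."""
    lengths = []
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)  # close the previous record
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)  # close the last record

    total = sum(lengths)
    running = 0
    n50 = 0
    # N50: length of the sequence at which the cumulative sum of lengths,
    # taken from longest to shortest, first reaches half the total.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {"sequences": len(lengths), "total_bp": total, "n50": n50}
```

Comparing these numbers between the failing genome and genomes that ran successfully may help confirm or rule out fragmentation as the cause.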

No repetitive sequences detected error

I am trying to run EarlGrey on a plant genome, but I am getting an error:

No repetitive sequences were detected in /scratch/botany/katie/assembled_genomes/assemblies/sandwicensis/working/sandwicensis.fasta.prep
ERROR: RepeatMasker failed, please check logs. This is likely because of an invalid species search term, if issue persists please use NCBI Taxids (E.G Drosophila is replaced with 7125)

I have used the following command within a slurm file which was submitted to the cluster:

species="sandwicensis"
earlGrey -g /scratch/botany/katie/assembled_genomes/assemblies/$species/working/$species.fasta -s $species -o /scratch/botany/katie/maker/earlgrey/$species -r arabidopsis -t 32

I have tried replacing the value of the -r option with the taxid of Arabidopsis, but this does not seem to work either.
When I try to run RepeatMasker alone on the .fasta.prep file created by EarlGrey, it seems to run fine, using the following command:

species="sandwicensis"
RepeatMasker -species arabidopsis /scratch/botany/katie/assembled_genomes/assemblies/sandwicensis/working/sandwicensis.fasta.prep

The output looks like this:

 ____________________
< Checking Parameters >
 --------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/                
                ||----w |
                ||     ||
De Novo Sequences Will Be Extended Through 5 Iterations
Clusters Will Be Considered When TEs Are <100bp Apart
Blast, Extract, Extend Process Will Add 1000bp to Each End in Each Iteration
Conda environment is active
 ____________________
< Making Directories >
 --------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/                
                ||----w |
                ||     ||
 ____________________
< Cleaning Genome >
 --------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/                
                ||----w |
                ||     ||
 ____________________
< Getting RepeatMasker Sequences for arabidopsis and Saving as Fasta >
 --------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/                
                ||----w |
                ||     ||
 ____________________
< Running Initial Mask with RepBase >
 --------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/                
                ||----w |
                ||     ||
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
RepeatMasker version 4.1.2
Search Engine: HMMER [ 3.3 (Nov 2019) ]

Using Master RepeatMasker Database: /apps/repeatmasker/4.1.2/Libraries/RepeatMaskerLib.h5
  Title    : Dfam
  Version  : 3.3
  Date     : 2020-11-09
  Families : 6,953

Species/Taxa Search:
  Arabidopsis [NCBI Taxonomy ID: 3701]
  Lineage: root;cellular organisms;Eukaryota;Viridiplantae;
           Streptophyta;Streptophytina;Embryophyta;Tracheophyta;
           Euphyllophyta;Spermatophyta;Magnoliopsida;Mesangiospermae;
           eudicotyledons;Gunneridae;Pentapetalae;rosids;malvids
  9 families in ancestor taxa; 0 lineage-specific families

Building species libraries in: /home/user/emelianova/.RepeatMaskerCache/HMM-Dfam_3.3/arabidopsis

analyzing file /scratch/botany/katie/assembled_genomes/assemblies/sandwicensis/working/sandwicensis.fasta.prep
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").

No repetitive sequences were detected in /scratch/botany/katie/assembled_genomes/assemblies/sandwicensis/working/sandwicensis.fasta.prep
ERROR: RepeatMasker failed, please check logs. This is likely because of an invalid species search term, if issue persists please use NCBI Taxids (E.G Drosophila is replaced with 7125)

Is there anything I am missing or should check?
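
One detail in the RepeatMasker output above may be relevant: the library built for Arabidopsis from Dfam 3.3 reports "9 families in ancestor taxa; 0 lineage-specific families", so very few models are being searched. Whether this explains the empty result is an assumption. A sketch for pulling these counts out of a saved log (function name hypothetical), which makes it easy to compare different -r search terms:

```python
import re

# Matches the summary line printed in the "Species/Taxa Search" section of the log above.
FAMILY_LINE = re.compile(
    r"(\d[\d,]*) families in ancestor taxa; (\d[\d,]*) lineage-specific families"
)


def library_family_counts(log_text):
    """Return (ancestor_families, lineage_specific_families), or None if the line is absent."""
    match = FAMILY_LINE.search(log_text)
    if match is None:
        return None
    ancestor, lineage = (int(group.replace(",", "")) for group in match.groups())
    return ancestor, lineage
```

If both numbers are close to zero, the initial mask has almost nothing to search with, regardless of how repeat-rich the genome is.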

ERROR: RepeatMasker failed

Hey, can you help me with this issue?

RepeatMasker works well on my computer.

(earlGrey) marcin@marcin-B550-AORUS-ELITE-AX-V2:~/EarlGrey$ ./earlGrey -g bombyxMori.fasta. -s bombyxMori -o ./repeatAnnotation/ -r arthropoda -t 16

bombyxMoriEarlGrey.log

usage questions about softmasking/threads/consistency

Hi,

I tried to run earlGrey locally after installing it with `mamba install earlgrey -c bioconda -c conda-forge` and updating to version 4.0 on conda.

1. About softmasking. Softmasking does not seem to work. For the first run I tried
nohup /usr/bin/time -v earlGrey -g ne_1019.fa -s v1019 -o ./repeatAnnotation -r hemiptera -t 20 -d yes > 1.log 2>2.err &
but found
/home/s76/mambaforge/envs/earlgrey/bin/earlGrey: illegal option -- d
in 2.err and
Softmasked genome will not be generated
in 1.log.

2. About threads. In some other pipelines, like EDTA/LTR_retriever, the -t option should be 1/4 of all free threads, because rmblastn already runs 4 threads per process by default. earlGrey seems to work differently, and I want to use more threads to save time. I killed the process, did not remove any generated files, and reran with -t 80 to see if the runtime could be reduced.
nohup /usr/bin/time -v earlGrey -g ne_1019.fa -s v1019 -o ./repeatAnnotation -r hemiptera -t 80 -d yes > 1.log 2>2.err &

3. About consistency. In the second run with -t 80, I noticed that the "* families discovered" count for RepeatModeler Round #2 in rmod.log differs from the first run with -t 20. I killed the process, did not remove any generated files, and reran with -t 92 to see if the results differed. The "* families discovered" counts for -t 20, 80, and 92 are 7, 8, and 4, respectively.

3.1 rmod.log of first run with -t 20

RepeatModeler Round # 2
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 10000000 bp
   - Sequence extraction : 00:00:08 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:00:15 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
       15159 repeats masked totaling 2566426 bp(s).
   - TE Masking time 00:00:15 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 10442024 bp
       Num Contigs Represented = 17
       Non ambiguous bp:
             Initial: 10005258 bp
             After Masking: 7313959 bp
             Masked: 26.90 % 
 -- Input Database Coverage: 10442024 bp out of 525502193 bp ( 1.99 % )
Sampling Time: 00:00:38 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
  - Total Comparisons = 33930
Comparison Time: 00:03:01 (hh:mm:ss) Elapsed Time, 5226 HSPs Collected
Number of families returned by RECON: 1108
Round Time: 00:03:44 (hh:mm:ss) Elapsed Time : 7 families discovered.

3.2 rmod.log of second run with -t 80

 RepeatModeler Round # 2
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 10000000 bp
   - Sequence extraction : 00:00:08 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:00:15 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
       15753 repeats masked totaling 2635065 bp(s).
   - TE Masking time 00:00:07 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 10480609 bp
       Num Contigs Represented = 15
       Non ambiguous bp:
             Initial: 10014725 bp
             After Masking: 7235025 bp
             Masked: 27.76 % 
 -- Input Database Coverage: 10480609 bp out of 525502193 bp ( 1.99 % )
Sampling Time: 00:00:31 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
  - Total Comparisons = 34191
Comparison Time: 00:01:55 (hh:mm:ss) Elapsed Time, 5866 HSPs Collected
Number of families returned by RECON: 1187
Round Time: 00:02:46 (hh:mm:ss) Elapsed Time : 8 families discovered.

3.3 rmod.log of third run with -t 92

RepeatModeler Round # 2
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 10000000 bp
   - Sequence extraction : 00:00:09 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:00:16 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
       15858 repeats masked totaling 2733407 bp(s).
   - TE Masking time 00:00:08 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 10480663 bp
       Num Contigs Represented = 17
       Non ambiguous bp:
             Initial: 10007133 bp
             After Masking: 7117176 bp
             Masked: 28.88 % 
 -- Input Database Coverage: 10480663 bp out of 525502193 bp ( 1.99 % )
Sampling Time: 00:00:34 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
  - Total Comparisons = 34191
Comparison Time: 00:01:54 (hh:mm:ss) Elapsed Time, 5194 HSPs Collected
Number of families returned by RECON: 1099
Round Time: 00:02:33 (hh:mm:ss) Elapsed Time : 4 families discovered.


Docker version failing at RM step

Hi,
I installed the Docker version as described in the Docker section of the README, but it keeps crashing with ERROR: RepeatModeler Failed.
I have successfully run the same genome after installing EarlGrey as described in the "with sudo" section.
Attached is the log file from the failing run.
AlentisEarlGrey.log
As well as the successful run without docker:
AlentisEarlGrey_good.log
Cheers.

I had a quick look in the files and line 2474 of the failed run says Missing /opt/RepeatMasker/Libraries/RepeatMasker.lib.nsq!
That might point to a problem during the build process of the container.
I'll try to build it again from scratch.

RepeatClassifier unknown option: pa

Hi there. While using the latest docker container (pulled on 2023-06-30) and running a test with this command:

earlGrey -g sequence.fasta -s speciesName -o outputDirectory -t 16

this error came up after RepeatModeler:

Trimming and sorting based on mreps, TRF, SA-SSR
Removing temporary files
Reclassifying repeats
Unknown option: pa
/opt/RepeatModeler/RepeatClassifier - 2.0.4

followed by the help screen text of RepeatClassifier.

I checked this against a colleague's older run of EarlGrey and they had

/opt/RepeatModeler/RepeatClassifier - 2.0.2

So my best guess is that this latest version of RepeatClassifier does not have the -pa option for specifying how many threads to run in parallel (see https://github.com/Dfam-consortium/RepeatModeler/blob/master/RepeatClassifier and https://github.com/Dfam-consortium/RepeatModeler/blob/master/RepModelConfig.pm as well).

No Curated library and wondering what the different .gff outputs mean

Hello Toby,

Thank you for generating the tool and for the detailed and useful paper besides it! I work on a HPC and we installed earlgrey with a singularity container using the docker file as a template. I have run EarlGrey twice, on Chromosome 1 of two species using: earlGrey -g species_chr1.fas -s species_chr1 -o earlGreyOutputs -t 34

Firstly, unfortunately, both times my species.chr1_Curated_Library directory is empty. Do you have any idea why? Is it possibly because one chromosome does not have enough data for this? All other directories have content. Also, my summaryFiles directory is missing the de novo and combined repeat FASTA libraries.

Secondly, the output .gff file in summaryFiles is .filteredRepeats.gff, but this is very limited and shows hardly any of the TEs found in previous runs with EDTA and RepeatMasker. The .gff I found most useful was .out.gff3 in the mergedRepeats directory. However, the names of the TEs are not contained in that .gff: column 3 only says "dispersed_repeat" rather than giving a TE classification. For my analysis I need a .gff file with both the coordinates and the classification. Maybe I have to intersect this with familyLevelCount.txt if there is no other .gff that should have been produced?

For housekeeping:

These are the outputs in my SummaryFiles directory:

-families.fa.strained
.filteredRepeats.gff
.filteredRepeats.bed
.repeatLandscape.pdf
.familyLevelCount.txt
.highLevelCount.txt
.summaryPie.pdf

My results in the mergedRepeats directory:

.filteredRepeats.gff
.rmerge.gff.filtered
.filteredRepeats.bed
.filteredRepeats.bed.bak
.filteredRepeats.summary
.mergedRepeats.revisedTable
.mergedRepeats.bed
.rmerge.gff.sorted
.summary.txt
.rmerge.gff
.rclabel.gff
ltrfinder_reformat_label.gff
_chr1RepeatCraft
.out.gff2
.out.gff3
_chr1LtrFinder

N.B. All these files in both directories were created at the same time... I don't know if that helps!

Also, if there is some kind of info page on the outputs that I have missed, please direct me there. I hope my queries make sense.

Best wishes,

Isabella
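
One way to get coordinates and classification into a single file is to rebuild a GFF from the .filteredRepeats.bed file. The sketch below is not part of Earl Grey. It assumes the BED has six whitespace-separated columns (scaffold, start, end, repeat, score, strand), that BED starts are 0-based, and that repeat names may look like family#Class/Superfamily. The function name bed_line_to_gff and the "EarlGrey" source label are made up for illustration, so check a few lines of your own file before relying on it.

```python
def bed_line_to_gff(line, source="EarlGrey"):
    """Convert one 6-column BED line into a 9-column GFF-style line.

    Assumed BED columns: scaffold, start, end, repeat, score, strand.
    """
    scaf, start, end, repeat, score, strand = line.split()[:6]
    # Assumption: names may look like "family#Class/Superfamily".
    # If there is no "#", the whole name is used for both fields.
    family, _, classification = repeat.partition("#")
    if not classification:
        classification = family
    # Assumption: BED start is 0-based, GFF start is 1-based.
    gff_start = str(int(start) + 1)
    return "\t".join(
        [scaf, source, classification, gff_start, end, score, strand, ".",
         f"ID={family}"]
    )


def bed_to_gff(bed_path, gff_path):
    """Convert a whole BED file, skipping blank lines."""
    with open(bed_path) as bed, open(gff_path, "w") as gff:
        for line in bed:
            if line.strip():
                gff.write(bed_line_to_gff(line) + "\n")
```

If the fourth BED column in your run already holds only the classification, the partition step simply passes it through unchanged.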

Unable to use EarlGrey as non root user over ssh. "sh: /dev/tty: No such device or address"

Hello,
Thanks for your work on this package!

I have been having issues trying to use EarlGrey as a non-root user on my university compute cluster. I followed the instructions under Earl Grey Installation and Configuration (If you DO NOT have RepeatMasker and RepeatModeler) - WITHOUT SUDO PRIVILEGES, but I also had to install CDHIT to properly configure RepeatMasker.

The program runs fine until this stage:
<<< Straining TEs and Refining de novo Consensus Sequences >>>
Building a new DB, current time: 08/28/2023 00:43:07
New DB name: /work/users/a/d/adaigle/EarlGrey_experiments/data/ISO1_GCF_000001215.4_Release_6_prepped.fasta.prep
New DB title: /work/users/a/d/adaigle/EarlGrey_experiments/data/ISO1_GCF_000001215.4_Release_6_prepped.fasta.prep
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1870 sequences in 2.40207 seconds.

Splitting run 1
Initial trf check for 1
sh: /dev/tty: No such device or address

From my brief googling, I suspect this tty error might have to do with me not having administrative privileges on my cluster, but I'm not sure. It could also be that I am launching this as a SLURM job.

Again the program seems to have run without error (and produced outputs) until this stage. The strainer directory still has outputs, but my Curated Library folder is empty. Let me know if you need any more information!
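
The "sh: /dev/tty" message reads as though a child process tried to open the controlling terminal, which batch jobs (for example under SLURM) usually do not have. A quick way to confirm that from inside the same job is sketched below. It only tests whether /dev/tty can be opened and says nothing about whether the message is what actually stops the run; the function name is hypothetical.

```python
import os


def has_controlling_tty(path="/dev/tty"):
    """Return True if the terminal device at `path` can be opened."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError:
        # Typical in batch jobs: "No such device or address".
        return False
    os.close(fd)
    return True


if __name__ == "__main__":
    print("controlling terminal available:", has_controlling_tty())
```

Running this once in an interactive shell and once inside the submitted job shows whether the two environments differ in this respect.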

Docker build instructions fail

Hi Toby,
I've had trouble installing via the docker, with the following output at the configure stage:
./configure: 5: /anaconda3/envs/earlGrey/etc/conda/activate.d/activate-binutils_linux-64.sh: Syntax error: "(" unexpected The command '/bin/sh -c cd /opt/ && git clone https://github.com/TobyBaril/EarlGrey && cd EarlGrey && chmod +x ./configure && eval "$(/anaconda3/bin/conda shell.bash hook)" && ./configure' returned a non-zero code: 2

Potentially this is because the YML file that creates the Anaconda environment inside the container has pinned versions of packages that have a bug?

conda/conda#9959
Let me know if this is fixable.
Thank you,
Luke

Execution halted and ERROR: strict merge also failed

Hello,
I was trying to use Earl Grey on my genome, but at the end the execution was halted due to the error below:

FileNotFoundError: [Errno 2] No such file or directory: '/home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_mergedRepeats//plasmodiophoraBrassicae.filteredRepeats.bed'
mv: cannot stat '/home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_mergedRepeats//plasmodiophoraBrassicae.filteredRepeats.bed.2': No such file or directory
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.4 ✔ purrr 1.0.2
✔ tibble 3.2.1 ✔ dplyr 1.1.4
✔ tidyr 1.3.0 ✔ stringr 1.5.1
✔ readr 2.1.4 ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Warning messages:
1: package ‘ggplot2’ was built under R version 4.2.3
2: package ‘tibble’ was built under R version 4.2.3
3: package ‘tidyr’ was built under R version 4.2.3
4: package ‘readr’ was built under R version 4.2.3
5: package ‘purrr’ was built under R version 4.2.3
6: package ‘dplyr’ was built under R version 4.2.3
7: package ‘stringr’ was built under R version 4.2.3
8: package ‘forcats’ was built under R version 4.2.3
[1] "/home/edelab/miniconda3/envs/earlgrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/edelab/miniconda3/envs/earlgrey/share/earlgrey-4.0-0/scripts//makeGff.R"
[5] "--args"
[6] "/home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_mergedRepeats//plasmodiophoraBrassicae.filteredRepeats.bed"
[7] "/home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_mergedRepeats//plasmodiophoraBrassicae.rmerge.gff.filtered"
[8] "/home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_mergedRepeats//plasmodiophoraBrassicae.filteredRepeats.gff"
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file '/home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_mergedRepeats//plasmodiophoraBrassicae.filteredRepeats.bed': No such file or directory
Execution halted

          )  (
         (   ) )
         ) ( (
       _______)_
    .-'---------|
   ( C|/\/\/\/\/|
    '-./\/\/\/\/|
     '_________'
      '-------'
    <<< Done! >>>

ERROR: strict merge also failed, check /home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_RepeatMasker_Against_Custom_Library/pb3A_assembly.fasta.prep.out looks as expected

Can you help me with this?

Thanks

How can I get the masked genome for the next annotation step, e.g. Braker2?

As far as I can see, there are _.filteredRepeats.bed, _.highLevelCount.txt, _.summaryPie.pdf,
_de_novo_repeat_library_iter4.fasta.clustered.fa, _.filteredRepeats.gff and _.repeatLandscape.pdf in __summaryFiles/,
and
_.clusTErs.bed and _.detailedClusTErs.bed in _clusTErs/.
Could you please tell me how to get the genome with its repeat sequences masked, so that I can run the next annotation step, e.g. Braker2? Thank you!
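
If a masked FASTA is not among the outputs, one option is to soft-mask the genome yourself from the .filteredRepeats.bed coordinates. The sketch below shows the idea on in-memory sequences; it assumes 0-based, half-open BED intervals and uses a made-up function name. In practice, bedtools maskfasta with the -soft flag does the same job on files.

```python
from collections import defaultdict


def soft_mask(sequences, intervals):
    """Lower-case every interval in a dict of {name: sequence}.

    intervals: iterable of (name, start, end) tuples,
    assumed 0-based and half-open as in BED.
    """
    by_seq = defaultdict(list)
    for name, start, end in intervals:
        by_seq[name].append((start, end))

    masked = {}
    for name, seq in sequences.items():
        chars = list(seq)
        for start, end in by_seq.get(name, []):
            # Slices are clipped to the sequence length on both sides.
            chars[start:end] = [c.lower() for c in chars[start:end]]
        masked[name] = "".join(chars)
    return masked


if __name__ == "__main__":
    genome = {"chr1": "ACGTACGT"}
    repeats = [("chr1", 2, 5)]
    print(soft_mask(genome, repeats))
```

Soft-masking (lower case) rather than hard-masking (N) keeps the underlying sequence available to downstream tools that understand lower-case repeats.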

Resume in-progress run

Hello! One of my earlGrey runs died several days into its runtime, and I was hoping to avoid having to wait for all that work to complete again. Is there a way to re-enter the pipeline at a particular point?

It looks like my run failed at Trimming and sorting based on mreps, TRF, SA-SSR, which makes sense given my R issues in #57.
The log following this section reads:


Trimming and sorting based on mreps, TRF, SA-SSR
Warning messages:
1: package ‘plyranges’ was built under R version 4.2.1 
2: package ‘BiocGenerics’ was built under R version 4.2.1 
3: package ‘IRanges’ was built under R version 4.2.1 
4: package ‘S4Vectors’ was built under R version 4.2.1 
5: package ‘GenomicRanges’ was built under R version 4.2.1 
6: package ‘GenomeInfoDb’ was built under R version 4.2.1 
Warning messages:
1: package ‘BSgenome’ was built under R version 4.2.1 
2: package ‘Biostrings’ was built under R version 4.2.1 
3: package ‘XVector’ was built under R version 4.2.1 
4: package ‘rtracklayer’ was built under R version 4.2.1 
5: no function found corresponding to methods exports from ‘BSgenome’ for: ‘releaseName’ 
Error in ..Internal(is.unsorted(x, FALSE, FALSE)) : 
  3 arguments passed to .Internal(is.unsorted) which requires 2
Calls: %>% ... <Anonymous> -> .splitAsList_by_integer_Rle -> <Anonymous>
Execution halted
Removing temporary files
Reclassifying repeats
cp: cannot stat 'TS_Tagetes_erecta-families.fa_8767/trf/Tagetes_erecta-families.fa.nonsatellite': No such file or directory
No database indicated or it is an empty file.
/lab/solexa_weng/testtube/matthew/earlgrey/RepeatModeler-2.0.4/RepeatClassifier - 2.0.4

... [the help info for repeatclassifier is printed here]

/lab/solexa_weng/Documents/matthew/scripts/test/earlGrey/Tagetes_erecta_EarlGrey/Tagetes_erecta_strainer
Compiling library
cat: TS_Tagetes_erecta-families.fa_8767/trf/Tagetes_erecta-families.fa.satellites: No such file or directory
ERROR: TEstrainer failed to produce a strain file, please check the log file for more information
slurmstepd: error: Detected 8 oom-kill event(s) in StepId=645700.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

TEstrainer_for_earlGrey and /dev/tty

Hi Toby
I'm trying to use earlGrey to generate a set of libraries for further curation. However, I'm facing an inconvenience.
After the de novo prediction and classification, when earlGrey runs TEstrainer_for_earlGrey.sh, I get messages like this:

Splitting run 1
Initial trf check for 1
sh: 1: cannot open /dev/tty: No such device or address
5% 157:2521=0s rnd-1_family-78.fasta                                            sh: 1: cannot open /dev/tty: No such device or address
13% 360:2318=6s rnd-1_family-358.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
20% 560:2118=7s rnd-1_family-561.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
28% 762:1916=7s rnd-1_family-744.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
36% 966:1712=7s rnd-1_family-967.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
43% 1170:1508=6s rnd-4_family-1559.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
51% 1368:1310=5s rnd-4_family-1301.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
58% 1570:1108=4s rnd-4_family-2876.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
66% 1773:905=4s rnd-5_family-4154.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
73% 1975:703=3s rnd-5_family-699.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
81% 2178:500=2s rnd-5_family-5375.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
88% 2379:299=1s rnd-5_family-2862.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
96% 2583:95=0s rnd-5_family-6095.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
100% 2678:0=0s rnd-5_family-12418.fasta
Initial blast and preparation for MSA 1
sh: 1: cannot open /dev/tty: No such device or address
0% 0:2678=0s rnd-1_family-21.fasta                                              Whoa boy, rnd-1_family-134#DNA/Zisupton is inside tandem repeat
0% 1:2677=0s rnd-1_family-19.fasta                                              sh: 1: cannot open /dev/tty: No such device or address
0% 1:2677=44m37s rnd-1_family-19.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
0% 1:2677=44m37s rnd-1_family-19.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
0% 1:2677=44m37s rnd-1_family-19.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
0% 1:2677=44m37s rnd-1_family-19.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
0% 1:2677=44m37s rnd-1_family-19.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
0% 1:2677=44m37s rnd-1_family-19.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
0% 5:2673=44m37s rnd-1_family-38.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
0% 11:2667=49m00s rnd-1_family-31.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
0% 13:2665=45m49s rnd-1_family-55.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
0% 13:2665=43m02s rnd-1_family-55.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
0% 14:2664=42m14s rnd-1_family-68.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
0% 14:2664=42m01s rnd-1_family-68.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
0% 14:2664=42m00s rnd-1_family-68.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
0% 16:2662=41m40s rnd-1_family-161.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
0% 18:2660=41m03s rnd-1_family-87.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
0% 22:2656=39m00s rnd-1_family-156.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
0% 23:2655=37m20s rnd-1_family-8.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
0% 24:2654=36m50s rnd-1_family-81.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
0% 25:2653=36m20s rnd-1_family-63.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
0% 26:2652=35m57s rnd-1_family-40.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
1% 27:2651=35m45s rnd-1_family-32.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
1% 30:2648=35m18s rnd-1_family-197.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
1% 33:2645=34m03s rnd-1_family-45.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
1% 34:2644=33m12s rnd-1_family-130.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
1% 35:2643=32m50s rnd-1_family-1.fasta                                          sh: 1: cannot open /dev/tty: No such device or address
1% 36:2642=32m36s rnd-1_family-128.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
1% 36:2642=32m36s rnd-1_family-128.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
1% 37:2641=32m36s rnd-1_family-153.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
1% 37:2641=32m36s rnd-1_family-153.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
1% 39:2639=32m36s rnd-1_family-115.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
1% 39:2639=32m36s rnd-1_family-115.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
1% 41:2637=32m36s rnd-1_family-91.fasta                                         sh: 1: cannot open /dev/tty: No such device or address
1% 44:2634=32m36s rnd-1_family-121.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
1% 50:2628=31m57s rnd-1_family-182.fasta                                        sh: 1: cannot open /dev/tty: No such device or address
1% 51:2627=31m30s rnd-1_family-58.fasta                                         sh: 1: cannot open /dev/tty: No such device or address

Do you have any clue about what is going on?
Thanks for your support!

How to modify the summaryFiles/Repeat Landscape showing TE activity (PDF)?

After obtaining the result, I would like to remove the unclassified and non-repeat bar. Could you please guide me on how to do that? Additionally, I am wondering which file I should utilize to generate a new Repeat Landscape plot.

6th column of GFF output

Hi all,
This is a naive question, as I'm not personally a user of EarlGrey, but am interpreting output produced by a collaborator.
What is the meaning of the 'score' column of the GFF output? Is it Kimura distance?
I apologise if this information was included in the documentation but I missed it.
Thank you in advance,
Luke

/workspace/EarlGrey/earlGrey: line 83: famdb.py: command not found

Hi,

Thank you for making earlGrey. I tried to use it through gitpod and wanted to include RepBase.
I uploaded RMRBSeqs.embl and RMRBSeqs.embl to /workspace/conda/envs/earlGrey/share/RepeatMasker/Libraries
and ran perl ./configure

(earlGrey) gitpod /workspace/conda/envs/earlGrey/share/RepeatMasker $  perl ./configure

Enter Selection: 5
Building FASTA version of RepeatMasker.lib ........................................
Building RMBlast frozen libraries..
The program is installed with a the following repeat libraries:
File: /workspace/conda/envs/earlGrey/share/RepeatMasker/Libraries/RepeatMaskerLib.h5
FamDB Generator: famdb.py v0.4.2
FamDB Format Version: 0.5
FamDB Creation Date: 2023-01-08 10:42:05.645898

Database: Dfam withRBRM
Version: 3.7
Date: 2023-01-11

Dfam - A database of transposable element (TE) sequence alignments and HMMs.
RBRM - RepBase RepeatMasker Edition - version 20181026

Total consensus sequences: 64595
Total HMMs: 19730


Further documentation on the program may be found here:
  /workspace/conda/envs/earlGrey/share/RepeatMasker/repeatmasker.help

It changed RMBlast to [ Configured, Default ] and left the others unconfigured.

I then ran the command
nohup /usr/bin/time -v earlGrey -g /workspace/ne_1019.fa -s v1019 -o ./repeatAnnotation -r insecta -t 4 > 1.log 2>2.err &
and noticed that the log file reported
/workspace/EarlGrey/earlGrey: line 83: famdb.py: command not found

Is this a bug, or is it OK to ignore? Am I configuring and running it correctly?

1.log
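
The "command not found" message means the shell could not find famdb.py on PATH when line 83 of the earlGrey script ran. A small check like the one below, run inside the activated environment, shows which helper commands are visible. The list of names to check is only an example, and the sketch does not say whether the run can safely proceed without famdb.py.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Example names only; adjust to the commands your log complains about.
    print("not on PATH:", missing_tools(["famdb.py", "RepeatMasker"]))
```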

In these directories (_clusTErs, _mergedRepeats, _summaryFiles), I get the following error:

Progress:3571426/3571437...
Progress:3571427/3571437...
Progress:3571428/3571437...
Progress:3571429/3571437...
Progress:3571430/3571437...
Progress:3571431/3571437...
Progress:3571432/3571437...
Progress:3571433/3571437...
Progress:3571434/3571437...
Progress:3571435/3571437...
Progress:3571436/3571437...Step 5: Merging GFF records by labels...
Step 6: Writing stat file..Removing tmp files...
Done
Traceback (most recent call last):
File "/home/dell/EarlGrey/scripts/repeatCraft/repeatcraft.py", line 187, in <module>
rcStatm.rcstat(rclabelp=outputnamelabel,rmergep=outputnamemerge,outfile= statfname, ltrgroup = True)
File "/home/dell/EarlGrey/scripts/repeatCraft/helper/rcStatm.py", line 54, in rcstat
if rowRaw.get(col[2]):
IndexError: list index out of range

< Resolving Overlapping Repeats >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, colnames,
dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

expand.grid, I, unname

Loading required package: IRanges
Loading required package: GenomeInfoDb
Warning messages:
1: package ‘GenomicRanges’ was built under R version 4.1.2
2: package ‘S4Vectors’ was built under R version 4.1.2
3: package ‘IRanges’ was built under R version 4.1.2
Warning message:
package ‘ape’ was built under R version 4.1.2
[1] "/home/dell/miniconda3/envs/earlGrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/dell/EarlGrey/scripts/filteringOverlappingRepeats.R"
[5] "--args"
[6] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.rmerge.gff.sorted"
[7] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.rmerge.gff.filtered"
Error: package or namespace load failed for ‘tidyverse’:
package ‘rlang’ was installed before R 4.0.0: please re-install it
Execution halted
cp: cannot stat '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed': No such file or directory
Traceback (most recent call last):
File "/home/dell/EarlGrey/scripts/backSwap.py", line 14, in <module>
table = pd.read_csv(input, names = ['scaf', 'start', 'end', 'repeat', 'score', 'strand'], delim_whitespace = True, header = None)
File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 948, in __init__
self._make_engine(self.engine)
File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 2010, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] No such file or directory: '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed'
mv: cannot stat '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed.2': No such file or directory
Error: package or namespace load failed for ‘tidyverse’:
package ‘rlang’ was installed before R 4.0.0: please re-install it
Execution halted

< Done! >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

< Generating Summary Plots >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

Error: package or namespace load failed for ‘tidyverse’:
package ‘rlang’ was installed before R 4.0.0: please re-install it
Execution halted
Error: package or namespace load failed for ‘tidyverse’:
package ‘rlang’ was installed before R 4.0.0: please re-install it
Execution halted

< Identifying TE Clusters and Member Sequences >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

Error: Unable to open file /home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed. Exiting.
Error: The requested file (/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed) could not be opened. Error message: (No such file or directory). Exiting!

< Tidying Directories and Organising Important Files >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

cp: cannot stat '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed': No such file or directory
cp: cannot stat '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.gff': No such file or directory

< Done in 01:05:05.00 >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

< De Novo Library, Combined Library, Summary Figures, and TE Quantifications in Standard Formats Can Be Found in /home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_summaryFiles/ >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

Could you please tell me how to solve it? Thank you!

Low coverage datasets?

Hello, how would you expect EarlGrey to perform with low-coverage datasets, and how might it compare with RepeatExplorer2? Thank you.

Multiple sample consensus library/polymorphic insertion annotation

Hi,

One of the features I found useful with EDTA was the script for combining consensus repeat libraries across multi-sample datasets. Is there a way to combine Earl Grey results from assemblies of different individuals, or better yet use a Pangenome graph or VCF file as the input?

I produced a fairly simple wrapper script for annotating a VCF with a multi-sample consensus repeat library. Just using RepeatMasker to annotate the inserted sequences in a "pangenome" VCF (i.e. variants called with assembly vs assembly alignments). https://github.com/swomics/VCF_TE_annotate. Maybe this could be tweaked to become a module?

Cheers,
Sam

EarlGrey execution time

Hi Toby,

How long does earlGrey usually take to complete? The genome assembly is about 650MB with 17 chrs. The EarlGrey-v3.0 job (with '-t 8') has been running for over 9 days. Is this normal? Thank you.

misformatted lines in gff output

Hi Toby!

I've been analysing the results from running the latest version of earlGrey and ran into an error when using bedtools coverage with the final gff made by Earl Grey (found in X_summaryFiles).

The error from running bedtools coverage -a species.seqlen.bed -b species.repeats.gff is:

Error: Invalid record in file species.repeats.gff. Record is
LR990257.1 RepeatMasker Unknown 66015 65838 500 - . TSTART=2;TEND=63;ID=RND-1_FAMILY-358;SHORTTE=T

I believe this is because Bedtools expects GFFs to always have a start coordinate smaller than the end coordinate, which is conventional for GFFs. It would be great if this could be fixed so that the GFFs are consistent.

Best wishes,

Charlotte
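
Until the output is fixed upstream, the affected records can be repaired after the fact by swapping columns 4 and 5 wherever start is larger than end. This is a workaround sketch, not the pipeline's own fix. It assumes the file is tab-delimited and that the strand column already records orientation, so swapping loses no information.

```python
def fix_gff_line(line):
    """Swap columns 4 and 5 of a GFF line when start > end."""
    # Leave comment and blank lines untouched.
    if line.startswith("#") or not line.strip():
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 5 and int(fields[3]) > int(fields[4]):
        fields[3], fields[4] = fields[4], fields[3]
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline


def fix_gff(in_path, out_path):
    """Write a copy of the GFF with start <= end on every record."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(fix_gff_line(line))
```

Applied to the record quoted above, this yields 65838 as the start and 66015 as the end, with the "-" strand unchanged.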

[suggestion] docker container

disregard, just made it to the end of the installation description :-)

swapping chromosome names

Hi Toby,

I sometimes have issues with the backSwap.py step. There are no obvious error messages, but I'm not getting the file with properly swapped chr names, which causes issues in the subsequent step. I was wondering if you think this awk one-liner does the correct thing when swapping the chr names back:

awk 'BEGIN{FS="\t";OFS="\t"}NR==FNR{a[$2]=$1;next}($1 in a){$1=a[$1]}1' {dict} {species}.rmerge.gff.filtered > {species}.rmerge.gff.filtered.2

If so is it possible to replace backSwap.py with something similar?
Thanks!
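
For comparison, the same logic as the awk one-liner can be written in Python as below. This is only a sketch of what the one-liner does, not the contents of backSwap.py. It assumes the dictionary file is tab-delimited with the original name in column 1 and the temporary name in column 2, which is what a[$2]=$1 implies.

```python
def swap_names(dict_lines, gff_lines):
    """Replace column 1 of each GFF line using a col2 -> col1 dictionary."""
    mapping = {}
    for line in dict_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            # Column 2 is the temporary name, column 1 the original.
            mapping[fields[1]] = fields[0]

    swapped = []
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        # Names without a dictionary entry are left unchanged.
        fields[0] = mapping.get(fields[0], fields[0])
        swapped.append("\t".join(fields))
    return swapped
```

One caveat with the awk version: NR==FNR is also true for the second file if the dictionary file is empty, in which case nothing is printed at all. The Python version reads the two inputs separately and does not have that failure mode.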

Unable to install EarlGrey

Hi,

I am new to bioinformatics and was struggling to perform repeat prediction and annotation. I came across the bioRxiv paper for EarlGrey and felt that this tool could help me. I first installed all the required packages, including RepeatMasker, using conda. I then got the repeat libraries, put them in /usr/local/RepeatMasker/Libraries/, and followed the installation instructions for when sudo privileges are not available. During installation, after many packages had been installed, I got the following error message:
ERROR conda.core.link:_execute(733): An error occurred while installing package 'bioconda::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_0'.
Rolling back transaction: done
class: LinkError
message:
post-link script failed for package bioconda::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_0
location of failed script: /mnt/HD1/miniconda3/envs/earlGrey/bin/.bioconductor-genomeinfodbdata-post-link.sh
==> script messages <==

==> script output <==
stdout: /mnt/HD1/miniconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/mnt/HD1/miniconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/mnt/HD1/miniconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
ERROR: post-link.sh was unable to download any of the following URLs with the md5sum 74c82f26111062a9ceb3c5331088cd56:
https://bioconductor.org/packages/3.14/data/annotation/src/contrib/GenomeInfoDbData_1.2.7.tar.gz
https://bioarchive.galaxyproject.org/GenomeInfoDbData_1.2.7.tar.gz
https://depot.galaxyproject.org/software/bioconductor-genomeinfodbdata/bioconductor-genomeinfodbdata_1.2.7_src_all.tar.gz

stderr: % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 416 100 416 0 0 773 0 --:--:-- --:--:-- --:--:-- 773
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 153 100 153 0 0 129 0 0:00:01 0:00:01 --:--:-- 129
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 153 100 153 0 0 155 0 --:--:-- --:--:-- --:--:-- 155
md5sum: WARNING: 1 computed checksum did NOT match

return code: 1

kwargs:
{}

Traceback (most recent call last):
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1129, in call
return func(*args, **kwargs)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda_env/cli/main.py", line 80, in do_call
exit_code = getattr(module, func_name)(args, parser)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/notices/core.py", line 72, in wrapper
return_value = func(*args, **kwargs)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda_env/cli/main_create.py", line 156, in execute
result[installer_type] = installer.install(prefix, pkg_specs, args, env)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda_env/installers/conda.py", line 59, in install
unlink_link_transaction.execute()
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/core/link.py", line 284, in execute
self._execute(tuple(concat(interleave(self.prefix_action_groups.values()))))
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/core/link.py", line 747, in _execute
raise CondaMultiError(tuple(concatv(
conda.CondaMultiErrorclass: LinkError
message:
post-link script failed for package bioconda::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_0
location of failed script: /mnt/HD1/miniconda3/envs/earlGrey/bin/.bioconductor-genomeinfodbdata-post-link.sh
==> script messages <==

==> script output <==
stdout: /mnt/HD1/miniconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/mnt/HD1/miniconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/mnt/HD1/miniconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
ERROR: post-link.sh was unable to download any of the following URLs with the md5sum 74c82f26111062a9ceb3c5331088cd56:
https://bioconductor.org/packages/3.14/data/annotation/src/contrib/GenomeInfoDbData_1.2.7.tar.gz
https://bioarchive.galaxyproject.org/GenomeInfoDbData_1.2.7.tar.gz
https://depot.galaxyproject.org/software/bioconductor-genomeinfodbdata/bioconductor-genomeinfodbdata_1.2.7_src_all.tar.gz

return code: 1

kwargs:
{}

: <exception str() failed>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/mnt/HD1/miniconda3/bin/conda-env", line 9, in
sys.exit(main())
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda_env/cli/main.py", line 91, in main
return conda_exception_handler(do_call, args, parser)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1429, in conda_exception_handler
return_value = exception_handler(func, *args, **kwargs)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1132, in call
return self.handle_exception(exc_val, exc_tb)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1161, in handle_exception
return self.handle_application_exception(exc_val, exc_tb)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1175, in handle_application_exception
self._print_conda_exception(exc_val, exc_tb)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1179, in _print_conda_exception
print_conda_exception(exc_val, exc_tb)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1106, in print_conda_exception
stderrlog.error("\n%r\n", exc_val)
File "/mnt/HD1/miniconda3/lib/python3.9/logging/init.py", line 1475, in error
self._log(ERROR, msg, args, **kwargs)
File "/mnt/HD1/miniconda3/lib/python3.9/logging/init.py", line 1589, in _log
self.handle(record)
File "/mnt/HD1/miniconda3/lib/python3.9/logging/init.py", line 1598, in handle
if (not self.disabled) and self.filter(record):
File "/mnt/HD1/miniconda3/lib/python3.9/logging/init.py", line 806, in filter
result = f.filter(record)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/gateways/logging.py", line 50, in filter
record.msg = record.msg % new_args
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/init.py", line 107, in repr
errs.append(e.repr())
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/init.py", line 64, in repr
return '%s: %s' % (self.class.name, str(self))
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/init.py", line 68, in str
return str(self.message % self._kwargs)
ValueError: unsupported format character 'T' (0x54) at index 1032
Setting path variables in script
Path variables set
Extracting zip archives
Extracted required archives
Remember to activate the earl grey conda environment before running earlGrey
earlGrey is ready to use. To execute from any directory, add earlGrey to path by pasting the code (minus the square brackets) below...
[export PATH=$PATH:$(realpath .)]

My server runs Ubuntu, and I have never run RepeatMasker or other related tools before, so I am not sure where the error is or how to resolve it. I would be sincerely grateful for your help. Please also let me know if I need to provide any further information to help you diagnose this issue.

Thank you so much in advance,

With best regards
Amit
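
The second traceback in the log above looks like a side effect rather than the root cause: curl's progress header (`% Total % Received ...`) ends up inside the exception message, and the traceback shows conda applying `self.message % self._kwargs` to it. A minimal sketch of that secondary failure, using a made-up excerpt of the message (the helper name is hypothetical):

```python
def format_message(message, kwargs):
    """Mimic the `message % kwargs` step seen in the traceback."""
    return message % kwargs


# Hypothetical excerpt; the real message contains curl's full progress table.
captured = "stderr: % Total % Received % Xferd"

try:
    format_message(captured, {})
    error = None
except ValueError as exc:
    # "% T" is parsed as a format specifier with an unknown conversion 'T'.
    error = str(exc)

print(error)
```

If that reading is right, the part worth acting on is the earlier message: all three download attempts returned very small responses (416 and 153 bytes in the log) that failed the md5 check.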

Error at RepeatMerger stage

Hello, I'm running into what I think is an installation and/or compatibility issue. The pipeline runs up to the stage that generates *_mergedRepeats, but fails to generate the GFF files with the message:

Can't locate CrossmatchSearchEngine.pm in @INC (you may need to install the CrossmatchSearchEngine module) (@INC contains: /home/amanda/miniconda3/envs/earlGrey/bin/../ /home/amanda/miniconda3/envs/earlGrey/bin /home/amanda/miniconda3/envs/earlGrey/lib/site_perl/5.26.2/x86_64-linux-thread-multi /home/amanda/miniconda3/envs/earlGrey/lib/site_perl/5.26.2 /home/amanda/miniconda3/envs/earlGrey/lib/5.26.2/x86_64-linux-thread-multi /home/amanda/miniconda3/envs/earlGrey/lib/5.26.2 .) at /home/amanda/miniconda3/envs/earlGrey/bin/rmOutToGFF3.pl line 76.

Later on, the pipeline continues regardless up to "Resolving Overlapping Repeats", where it also reports:

Error in library(GenomicRanges) : 
  there is no package called ‘GenomicRanges’
Execution halted

I'm attaching the last 300 lines of the log file:
EarlGrey.log

I'm not sure how to solve it. I'm using RepeatMasker v4.1.2-p1 and RepeatModeler v2.0.3, both installed with conda as part of the earlGrey environment before running the configure file.
Thanks
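
One way to narrow down the first error is to check whether CrossmatchSearchEngine.pm exists anywhere in the environment, since Perl only searches the directories listed in @INC. The sketch below is a generic file search, not part of Earl Grey. The function names are hypothetical, and the idea that adding the directory to PERL5LIB is enough for rmOutToGFF3.pl to find the module is an assumption.

```python
import os


def find_module_dirs(root, filename="CrossmatchSearchEngine.pm"):
    """Return every directory under `root` that contains `filename`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            hits.append(dirpath)
    return sorted(hits)


def perl5lib_exports(root):
    """Format one shell export line per directory found (a suggestion only)."""
    return [f'export PERL5LIB="{d}:$PERL5LIB"' for d in find_module_dirs(root)]
```

For example, `perl5lib_exports(os.environ["CONDA_PREFIX"])` could be called with the earlGrey environment active. An empty list would suggest the RepeatMasker Perl modules are missing from the environment altogether.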

LTRStruct parameter

Hello,

Thanks for your work on this tool. We were finally able to get it up and running after many dependency issues. One question I have as I experiment with earlGrey: RepeatModeler2 has the LTRStruct parameter, which we typically enable on every run. It doesn't seem to be possible to enable this parameter within the EarlGrey commands.

Am I correct in assuming that this is disabled for a reason? Would there be any benefit to turning it on?

Thanks,
Sam

Installation error (using existing RepeatMasker and RepeatModeler installs)

I have a working conda installation of RepeatModeler and RepeatMasker, so I wanted to install EarlGrey in the same conda environment (called “earlGrey”), following the instructions under “If you already have RepeatMasker and RepeatModeler”.

I get the following error on the ./configure step:

ERROR conda.core.link:_execute(730): An error occurred while installing package 'bioconda::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_0'.
Rolling back transaction: done
class: LinkError
message:
post-link script failed for package bioconda::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_0
location of failed script: /home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/bin/.bioconductor-genomeinfodbdata-post-link.sh
==> script messages <==

==> script output <==
stdout: /home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
ERROR: post-link.sh was unable to download any of the following URLs with the md5sum 74c82f26111062a9ceb3c5331088cd56:
https://bioconductor.org/packages/3.14/data/annotation/src/contrib/GenomeInfoDbData_1.2.7.tar.gz
https://bioarchive.galaxyproject.org/GenomeInfoDbData_1.2.7.tar.gz
https://depot.galaxyproject.org/software/bioconductor-genomeinfodbdata/bioconductor-genomeinfodbdata_1.2.7_src_all.tar.gz

stderr: % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 416 100 416 0 0 870 0 --:--:-- --:--:-- --:--:-- 870
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 153 100 153 0 0 135 0 0:00:01 0:00:01 --:--:-- 135
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 153 100 153 0 0 141 0 0:00:01 0:00:01 --:--:-- 142
md5sum: WARNING: 1 computed checksum did NOT match

return code: 1

kwargs:
{}

Traceback (most recent call last):
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1082, in call
return func(*args, **kwargs)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda_env/cli/main.py", line 80, in do_call
exit_code = getattr(module, func_name)(args, parser)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda_env/cli/main_create.py", line 142, in execute
result[installer_type] = installer.install(prefix, pkg_specs, args, env)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda_env/installers/conda.py", line 59, in install
unlink_link_transaction.execute()
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/core/link.py", line 281, in execute
self._execute(tuple(concat(interleave(itervalues(self.prefix_action_groups)))))
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/core/link.py", line 744, in _execute
raise CondaMultiError(tuple(concatv(
conda.CondaMultiErrorclass: LinkError
message:
post-link script failed for package bioconda::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_0
location of failed script: /home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/bin/.bioconductor-genomeinfodbdata-post-link.sh
==> script messages <==

==> script output <==
stdout: /home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
ERROR: post-link.sh was unable to download any of the following URLs with the md5sum 74c82f26111062a9ceb3c5331088cd56:
https://bioconductor.org/packages/3.14/data/annotation/src/contrib/GenomeInfoDbData_1.2.7.tar.gz
https://bioarchive.galaxyproject.org/GenomeInfoDbData_1.2.7.tar.gz
https://depot.galaxyproject.org/software/bioconductor-genomeinfodbdata/bioconductor-genomeinfodbdata_1.2.7_src_all.tar.gz

return code: 1

kwargs:
{}

: <exception str() failed>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/ldapusers/janneke.aylward/anaconda3/bin/conda-env", line 7, in
sys.exit(main())
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda_env/cli/main.py", line 91, in main
return conda_exception_handler(do_call, args, parser)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1374, in conda_exception_handler
return_value = exception_handler(func, *args, **kwargs)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1085, in call
return self.handle_exception(exc_val, exc_tb)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1116, in handle_exception
return self.handle_application_exception(exc_val, exc_tb)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1132, in handle_application_exception
self._print_conda_exception(exc_val, exc_tb)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1136, in _print_conda_exception
print_conda_exception(exc_val, exc_tb)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1059, in print_conda_exception
stderrlog.error("\n%r\n", exc_val)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/logging/init.py", line 1463, in error
self._log(ERROR, msg, args, **kwargs)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/logging/init.py", line 1577, in _log
self.handle(record)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/logging/init.py", line 1586, in handle
if (not self.disabled) and self.filter(record):
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/logging/init.py", line 807, in filter
result = f.filter(record)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/gateways/logging.py", line 61, in filter
record.msg = record.msg % new_args
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/init.py", line 132, in repr
errs.append(e.repr())
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/init.py", line 71, in repr
return '%s: %s' % (self.class.name, text_type(self))
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/init.py", line 90, in str
return text_type(self.message % self._kwargs)
ValueError: unsupported format character 'T' (0x54) at index 1120
Setting path variables in script
Path variables set
Extracting zip archives
Extracted required archives
Remember to activate the earl grey conda environment before running earlGrey
earlGrey is ready to use. To execute from any directory, add earlGrey to path by pasting the code (minus the square brackets) below...
[export PATH=$PATH:$(realpath .)]
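
Since the post-link script fails on an md5 mismatch, one thing that can be checked by hand is whether a manually downloaded copy of the tarball matches the checksum quoted in the error. A minimal sketch follows. The expected value is copied from the log above, while the function names and the file path in the usage note are placeholders.

```python
import hashlib

# Checksum quoted in the post-link error message above.
EXPECTED_MD5 = "74c82f26111062a9ceb3c5331088cd56"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """True when the file's md5 equals the expected digest."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_expected("GenomeInfoDbData_1.2.7.tar.gz")` on a downloaded file returning False would be consistent with the server having returned something other than the archive, for example a short error page.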

python error

Hi, I am trying to run EarlGrey but the software dies after the RepeatClassifier step.
I get the following error in the "Straining TEs and Refining de novo Consensus Sequences" step:

ImportError: cannot import name '_aligners' from partially initialized module 'Bio.Align' (most likely due to a circular import) (/services/tools/earlgrey/20221213/lib/python3.6/site-packages/Bio/Align/__init__.py)

Any idea what might be going on?

Does overlapping TE removal lead to possible nested TEs being ignored?

Dear TobyBaril,

According to step 14: "Following defragmentation, Earl Grey removes overlapping TE annotations using a custom R script employing GenomicRanges (Lawrence et al., 2013), which ignores strand information and retains the longest TE of overlapping pairs."

I wonder if this leads to an underrepresentation of nested TEs?

Thanks!
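
To make the question concrete, here is a small sketch of a literal "keep the longest of overlapping annotations" rule. It is not Earl Grey's R script, which may trim or split annotations rather than drop them. This is a simplified greedy version with made-up coordinates, written only to show what would happen to a nested element under that literal reading.

```python
def overlaps(a, b):
    """True if two 1-based inclusive (start, end) intervals share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def keep_longest(intervals):
    """Greedy filter: visit intervals longest first, keep those that do not
    overlap anything already kept. Strand is ignored, as in the quoted text."""
    kept = []
    for iv in sorted(intervals, key=lambda x: x[1] - x[0], reverse=True):
        if not any(overlaps(iv, k) for k in kept):
            kept.append(iv)
    return sorted(kept)


host = (1000, 6000)      # hypothetical long element
nested = (2500, 3000)    # hypothetical insertion inside the host
separate = (7000, 7500)  # hypothetical non-overlapping element

result = keep_longest([host, nested, separate])
print(result)
```

Under this rule the nested element is removed entirely, which is the underrepresentation the question is asking about.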

TEstrainer error

Hi, thanks for providing this helpful pipeline! I installed earlgrey 3.2 with conda. My run died at the TEstrainer step with the following error message:

<<< Straining TEs and Refining de novo Consensus Sequences >>>

Splitting run 1
Initial trf check for 1
Initial blast and preparation for MSA 1
Primary alignment run 1
Trimming run 1
Finished extension
cat: 'TS_mAcoRus-families.fa_2547/run_*/complete_mAcoRus-families.fa': No such file or directory
Splitting for simple/satellite packages
Running TRF
Running SA-SSR

Finding SSRs:
Running mreps
Trimming and sorting based on mreps, TRF, SA-SSR
Error in dplyr::mutate():
ℹ In argument: period = as.double(width(ssr)).
Caused by error:
! unable to find an inherited method for function ‘width’ for signature ‘"logical"’
Backtrace:
▆

├─... %>% filter(count > 2)
├─dplyr::filter(., count > 2)
├─dplyr::arrange(., seqnames)
├─dplyr::inner_join(., in_seq_tbl, by = "seqnames")
├─dplyr::mutate(., draft_seqnames = sub("#.*", "", seqnames))
├─dplyr::mutate(...)
├─dplyr::mutate(., ssr = ifelse(is.na(ssr), "NA", ssr), period = as.double(width(ssr)))
├─dplyr:::mutate.data.frame(., ssr = ifelse(is.na(ssr), "NA", ssr), period = as.double(width(ssr)))
│ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
│ ├─base::withCallingHandlers(...)
│ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
│ └─mask$eval_all_mutate(quo)
│ └─dplyr (local) eval()
├─BiocGenerics::width(ssr)
│ └─methods (local) <fn>(<list>, <stndrdGn>, <env>)
│ └─base::stop(...)
└─base::.handleSimpleError(...)
└─dplyr (local) h(simpleError(msg, call))

└─rlang::abort(message, class = error_class, parent = parent, call = error_call)

Execution halted
Removing temporary files
Reclassifying repeats
cp: cannot stat 'TS_mAcoRus-families.fa_2547/trf/mAcoRus-families.fa.nonsatellite': No such file or directory
No database indicated or it is an empty file.

/home/huayun/mdwilson/huayun/miniconda3/envs/earlgrey/bin/RepeatClassifier - 2.0.5
NAME
RepeatClassifier - Classify Repeat Models

SYNOPSIS
RepeatClassifier [-options] -consensi
[-stockholm ]

DESCRIPTION
The options are:

-h(elp)
    Detailed help

CONFIGURATION OVERRIDES
-ninja_dir
    The path to the installation of the Ninja phylogenetic analysis
    package.

-ucsctools_dir <string>
    The path to the installation directory of the UCSC TwoBit Tools
    (twoBitToFa, faToTwoBit, twoBitInfo etc).

-repeatmasker_dir <string>
    The path to the installation of RepeatMasker (RM 4.1.4 or higher)

-trf_dir <string>
    The full path to TRF program. TRF must be named "trf". ( 4.0.9 or
    higher )

-cdhit_dir <string>
    The path to the installation of the CD-Hit sequence clustering
    package.

-rmblast_dir <string>
    The path to the installation of the RMBLAST (2.13.0 or higher)

-recon_dir <string>
    The path to the installation of the RECON de-novo repeatfinding
    program.

-genometools_dir <string>
    The path to the installation of the GenomeTools package.

-ltr_retriever_dir <string>
    The path to the installation of the LTR_Retriever (v2.9.0 and
    higher) structural LTR analysis package.

-rscout_dir <string>
    The path to the installation of the RepeatScout ( 1.0.6 or higher )
    de-novo repeatfinding program.

-mafft_dir <string>
    The path to the installation of the MAFFT multiple alignment
    program.

trimAl might not be sufficient to clean MSA

Hi Tobias,

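I see that you use trimAl to clean the multiple sequence alignments. Based on your code, trimAl is run with -gt 0.6 and -cons 60. As I understand the trimAl options, -gt 0.6 removes columns in which fewer than 60% of the sequences have a residue (that is, more than 40% gaps), and -cons 60 guarantees that at least 60% of the original alignment columns are kept.

In my opinion, that won't be sufficient to clean the MSA and find the right boundaries for TEs.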

subprocess.run(SOFTWARE + 'trimal/source/trimal -in {} -gt 0.6 -cons 60 -fasta -out {}'.format('muscle/' + ALIGNED, 'muscle/' + FILEPREFIX + '_trimal.fa'), shell=True)

Yours sincerely
Jiangzhao
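
As a rough illustration of what gap-based column filtering does, the sketch below drops alignment columns whose gap fraction exceeds a cutoff. It is not trimAl's implementation, and the function names and toy alignment are made up. The trimAl documentation defines -gt as 1 minus the allowed gap fraction, so -gt 0.6 would correspond to max_gap_fraction=0.4 here. The -cons safeguard (a minimum percentage of original columns to keep) is not modelled.

```python
def gap_fraction(column):
    """Fraction of sequences that have a gap ('-') in this column."""
    return column.count("-") / len(column)


def filter_columns(alignment, max_gap_fraction):
    """Keep only columns whose gap fraction is <= max_gap_fraction."""
    if not alignment:
        return []
    columns = list(zip(*alignment))
    keep = [i for i, col in enumerate(columns)
            if gap_fraction(col) <= max_gap_fraction]
    return ["".join(seq[i] for i in keep) for seq in alignment]


# Toy alignment: the two leading columns are mostly gaps (ragged 5' edge).
toy = [
    "--ACGT-A",
    "-TACGTTA",
    "--ACG-TA",
    "G-ACGT-A",
    "--ACGTTA",
]

trimmed = filter_columns(toy, max_gap_fraction=0.4)
for row in trimmed:
    print(row)
```

A per-column filter like this removes sparse edges but does not judge whether the remaining flanks are homologous, which may be part of the concern raised above about finding true TE boundaries.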

Documentation update

Heya,

Great tool. I just updated to the newer version and have some suggestions for the docs. I can't use Docker on our system, so I had to install without sudo. Changes I'd suggest:

Recommend the famdb.py fix for RepeatMasker. Currently the docs suggest installing the latest version of Dfam, but this will lead to that weird error w/ 3.7.
cd-hit can be installed via conda, so there is no need for sudo.
It seems the trf module needs to be renamed to "trf" when configuring RepeatModeler, otherwise the config can't find it.
The conda env is quite complex and takes a while to resolve. I'm not sure if it can be pruned slightly and cd-hit included.

I'm aware you're probably quite busy with the latest updates, so I'd be happy to open a PR with some changes if that's easier. Still the most painless TE annotator I've encountered to date :)

Cheers

Permission denied

Hi,

Thanks for developing EarlGrey, it is very well explained and I hope I can complete my analyses with it! Here is my problem:

I installed EarlGrey following the installation guide step by step (RepeatMasker, RepeatModeler2, etc.). The pipeline runs smoothly until the "Straining TEs and Refining de novo Consensus Sequences" step. At this point, the programme builds a new DB and seems to add all the contigs correctly, but then it runs into "Permission denied" errors and crashes.

Changing folder and subfolder permissions doesn't seem to change anything. If this problem has to do with TEstrainer, I could not find any help elsewhere. Any help would be much appreciated!

Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1640 sequences in 17.0914 seconds.
Splitting run 1
Initial trf check for 1
0% 0:2556=0s rnd-1_family-78.fasta
/bin/bash: TS-families.fa_8111/run_1/raw/rnd-1_family-66.fasta: Permission denied
0% 1:2555=0s rnd-1_family-21.fasta
/bin/bash: TS-families.fa_8111/run_1/raw/rnd-1_family-84.fasta: Permission denied
0% 2:2554=0s rnd-1_family-91.fasta

[...]

99% 2555:0=0s rnd-5_family-2924.fasta
 /bin/bash: TS_ipomoeaTriloba-families.fa_7746/run_1/raw/rnd-5_family-2924.fasta: Permission denied
100% 2556:0=0s rnd-5_family-2924.fasta

Initial blast and preparation for MSA 1
0% 0:2556=0s rnd-1_family-78.fasta

TEstrainer

I am trying to follow the TE identification protocol from Baril & Hayward 2022. I am using a custom library, so I want to use the BEE script (TEstrainer) from Earl Grey after I have completed the RepeatModeler and RepeatMasker steps. However, there are differences between the TEstrainer bundled with Earl Grey and the standalone TEstrainer. Would it be OK to use TEstrainer outside Earl Grey to complete the BEE step? Thanks

installing earlgrey via conda

I had some trouble getting earlgrey installed via conda. Here is what ended up working for me: I had to create an env as below and then install earlgrey. Running the conda install command as per your GitHub instructions didn't work for me.

conda create -n earlgrey_env -c conda-forge python=3.8.15 mamba=1.2
conda activate earlgrey_env
mamba install -c bioconda earlgrey

I tested it with the NC_045808_EarlWorkshop.fasta from gitpod and it worked.

an error occurred, "Refining genome not found"

Hi Professor, thank you for developing this pipeline for TE annotation. Recently, I used it to annotate an arthropod genome. My genome size was 2.7 Gb. After running for more than 27 hours, it stopped. The last 50 lines of the log file are shown below:

Obtaining element sequences
Number of families returned by RECON: 35670
Processing families with greater than 15 elements
Instance Gathering: 00:00:30 (hh:mm:ss) Elapsed Time
About to run 2145 refinement jobs
Refinement: 07:55:55 (hh:mm:ss) Elapsed Time
Family Refinement: 07:56:30 (hh:mm:ss) Elapsed Time
Round Time: 16:59:13 (hh:mm:ss) Elapsed Time : 2009 families discovered.

RepeatScout/RECON discovery complete: 4091 families found

RepeatClassifier Version 2.0.4

Looking for Simple and Low Complexity sequences..
Looking for similarity to known repeat proteins..
Looking for similarity to known repeat consensi..
Classification Time: 05:53:16 (hh:mm:ss) Elapsed Time

Program Time: 27:15:15 (hh:mm:ss) Elapsed Time
Working directory: /public/home/rp1016swf/02.EarlG/spe_EarlGrey/spe_RepeatModeler/RM_28569.SatApr80935442023
may be deleted unless there were problems with the run.

The results have been saved to:
/public/home/rp1016swf/02.EarlG/spe_EarlGrey/spe_Database/spe-families.fa - Consensus sequences for each family identified.
/public/home/rp1016swf/02.EarlG/spe_EarlGrey/spe_Database/spe-families.stk - Seed alignments for each family identified.
/public/home/rp1016swf/02.EarlG/spe_EarlGrey/spe_Database/spe-rmod.log - Execution log. Useful for reproducing results.

The RepeatModeler stockholm file is formatted so that it can
easily be submitted to the Dfam database. Please consider contributing
curated families to this open database and be a part of this growing
community resource. For more information contact [email protected].

          )  (
         (   ) )
         ) ( (
       _______)_
    .-'---------|
   ( C|/\/\/\/\/|
    '-./\/\/\/\/|
     '_________'
      '-------'
    <<< Straining TEs and Refining de novo Consensus Sequences >>>

Refining genome not found
Usage: [-l Repeat library] [-g Genome ] [-t Threads (default 4) ] [-f Flank (default 1000) ] [-r Numver of iterations of BEET to run (deafult 10)] [-d Out directory, if not specified wil be created ] [-h Print this help] [-M Ammount of memory TEstrainer needs to keep free]
cp: cannot stat ‘/public/home/rp1016swf/02.EarlG/spe_EarlGrey/spe_strainer/TS*/spe-families.fa.strained’: No such file or directory
ERROR: TEstrainer failed to produce a strain file, please check the log file for more information

I checked the directory "/public/home/rp1016swf/02.EarlG/spe_EarlGrey/spe_strainer" and it is empty. However, the pipeline ran successfully and produced all result files when I tested it with a small genome file. I don't know what is wrong or how I should solve this problem.
Looking forward to your reply.

GFF problems (with Geneious) + feature suggestion

Hi,
I like this tool. I am currently using panTE, something a former colleague created a few years ago. It is very exhaustive but suffers from some of the problems mentioned in your preprint, mainly overlapping features. It is also not being developed anymore.

I just ran into a problem with the GFF file produced by EarlGrey: it does not work when imported into Geneious.
First, Geneious complains about the 'NA' in column 8 (phase of feature). This should be either 0, 1 or 2 for CDS features or '.' for everything else.
After fixing that, Geneious imports the file fine, but seems to connect all features that have the same (non-unique) "ID" tag.

AlKewell_ctg_01	RepeatMasker	Unknown	859	1137	1636	-	.	Tstart=903;Tend=1159;ID=RND-1_FAMILY-96;shortTE=F
AlKewell_ctg_01	RepeatMasker	Unknown	6213	6877	1979	-	.	Tstart=771;Tend=1440;ID=RND-4_FAMILY-443;shortTE=F
AlKewell_ctg_01	RepeatMasker	Unknown	7789	9093	8533	-	.	Tstart=868;Tend=2242;ID=RND-1_FAMILY-32;shortTE=F
AlKewell_ctg_01	RepeatMasker	Unknown	9095	9331	1007	-	.	Tstart=718;Tend=953;ID=RND-1_FAMILY-103;shortTE=F
AlKewell_ctg_01	RepeatMasker	Unknown	9536	10383	1843	-	.	Tstart=91;Tend=729;ID=RND-1_FAMILY-103;shortTE=F
AlKewell_ctg_01	RepeatMasker	LTR/Gypsy	10481	12295	10599	-	.	Tstart=6031;Tend=7846;ID=RND-4_FAMILY-173;shortTE=F
AlKewell_ctg_01	RepeatMasker	LTR/Gypsy	12513	16028	21611	-	.	Tstart=2497;Tend=6035;ID=RND-4_FAMILY-173;shortTE=F
AlKewell_ctg_01	RepeatMasker	DNA/hAT-Ac	16234	16507	1361	-	.	Tstart=6895;Tend=7199;ID=RND-1_FAMILY-1;shortTE=F
AlKewell_ctg_01	RepeatMasker	LTR/Gypsy	17463	19109	10105	-	.	Tstart=860;Tend=2505;ID=RND-4_FAMILY-173;shortTE=F
AlKewell_ctg_01	RepeatMasker	LTR/Gypsy	19937	20871	3210	-	.	Tstart=755;Tend=1649;ID=RND-1_FAMILY-9;shortTE=F
AlKewell_ctg_01	RepeatMasker	LTR/Gypsy	21089	21755	1529	+	.	Tstart=7174;Tend=7847;ID=RND-4_FAMILY-173;shortTE=F

As an improvement, I'd love to see a GFF file that is ready to run through NCBI's table2asn converter to produce annotations that can be submitted.

Thanks for this work!
Cheers,
Johannes
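
Until the output changes upstream, the two issues described above could be worked around with a small post-processing step. The sketch below is an assumption-laden illustration, not an Earl Grey feature. It replaces 'NA' in column 8 with '.', and makes each ID unique by appending a per-family counter while keeping the family name in a Name attribute. The suffix scheme and function names are my own, and whether the result satisfies Geneious or table2asn has not been verified.

```python
from collections import Counter


def fix_gff_line(line, seen):
    """Return one GFF line with phase 'NA' -> '.' and a unique ID attribute."""
    line = line.rstrip("\n")
    fields = line.split("\t")
    if line.startswith("#") or len(fields) != 9:
        return line  # leave comments and malformed lines untouched
    if fields[7] == "NA":
        fields[7] = "."
    new_attrs = []
    for attr in fields[8].split(";"):
        if attr.startswith("ID="):
            family = attr[3:]
            seen[family] += 1
            new_attrs.append(f"ID={family}_{seen[family]}")  # hypothetical suffix scheme
            new_attrs.append(f"Name={family}")
        else:
            new_attrs.append(attr)
    fields[8] = ";".join(new_attrs)
    return "\t".join(fields)


def fix_gff(lines):
    """Apply fix_gff_line to every line, sharing one counter per family."""
    seen = Counter()
    return [fix_gff_line(line, seen) for line in lines]


# Two records adapted from the example above, with 'NA' restored in column 8.
records = [
    "AlKewell_ctg_01\tRepeatMasker\tUnknown\t9095\t9331\t1007\t-\tNA\t"
    "Tstart=718;Tend=953;ID=RND-1_FAMILY-103;shortTE=F",
    "AlKewell_ctg_01\tRepeatMasker\tUnknown\t9536\t10383\t1843\t-\tNA\t"
    "Tstart=91;Tend=729;ID=RND-1_FAMILY-103;shortTE=F",
]

fixed = fix_gff(records)
for row in fixed:
    print(row)
```

Keeping the family in a separate attribute means copies of the same family can still be grouped after the IDs have been made unique.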

