tobybaril / earlgrey
Earl Grey: A fully automated TE curation and annotation pipeline
License: Other
I have been trying to run earlGrey with the clustering option, but CD-HIT fails and does not generate an output file. The error below shows that it is unable to open a file (the output file, as far as I can tell).
Program: CD-HIT, V4.8.1 (+OpenMP), Feb 22 2022, 21:26:56
Command: cd-hit-est -d 0 -aS 0.8 -c 0.8 -G 0 -g 1 -b 500 -r 1
-i
/projects/earlgrey/cgun_EarlGrey/cgun_strainer/TS_cgun-families.fa_5528/cgun-families.fa.strained
-o
/projects/earlgrey/cgun_EarlGrey/cgun_strainer/TS*/cgun-families.fa.strained.clstrd.fa
total seq: 53
longest and shortest : 11046 and 48
Total letters: 98615
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 0M
Buffer : 1 X 56M = 56M
Table : 1 X 16M = 16M
Miscellaneous : 4M
Total : 77M
Table limit with the given memory limit:
Max number of representatives: 230040
Max number of word counting entries: 90263642
comparing sequences from 0 to 53
53 finished 37 clusters
Approximated maximum memory consumption: 78M
writing new database
Fatal Error:
file opening failed
Program halted !!
If I try the same command with the full path for TS_cgun-families.fa_5528 instead of TS*, it works fine and creates the output file.
cd-hit-est -d 0 -aS 0.8 -c 0.8 -G 0 -g 1 -b 500 -r 1
-i
/projects/earlgrey/cgun_EarlGrey/cgun_strainer/TS_cgun-families.fa_5528/cgun-families.fa.strained
-o
/projects/earlgrey/cgun_EarlGrey/cgun_strainer/TS_cgun-families.fa_5528/cgun-families.fa.strained.clstrd.fa
Please let me know how it can be fixed so it doesn't fail while running the whole earlGrey pipeline.
Thanks
Bushra
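A possible explanation, offered as an assumption rather than a confirmed diagnosis: a shell wildcard such as TS* only expands to paths that already exist, and the output file does not exist yet, so the pattern reaches cd-hit-est as a literal string. The Python sketch below shows the general idea of resolving the directory first and building the output path from the result. The helper names `resolve_single_dir` and `build_output_path` are hypothetical and are not part of earlGrey.

```python
import glob
import os


def resolve_single_dir(pattern):
    """Expand a wildcard pattern and return the one directory it matches."""
    matches = [p for p in glob.glob(pattern) if os.path.isdir(p)]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one directory for {pattern!r}, found {len(matches)}"
        )
    return matches[0]


def build_output_path(dir_pattern, filename):
    """Join the resolved directory with the file name so no wildcard remains."""
    return os.path.join(resolve_single_dir(dir_pattern), filename)
```

With the paths from the report, `build_output_path(".../cgun_strainer/TS*", "cgun-families.fa.strained.clstrd.fa")` would return a path inside the existing TS_cgun-families.fa_5528 directory, provided exactly one TS* directory is present.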
I've been using EarlGrey on several plant genomes and noticed an abundance of RC / Helitron annotations around a gene class of interest. However, on closer inspection of these annotations, I'm not seeing the expected 5' TC or 3' CTRR motifs associated with this class. I looked at other RC / Helitron annotations in the genomes and am seeing the same effect.
Could this be due to how TE borders are defined in EarlGrey? Alternatively, is this due to how TEs are classified in general? I'd imagine that if they are classified solely by similarity to curated sequences in e.g. Dfam, then they'll frequently be misannotated. I'm not sure how this would compare to methods such as HelitronScanner, which use 5' and 3' motif sequences for annotation.
Happy to share data if needed!
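For anyone wanting to repeat the motif check, here is a minimal Python sketch of the test described above. It assumes each annotated element has already been extracted as a plain sequence string in its own orientation; `helitron_termini` is a hypothetical helper name, not something EarlGrey or HelitronScanner provides.

```python
import re

# IUPAC code R stands for a purine (A or G), so CTRR is CT followed by two purines.
THREE_PRIME_CTRR = re.compile(r"CT[AG][AG]$")


def helitron_termini(seq):
    """Return (has_5prime_TC, has_3prime_CTRR) for one annotated sequence."""
    s = seq.upper()
    return s.startswith("TC"), bool(THREE_PRIME_CTRR.search(s))
```

Annotations on the minus strand would need to be reverse-complemented before the check, and annotated borders that are off by even a few bases would make both tests fail, which is one way to distinguish a border problem from a classification problem.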
Hello, I am having a problem with testing the tool installation with the NC_045808_EarlWorkshop.fasta test file.
I am attaching the log file. Could you please help me solve it?
Thank you in advance
Dear @TobyBaril
I'm trying to install EarlGrey, which will be super useful for my thesis, but there is an error with conda during the ./configure step.
Checking RepeatMasker and RepeatModeler configuration
Success! RepeatMasker is installed and in PATH
Success! RepeatModeler is installed and in PATH
Collecting package metadata (repodata.json): done
Solving environment: \
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
I've installed RepeatMasker and RepeatModeler following "with sudo" instructions. My OS is Ubuntu 18.08, with miniconda3 v23.3.1.
I tried a few times and it gets stuck at this step each time.
My condarc:
auto_activate_base: false
report_errors: false
channels:
- defaults
channel_priority: strict
I have a reasonable internet connection, a lot of free space, and my conda is working properly for other environments. Do you have any idea how to solve this problem?
Thank you in advance!
I have been trying to run EarlGrey on my plant genome. It's a chromosome-level assembly of 332Mb. The fasta file consists of 42 sequences, with the largest chromosome being 65Mb. The smallest sequence in the file is 49280bp. All sequences have simple headers, for example: >DRSE_pseudomolecule_1. When I run this I get an error. Full log below. It seems to be an issue with the faswap.py script or with makeblastdb. As soon as it gets to the second sequence in the fasta file it gives a KeyError. The makeblastdb error might be an issue with RepeatModeler (see discussion here). As I said in issue #62, it works fine with the demo file from gitpod, so I don't know if the issue is in my fasta file, the size of the genome, or something else. I hope you can help solve this. The .prep and .prep.bak files are empty, so I think it's an issue with creating these inputs.
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< Cleaning Genome >>>
grep: /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_9CHR_PGA_assembly_renamed_49k.fasta: binary file matches
Traceback (most recent call last):
File "/shared/biology/bioldata1/bl-bl1067/Cobus/python/envs/earlgrey_env/share/earlgrey-3.2-0/scripts//faswap.py", line 13, in <module>
lines=map(lambda x: ">"+a[x[1:]] if x and x[0]==">" else x,lines);print("\n".join(lines))
File "/shared/biology/bioldata1/bl-bl1067/Cobus/python/envs/earlgrey_env/share/earlgrey-3.2-0/scripts//faswap.py", line 13, in <lambda>
lines=map(lambda x: ">"+a[x[1:]] if x and x[0]==">" else x,lines);print("\n".join(lines))
KeyError: 'DRSE_pseudomolecule_2'
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< Detecting Novel Repeats >>>
Building database DRSE_TEanno:
Reading /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_9CHR_PGA_assembly_renamed_49k.fasta.prep...
Died at /shared/biology/bioldata1/bl-bl1067/Cobus/python/envs/earlgrey_env/bin/BuildDatabase line 331.
The makeblastdb program exited with code 1. Please check your input file(s) for potential formating errors.
/shared/biology/bioldata1/bl-bl1067/Cobus/python/envs/earlgrey_env/bin/makeblastdb returned:
Building a new DB, current time: 11/07/2023 10:00:28
New DB name: /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno
New DB title: ./ef6ns6B8Mh
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
BLAST options error: File ./ef6ns6B8Mh is empty
The command used was: /shared/biology/bioldata1/bl-bl1067/Cobus/python/envs/earlgrey_env/bin/makeblastdb -blastdb_version 4 -out DRSE_TEanno -parse_seqids -dbtype nucl -in ./ef6ns6B8Mh 2>&1
BLAST Database error: No alias or index file found for nucleotide database [/shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno] in search path [/shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_RepeatModeler::]
RepeatModeler Version 2.0.4
===========================
Using output directory = /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_RepeatModeler/RM_185498.TueNov71000292023
Search Engine = rmblast 2.14.1+
Threads = 20
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1699351229
Database = /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno
Database /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno does not contain any sequences!
ERROR: RepeatModeler Failed, Retrying with limit set as Round 5
BLAST Database error: No alias or index file found for nucleotide database [/shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno] in search path [/shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_RepeatModeler::]
RepeatModeler Version 2.0.4
===========================
Using output directory = /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_RepeatModeler/RM_185506.TueNov71000292023
Search Engine = rmblast 2.14.1+
Threads = 20
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1699351229
Database = /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno
Database /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno does not contain any sequences!
ERROR: RepeatModeler Failed, Retrying with limit set as Round 4
BLAST Database error: No alias or index file found for nucleotide database [/shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno] in search path [/shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_RepeatModeler::]
RepeatModeler Version 2.0.4
===========================
Using output directory = /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_RepeatModeler/RM_185514.TueNov71000302023
Search Engine = rmblast 2.14.1+
Threads = 20
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1699351230
Database = /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno
Database /shared/biology/bioldata1/bl-bl1067/Cobus/genome_annotations/DRSE_EarlGrey/DRSE_EarlGrey_49k/DRSE_TEanno_EarlGrey/DRSE_TEanno_Database/DRSE_TEanno does not contain any sequences!
ERROR: RepeatModeler Failed
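One thing worth checking, based on the "binary file matches" message from grep in the log above: that message usually means the file contains bytes that grep does not treat as text. Whether this is what breaks the header swap here is an assumption, not a confirmed cause. The Python sketch below lists the first lines that hold anything other than printable ASCII, tabs, or line endings; `find_suspect_lines` is a hypothetical helper name.

```python
def find_suspect_lines(path, limit=10):
    """Return (line_number, offending_byte) for lines holding non-text bytes."""
    allowed = set(range(32, 127)) | {9, 10, 13}  # printable ASCII, tab, LF, CR
    hits = []
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            # First byte on this line that is outside the allowed set, if any.
            bad = next((b for b in line if b not in allowed), None)
            if bad is not None:
                hits.append((number, bad))
                if len(hits) >= limit:
                    break
    return hits
```

An empty result would rule this explanation out; any hit gives a line number to inspect in the fasta file.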
Hi Toby,
Is it possible to also get a soft-masked version of the genome after the final RepeatMasker step, without needing to run it again or losing the N-masked version?
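As a stopgap, a soft-masked sequence can in principle be derived from the original genome plus the N-masked output. The Python sketch below does this for a single sequence and assumes the two strings are the same record with identical lengths; `soft_mask` is a hypothetical helper, not an earlGrey function.

```python
def soft_mask(original, hard_masked):
    """Lowercase each base of `original` that is N in `hard_masked` but was not N originally."""
    if len(original) != len(hard_masked):
        raise ValueError("sequences must have the same length")
    return "".join(
        o.lower() if m in "Nn" and o not in "Nn" else o
        for o, m in zip(original, hard_masked)
    )
```

Positions that were already N in the original assembly (gaps) are left untouched, so only bases masked by RepeatMasker become lowercase.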
Hello @TobyBaril,
I (excitedly!) tried running the earlGrey version from bioconda
(Specifically, I'm using charliecloud to run the Docker biocontainer hosted here, built automatically from bioconda; I assume this is identical to running inside a conda env.)
I ran EarlGrey as earlGrey -s otipulae -o . -g genome.fasta -t 24 (the genome.fasta is available here).
On several attempts I got the error log below:
(Mem limit: 16GB, on slurm cluster, submitted from a nextflow workflow)
Error in fread(file) :
File '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.rmerge.gff.filtered' does not exist or is non-readable. getwd()=='/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats'
Execution halted
cp: can't stat '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.filteredRepeats.bed': No such file or directory
Traceback (most recent call last):
File "/usr/local/share/earlgrey-3.1-0/scripts//backSwap.py", line 14, in <module>
table = pd.read_csv(input, names = ['scaf', 'start', 'end', 'repeat', 'score', 'strand'], delim_whitespace = True, header = None)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 577, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
self._engine = self._make_engine(f, self.engine)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
self.handles = get_handle(
File "/usr/local/lib/python3.8/site-packages/pandas/io/common.py", line 859, in get_handle
handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.filteredRepeats.bed'
mv: can't rename '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.filteredRepeats.bed.2': No such file or directory
Traceback (most recent call last):
File "/usr/local/share/earlgrey-3.1-0/scripts//backSwapGFF.py", line 14, in <module>
table = pd.read_csv(input, names = ['scaf', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'], delim_whitespace = True, header = None)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 577, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
self._engine = self._make_engine(f, self.engine)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
self.handles = get_handle(
File "/usr/local/lib/python3.8/site-packages/pandas/io/common.py", line 859, in get_handle
handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.rmerge.gff.filtered'
mv: can't rename '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.rmerge.gff.filtered.2': No such file or directory
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning message:
Failed to locate timezone database
[1] "/usr/local/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/usr/local/share/earlgrey-3.1-0/scripts//makeGff.R"
[5] "--args"
[6] "/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.filteredRepeats.bed"
[7] "/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.rmerge.gff.filtered"
[8] "/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.filteredRepeats.gff"
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file '/scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_mergedRepeats//otipulae.filteredRepeats.bed': No such file or directory
Execution halted
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< Done! >>>
ERROR: strict merge also failed, check /scratch/Bio/bletcher/nextflow_workdir/a0/b461c1bb1a577af2fa7bae62d8bb1c/otipulae_EarlGrey/otipulae_RepeatMasker_Against_Custom_Library/genome.fasta.prep.out looks as expected
Do you know what this is due to?
Here's a link to the full output log file, if useful: https://drive.google.com/file/d/1drbTh1HU2Nz793zklukcfBVHxrl0abae/view?usp=sharing
Thanks and best
Hello!
I see in earlGrey.yml that the version of R used in earlGrey is 4.1.
However, install_r_packages.R installs version 3.15 of Bioconductor in my setup, which works with version 4.2 of R.
This leads to the last line of install_r_packages.R failing with the message:
Error: Bioconductor version '3.15' requires R version '4.2'; use `version = '3.14'`
The following errors/warnings also show up down the line when running earlGrey:
Trimming and sorting based on mreps, TRF, SA-SSR
Warning messages:
1: package ‘plyranges’ was built under R version 4.2.1
2: package ‘BiocGenerics’ was built under R version 4.2.1
3: package ‘IRanges’ was built under R version 4.2.1
4: package ‘S4Vectors’ was built under R version 4.2.1
5: package ‘GenomicRanges’ was built under R version 4.2.1
6: package ‘GenomeInfoDb’ was built under R version 4.2.1
Warning messages:
1: package ‘BSgenome’ was built under R version 4.2.1
2: package ‘Biostrings’ was built under R version 4.2.1
3: package ‘XVector’ was built under R version 4.2.1
4: package ‘rtracklayer’ was built under R version 4.2.1
5: no function found corresponding to methods exports from ‘BSgenome’ for: ‘releaseName’
Error in ..Internal(is.unsorted(x, FALSE, FALSE)) :
3 arguments passed to .Internal(is.unsorted) which requires 2
Calls: %>% ... <Anonymous> -> .splitAsList_by_integer_Rle -> <Anonymous>
Execution halted
Removing temporary files
Reclassifying repeats
I could downgrade Bioconductor to 3.14, but R warns me that I would be downgrading 828 packages.
What are the intended versions earlGrey should be working with here? Not sure if I am doing something wrong or if there is some issue with package repositories having been updated.
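To make the version pairing explicit, here is a small Python sketch of the lookup involved. The 3.14 and 3.15 entries follow from the error message above; the 3.13 and 3.16 entries are added from the Bioconductor release history for context. `compatible_bioc_releases` is a hypothetical helper and says nothing about which versions earlGrey itself intends to use.

```python
# Bioconductor release -> R series it is built for.
BIOC_TO_R = {"3.13": "4.1", "3.14": "4.1", "3.15": "4.2", "3.16": "4.2"}


def compatible_bioc_releases(r_version):
    """Return the Bioconductor releases in the table built for the given R version."""
    series = ".".join(r_version.split(".")[:2])  # "4.1.3" -> "4.1"
    return sorted(b for b, r in BIOC_TO_R.items() if r == series)
```

For an environment pinned to R 4.1, the lookup returns 3.13 and 3.14, which matches the hint in the error message to use `version = '3.14'`.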
Hello, everyone,
Thank you for your assistance in resolving my previous issue (#45)!
I would now like to incorporate this figure into both my project process PowerPoint and my upcoming scientific paper. However, I'm uncertain about how to appropriately label it. Could you please suggest a suitable name for the figure?
To be completely honest, I'm not fully comprehending the content of this figure as it falls outside my area of expertise. Could it possibly be a TE distribution? I've come across similar terms like "Transposable element (TE) sequence distribution based on Kimura distance." However, I believe it would be best to seek your guidance and expertise on this matter.
I would greatly appreciate it if you could provide your input as soon as possible. Thank you!
Best regards,
Zoe
Hi,
Thanks again for the interesting workshop for BGA23! As promised, I am looking into writing a bioconda recipe for this: https://github.com/dirkjanvw/bioconda-recipes/blob/add_earlgrey_v3.0/recipes/earlgrey It is not finished for sure, but I am wondering whether you could provide me with a fasta file (on ENA or NCBI) that is reasonably small which I can use for locally testing the bioconda recipe and local installation. That way I can make sure everything with the bioconda recipe will be correct before I submit it in a PR to bioconda-recipes.
Also, should you have any feedback or questions about the recipe, please let me know!
I've managed to run ~20 genomes with earlGrey and everything has been running pretty smoothly (either newly assembled or downloaded from NCBI). So the tool has been awesome.
But I've noticed something strange happening due to a combination of RepeatModeler, TE strainer, and potentially Darwin Tree of Life (dToL) genomes ...
So far it has happened to 2 genomes (I'm currently running more):
Bombus terrestris, fasta from the dToL project, GCA_910591885.2, downloaded from here: https://portal.darwintreeoflife.org/data/root/details/Bombus%20terrestris
and
Anoplius nigerrimus, downloaded from here:
https://portal.darwintreeoflife.org/data/root/details/Anoplius%20nigerrimus
Basically, TE strainer gets stuck blasting a particular sequence. I've re-run RepeatModeler, re-run the pipeline using a freshly downloaded fasta, and checked the genome fasta headers to see if they were problematic. Every attempt gets stuck and TE strainer runs indefinitely (the longest TE strainer run so far was ~14h with 44 threads, before I killed it).
The earlGrey logs so far (on my 3rd attempt), genome fasta, the repeatmodeler sequence that it is getting stuck on are all available here: https://www.dropbox.com/sh/co44hhg0l3mh890/AACEpm6uMkwd0OmLvM7QVfMPa?dl=0
I did a quick scan of the fasta, and for Bombus the only thing I gleaned is that the TE might be at the end of a contig?
Just curious if anything pops out to you as a problem, or if you've run into this issue before ...
Cheers.
Hi,
I am running earlGrey on Drosophila simulans using the reference genome from NCBI: https://ftp.ncbi.nih.gov/genomes/refseq/invertebrate/Drosophila_simulans/representative/GCF_016746395.2_Prin_Dsim_3.1/GCF_016746395.2_Prin_Dsim_3.1_genomic.fna.gz
But I got the following error:
.......
Processing RECON family: 723
- Saving 16 elements
- Refining family-723 model...
Family Refinement: 00:22:04 (hh:mm:ss) Elapsed Time
Round Time: 02:52:44 (hh:mm:ss) Elapsed Time
- Increasing sample size to include end piece now = 250032285
RepeatModeler Round # 6
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 243000000 bp
FastaDB::compact - Error could not locate file /data/home/xxx/dsim/repeatAnnotation/DsimEarlGrey/Dsim_RepeatModeler/RM_131689.ThuNov110910292021/round-6/sampleDB-6.fa!
at /data/home/xxx/RepeatModeler-2.0.2a/RepeatModeler line 862.
ERROR: RepeatModeler Failed
According to Dfam-consortium/RepeatModeler#111, the error is likely caused by a small or fragmented reference genome. I tried some other reference genomes and did not see this error.
Could this be fixed in earlGrey, or could you suggest a solution?
Thanks
Yiguan
I am trying to run EarlGrey on a plant genome, but I am getting an error:
No repetitive sequences were detected in /scratch/botany/katie/assembled_genomes/assemblies/sandwicensis/working/sandwicensis.fasta.prep
ERROR: RepeatMasker failed, please check logs. This is likely because of an invalid species search term, if issue persists please use NCBI Taxids (E.G Drosophila is replaced with 7125)
I have used the following command within a slurm file which was submitted to the cluster:
species="sandwicensis"
earlGrey -g /scratch/botany/katie/assembled_genomes/assemblies/$species/working/$species.fasta -s $species -o /scratch/botany/katie/maker/earlgrey/$species -r arabidopsis -t 32
I have tried replacing the -r option with the taxid of Arabidopsis, but this does not seem to work either.
When I try to run RepeatMasker alone on the .fasta.prep file created by EarlGrey, it seems to run fine, using the following command:
species="sandwicensis"
RepeatMasker -species arabidopsis /scratch/botany/katie/assembled_genomes/assemblies/sandwicensis/working/sandwicensis.fasta.prep
The output looks like this:
____________________
< Checking Parameters >
--------------------
\ ^__^
\ (oo)\_______
(__)\ )\/
||----w |
|| ||
De Novo Sequences Will Be Extended Through 5 Iterations
Clusters Will Be Considered When TEs Are <100bp Apart
Blast, Extract, Extend Process Will Add 1000bp to Each End in Each Iteration
Conda environment is active
____________________
< Making Directories >
--------------------
\ ^__^
\ (oo)\_______
(__)\ )\/
||----w |
|| ||
____________________
< Cleaning Genome >
--------------------
\ ^__^
\ (oo)\_______
(__)\ )\/
||----w |
|| ||
____________________
< Getting RepeatMasker Sequences for arabidopsis and Saving as Fasta >
--------------------
\ ^__^
\ (oo)\_______
(__)\ )\/
||----w |
|| ||
____________________
< Running Initial Mask with RepBase >
--------------------
\ ^__^
\ (oo)\_______
(__)\ )\/
||----w |
|| ||
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
RepeatMasker version 4.1.2
Search Engine: HMMER [ 3.3 (Nov 2019) ]
Using Master RepeatMasker Database: /apps/repeatmasker/4.1.2/Libraries/RepeatMaskerLib.h5
Title : Dfam
Version : 3.3
Date : 2020-11-09
Families : 6,953
Species/Taxa Search:
Arabidopsis [NCBI Taxonomy ID: 3701]
Lineage: root;cellular organisms;Eukaryota;Viridiplantae;
Streptophyta;Streptophytina;Embryophyta;Tracheophyta;
Euphyllophyta;Spermatophyta;Magnoliopsida;Mesangiospermae;
eudicotyledons;Gunneridae;Pentapetalae;rosids;malvids
9 families in ancestor taxa; 0 lineage-specific families
Building species libraries in: /home/user/emelianova/.RepeatMaskerCache/HMM-Dfam_3.3/arabidopsis
analyzing file /scratch/botany/katie/assembled_genomes/assemblies/sandwicensis/working/sandwicensis.fasta.prep
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
No repetitive sequences were detected in /scratch/botany/katie/assembled_genomes/assemblies/sandwicensis/working/sandwicensis.fasta.prep
ERROR: RepeatMasker failed, please check logs. This is likely because of an invalid species search term, if issue persists please use NCBI Taxids (E.G Drosophila is replaced with 7125)
Is there anything I am missing or should check?
Hey, can you help me with this issue?
RepeatMasker works well on my computer.
(earlGrey) marcin@marcin-B550-AORUS-ELITE-AX-V2:~/EarlGrey$ ./earlGrey -g bombyxMori.fasta. -s bombyxMori -o ./repeatAnnotation/ -r arthropoda -t 16
Hi,
I tried to run earlGrey locally via `mamba install earlgrey -c bioconda -c conda-forge` and updated to version 4.0 on conda.
About softmasking. Softmasking does not seem to work. On the first run I tried
nohup /usr/bin/time -v earlGrey -g ne_1019.fa -s v1019 -o ./repeatAnnotation -r hemiptera -t 20 -d yes > 1.log 2>2.err &
but found `/home/s76/mambaforge/envs/earlgrey/bin/earlGrey: illegal option -- d` in 2.err and `Softmasked genome will not be generated` in 1.log.
About threads. In some other pipelines, such as EDTA/LTR_retriever, the -t option should be 1/4 of the free threads, because rmblastn already runs 4 threads per process by default. earlGrey seems to work differently, and I want to use more threads to save time. I killed the process, did not remove any generated files, and reran with -t 80 to see if the runtime could be reduced.
nohup /usr/bin/time -v earlGrey -g ne_1019.fa -s v1019 -o ./repeatAnnotation -r hemiptera -t 80 -d yes > 1.log 2>2.err &
About consistency. In the second run with -t 80, I noticed that the `* families discovered` count of RepeatModeler Round #2 in rmod.log is different from the first run with -t 20. I killed the process, did not remove any generated files, and reran with -t 92 to see if the results differed. The `* families discovered` counts for -t 20, 80, and 92 are 7, 8, and 4, respectively.
3.1 rmod.log of first run with -t 20
RepeatModeler Round # 2
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 10000000 bp
- Sequence extraction : 00:00:08 (hh:mm:ss) Elapsed Time
-- Running TRFMask on the sequence...
- TRFMask time 00:00:15 (hh:mm:ss) Elapsed Time
-- Masking repeats from the previous rounds...
15159 repeats masked totaling 2566426 bp(s).
- TE Masking time 00:00:15 (hh:mm:ss) Elapsed Time
-- Sample Stats:
Sample Size 10442024 bp
Num Contigs Represented = 17
Non ambiguous bp:
Initial: 10005258 bp
After Masking: 7313959 bp
Masked: 26.90 %
-- Input Database Coverage: 10442024 bp out of 525502193 bp ( 1.99 % )
Sampling Time: 00:00:38 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
- Total Comparisons = 33930
Comparison Time: 00:03:01 (hh:mm:ss) Elapsed Time, 5226 HSPs Collected
Number of families returned by RECON: 1108
Round Time: 00:03:44 (hh:mm:ss) Elapsed Time : 7 families discovered.
3.2 rmod.log of second run with -t 80
RepeatModeler Round # 2
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 10000000 bp
- Sequence extraction : 00:00:08 (hh:mm:ss) Elapsed Time
-- Running TRFMask on the sequence...
- TRFMask time 00:00:15 (hh:mm:ss) Elapsed Time
-- Masking repeats from the previous rounds...
15753 repeats masked totaling 2635065 bp(s).
- TE Masking time 00:00:07 (hh:mm:ss) Elapsed Time
-- Sample Stats:
Sample Size 10480609 bp
Num Contigs Represented = 15
Non ambiguous bp:
Initial: 10014725 bp
After Masking: 7235025 bp
Masked: 27.76 %
-- Input Database Coverage: 10480609 bp out of 525502193 bp ( 1.99 % )
Sampling Time: 00:00:31 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
- Total Comparisons = 34191
Comparison Time: 00:01:55 (hh:mm:ss) Elapsed Time, 5866 HSPs Collected
Number of families returned by RECON: 1187
Round Time: 00:02:46 (hh:mm:ss) Elapsed Time : 8 families discovered.
3.3 rmod.log of third run with -t 92
RepeatModeler Round # 2
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 10000000 bp
- Sequence extraction : 00:00:09 (hh:mm:ss) Elapsed Time
-- Running TRFMask on the sequence...
- TRFMask time 00:00:16 (hh:mm:ss) Elapsed Time
-- Masking repeats from the previous rounds...
15858 repeats masked totaling 2733407 bp(s).
- TE Masking time 00:00:08 (hh:mm:ss) Elapsed Time
-- Sample Stats:
Sample Size 10480663 bp
Num Contigs Represented = 17
Non ambiguous bp:
Initial: 10007133 bp
After Masking: 7117176 bp
Masked: 28.88 %
-- Input Database Coverage: 10480663 bp out of 525502193 bp ( 1.99 % )
Sampling Time: 00:00:34 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
- Total Comparisons = 34191
Comparison Time: 00:01:54 (hh:mm:ss) Elapsed Time, 5194 HSPs Collected
Number of families returned by RECON: 1099
Round Time: 00:02:33 (hh:mm:ss) Elapsed Time : 4 families discovered.
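To compare runs without reading the logs by eye, the per-round counts can be pulled out of rmod.log with a short script. The Python sketch below relies only on the two line formats visible in the excerpts above ("RepeatModeler Round # N" and "... : N families discovered."); `families_per_round` is a hypothetical helper name.

```python
import re

ROUND_HEADER = re.compile(r"RepeatModeler Round #\s*(\d+)")
FAMILIES = re.compile(r":\s*(\d+) families discovered")


def families_per_round(log_text):
    """Map each RepeatModeler round number to its 'families discovered' count."""
    counts = {}
    current = None
    for line in log_text.splitlines():
        header = ROUND_HEADER.search(line)
        if header:
            current = int(header.group(1))
            continue
        found = FAMILIES.search(line)
        if found and current is not None:
            counts[current] = int(found.group(1))
    return counts
```

Note that the three excerpts also differ in sample size and in the number of contigs represented, so the differing counts may reflect which sequence was sampled in each run rather than the thread count itself; that reading is an assumption drawn from the logs, not a confirmed explanation.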
Hi,
I installed the docker version as described in the docker section of the Readme, but it keeps crashing with ERROR: RepeatModeler Failed
I have successfully run the same genome after installing EarlGrey as described in the with sudo section.
Attached is the log file from the failing run.
AlentisEarlGrey.log
As well as the successful run without docker:
AlentisEarlGrey_good.log
Cheers.
I had a quick look in the files and line 2474 of the failed run says Missing /opt/RepeatMasker/Libraries/RepeatMasker.lib.nsq!
That might point to a problem during the build process of the container.
I'll try to build it again from scratch.
Hi there. While using the latest docker container (pulled on 2023-06-30) and running a test with this command:
earlGrey -g sequence.fasta -s speciesName -o outputDirectory -t 16
this error came up after RepeatModeler:
Trimming and sorting based on mreps, TRF, SA-SSR
Removing temporary files
Reclassifying repeats
Unknown option: pa
/opt/RepeatModeler/RepeatClassifier - 2.0.4
followed by the help screen text of RepeatClassifier.
I checked this against a colleague's older run of EarlGrey and they had
/opt/RepeatModeler/RepeatClassifier - 2.0.2
So my best guess is that this latest version of RepeatClassifier does not have the -pa option for specifying how many threads to run in parallel (see also https://github.com/Dfam-consortium/RepeatModeler/blob/master/RepeatClassifier and https://github.com/Dfam-consortium/RepeatModeler/blob/master/RepModelConfig.pm).
Hello Toby,
Thank you for generating the tool and for the detailed and useful paper besides it! I work on a HPC and we installed earlgrey with a singularity container using the docker file as a template. I have run EarlGrey twice, on Chromosome 1 of two species using: earlGrey -g species_chr1.fas -s species_chr1 -o earlGreyOutputs -t 34
Firstly, unfortunately, both times my species.chr1_Curated_Library directory is empty. Do you have any idea why? Is it possibly because one chromosome does not have enough data for this? All other directories have content. Also, my summaryFiles directory is missing the de novo and combined repeat fasta library.
Secondly, the output .gff file in summaryFiles is: .filteredRepeats.gff
but this is very limited and shows hardly any TEs the same as previous runs with EDTA and RepeatMasker. The .gff I found most useful was the: .out.gff3
in the mergedRepeats
directory, however the names of the TEs are not contained in the .gff, only "dispersed_repeat" in column 3 rather than a TE classification. For my analysis I need a .gff file with the coordinates and classification, maybe I have to intersect this with the familyLevelCount.txt
if there is no other .gff that should have been produced?
For housekeeping:
These are the outputs in my SummaryFiles directory:
-families.fa.strained
.filteredRepeats.gff
.filteredRepeats.bed
.repeatLandscape.pdf
.familyLevelCount.txt
.highLevelCount.txt
.summaryPie.pdf
My results in the mergedRepeats directory:
.filteredRepeats.gff
.rmerge.gff.filtered
.filteredRepeats.bed
.filteredRepeats.bed.bak
.filteredRepeats.summary
.mergedRepeats.revisedTable
.mergedRepeats.bed
.rmerge.gff.sorted
.summary.txt
.rmerge.gff
.rclabel.gff
ltrfinder_reformat_label.gff
_chr1RepeatCraft
.out.gff2
.out.gff3
_chr1LtrFinder
N.B all these files in both directories were created at the same time... I don't know if that helps!
Also, if there is some kind of info page on the outputs that I have missed, please direct me there. I hope my queries make sense.
Best wishes,
Isabella
Hello,
Thanks for your work on this package!
I have been having issues trying to use EarlGrey as a non-root user on my university compute cluster. I followed the instructions under Earl Grey Installation and Configuration (If you DO NOT have RepeatMasker and RepeatModeler) - WITHOUT SUDO PRIVILEGES, but I also had to install CDHIT to properly configure RepeatMasker.
The program runs fine until this stage:
<<< Straining TEs and Refining de novo Consensus Sequences >>>
Building a new DB, current time: 08/28/2023 00:43:07
New DB name: /work/users/a/d/adaigle/EarlGrey_experiments/data/ISO1_GCF_000001215.4_Release_6_prepped.fasta.
prep
New DB title: /work/users/a/d/adaigle/EarlGrey_experiments/data/ISO1_GCF_000001215.4_Release_6_prepped.fasta.
prep
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1870 sequences in 2.40207 seconds.
Splitting run 1
Initial trf check for 1
sh: /dev/tty: No such device or address
From my brief googling, I suspect this tty error might have to do with me not having administrative privileges on my cluster, but I'm not sure. It could also be that I am launching this as a SLURM job.
Again the program seems to have run without error (and produced outputs) until this stage. The strainer directory still has outputs, but my Curated Library folder is empty. Let me know if you need any more information!
Hi Toby,
I've had trouble installing via the docker, with the following output at the configure stage:
./configure: 5: /anaconda3/envs/earlGrey/etc/conda/activate.d/activate-binutils_linux-64.sh: Syntax error: "(" unexpected The command '/bin/sh -c cd /opt/ && git clone https://github.com/TobyBaril/EarlGrey && cd EarlGrey && chmod +x ./configure && eval "$(/anaconda3/bin/conda shell.bash hook)" && ./configure' returned a non-zero code: 2
Potentially this is because the YML file that creates the Anaconda environment inside the container has pinned versions of packages that have a bug?
conda/conda#9959
Let me know if this is fixable
Thank you,
Luke
Hello,
I was trying to use Earl Grey for my genome, but at the end the execution was halted due to the error below:
FileNotFoundError: [Errno 2] No such file or directory: '/home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_mergedRepeats//plasmodiophoraBrassicae.filteredRepeats.bed'
mv: cannot stat '/home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_mergedRepeats//plasmodiophoraBrassicae.filteredRepeats.bed.2': No such file or directory
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.4 ✔ purrr 1.0.2
✔ tibble 3.2.1 ✔ dplyr 1.1.4
✔ tidyr 1.3.0 ✔ stringr 1.5.1
✔ readr 2.1.4 ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Warning messages:
1: package ‘ggplot2’ was built under R version 4.2.3
2: package ‘tibble’ was built under R version 4.2.3
3: package ‘tidyr’ was built under R version 4.2.3
4: package ‘readr’ was built under R version 4.2.3
5: package ‘purrr’ was built under R version 4.2.3
6: package ‘dplyr’ was built under R version 4.2.3
7: package ‘stringr’ was built under R version 4.2.3
8: package ‘forcats’ was built under R version 4.2.3
[1] "/home/edelab/miniconda3/envs/earlgrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/edelab/miniconda3/envs/earlgrey/share/earlgrey-4.0-0/scripts//makeGff.R"
[5] "--args"
[6] "/home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_mergedRepeats//plasmodiophoraBrassicae.filteredRepeats.bed"
[7] "/home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_mergedRepeats//plasmodiophoraBrassicae.rmerge.gff.filtered"
[8] "/home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_mergedRepeats//plasmodiophoraBrassicae.filteredRepeats.gff"
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file '/home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_mergedRepeats//plasmodiophoraBrassicae.filteredRepeats.bed': No such file or directory
Execution halted
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< Done! >>>
ERROR: strict merge also failed, check /home/edelab/pacbio/genome_analysis/earlgray/pb_output/plasmodiophoraBrassicae_EarlGrey/plasmodiophoraBrassicae_RepeatMasker_Against_Custom_Library/pb3A_assembly.fasta.prep.out looks as expected
Can you help me in this regard?
Thanks
As far as I can see, there are _.filteredRepeats.bed, _.highLevelCount.txt, _.summaryPie.pdf, _de_novo_repeat_library_iter4.fasta.clustered.fa, _.filteredRepeats.gff and _.repeatLandscape.pdf in the __summaryFiles/
and
_.clusTErs.bed and _.detailedClusTErs.bed in the _clusTErs/
Could you please tell me how to get the repeat-masked genome for the next annotation step, e.g. Braker2? Thank you!
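One possible route, sketched here rather than taken from the Earl Grey documentation, is to soft-mask the genome yourself from the intervals in the .filteredRepeats.bed file listed above. The sketch assumes BED coordinates (0-based, end-exclusive), a genome small enough to hold in memory, and that lowercase soft masking is what the downstream annotator expects. The function name `soft_mask` and the toy data are hypothetical. For real genomes, `bedtools maskfasta` with the `-soft` flag does the same job on files.

```python
def soft_mask(sequences, bed_intervals):
    """Return a copy of `sequences` with the given BED intervals lowercased.

    sequences: dict mapping sequence name -> sequence string
    bed_intervals: iterable of (name, start, end), 0-based and end-exclusive
    """
    masked = {name: list(seq) for name, seq in sequences.items()}
    for name, start, end in bed_intervals:
        if name not in masked:
            continue  # interval refers to a sequence we do not have
        chars = masked[name]
        # Clamp to the sequence bounds so malformed intervals cannot overrun.
        start = max(start, 0)
        end = min(end, len(chars))
        if start < end:
            chars[start:end] = [c.lower() for c in chars[start:end]]
    return {name: "".join(chars) for name, chars in masked.items()}


# Toy example: one 10 bp sequence with a single repeat at positions 2..4.
genome = {"chr1": "ACGTACGTAC"}
repeats = [("chr1", 2, 5)]
masked_genome = soft_mask(genome, repeats)
print(masked_genome["chr1"])
```

Whether soft or hard masking is appropriate depends on the annotation tool, so check its documentation before relying on this.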
Hello! One of my earlGrey runs died several days into its runtime, and I was hoping to avoid having to wait for all that work to complete again. Is there a way to re-enter the pipeline at a particular point?
It looks like my run failed at Trimming and sorting based on mreps, TRF, SA-SSR, which makes sense given my R issues in #57.
The log following this section reads:
Trimming and sorting based on mreps, TRF, SA-SSR
Warning messages:
1: package ‘plyranges’ was built under R version 4.2.1
2: package ‘BiocGenerics’ was built under R version 4.2.1
3: package ‘IRanges’ was built under R version 4.2.1
4: package ‘S4Vectors’ was built under R version 4.2.1
5: package ‘GenomicRanges’ was built under R version 4.2.1
6: package ‘GenomeInfoDb’ was built under R version 4.2.1
Warning messages:
1: package ‘BSgenome’ was built under R version 4.2.1
2: package ‘Biostrings’ was built under R version 4.2.1
3: package ‘XVector’ was built under R version 4.2.1
4: package ‘rtracklayer’ was built under R version 4.2.1
5: no function found corresponding to methods exports from ‘BSgenome’ for: ‘releaseName’
Error in ..Internal(is.unsorted(x, FALSE, FALSE)) :
3 arguments passed to .Internal(is.unsorted) which requires 2
Calls: %>% ... <Anonymous> -> .splitAsList_by_integer_Rle -> <Anonymous>
Execution halted
Removing temporary files
Reclassifying repeats
cp: cannot stat 'TS_Tagetes_erecta-families.fa_8767/trf/Tagetes_erecta-families.fa.nonsatellite': No such file or directory
No database indicated or it is an empty file.
/lab/solexa_weng/testtube/matthew/earlgrey/RepeatModeler-2.0.4/RepeatClassifier - 2.0.4
... [the help info for repeatclassifier is printed here]
/lab/solexa_weng/Documents/matthew/scripts/test/earlGrey/Tagetes_erecta_EarlGrey/Tagetes_erecta_strainer
Compiling library
cat: TS_Tagetes_erecta-families.fa_8767/trf/Tagetes_erecta-families.fa.satellites: No such file or directory
ERROR: TEstrainer failed to produce a strain file, please check the log file for more information
slurmstepd: error: Detected 8 oom-kill event(s) in StepId=645700.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
Hi Toby
I'm trying to use earlGrey to generate a set of libraries for further curation. However, I'm facing an inconvenience.
After the de novo prediction and classification, when earlGrey use TEstrainer_for_earlGrey.sh, I obtain a message like this:
Splitting run 1
Initial trf check for 1
sh: 1: cannot open /dev/tty: No such device or address
5% 157:2521=0s rnd-1_family-78.fasta sh: 1: cannot open /dev/tty: No such device or address
13% 360:2318=6s rnd-1_family-358.fasta sh: 1: cannot open /dev/tty: No such device or address
20% 560:2118=7s rnd-1_family-561.fasta sh: 1: cannot open /dev/tty: No such device or address
28% 762:1916=7s rnd-1_family-744.fasta sh: 1: cannot open /dev/tty: No such device or address
36% 966:1712=7s rnd-1_family-967.fasta sh: 1: cannot open /dev/tty: No such device or address
43% 1170:1508=6s rnd-4_family-1559.fasta sh: 1: cannot open /dev/tty: No such device or address
51% 1368:1310=5s rnd-4_family-1301.fasta sh: 1: cannot open /dev/tty: No such device or address
58% 1570:1108=4s rnd-4_family-2876.fasta sh: 1: cannot open /dev/tty: No such device or address
66% 1773:905=4s rnd-5_family-4154.fasta sh: 1: cannot open /dev/tty: No such device or address
73% 1975:703=3s rnd-5_family-699.fasta sh: 1: cannot open /dev/tty: No such device or address
81% 2178:500=2s rnd-5_family-5375.fasta sh: 1: cannot open /dev/tty: No such device or address
88% 2379:299=1s rnd-5_family-2862.fasta sh: 1: cannot open /dev/tty: No such device or address
96% 2583:95=0s rnd-5_family-6095.fasta sh: 1: cannot open /dev/tty: No such device or address
100% 2678:0=0s rnd-5_family-12418.fasta
Initial blast and preparation for MSA 1
sh: 1: cannot open /dev/tty: No such device or address
0% 0:2678=0s rnd-1_family-21.fasta Whoa boy, rnd-1_family-134#DNA/Zisupton is inside tandem repeat
0% 1:2677=0s rnd-1_family-19.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 1:2677=44m37s rnd-1_family-19.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 1:2677=44m37s rnd-1_family-19.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 1:2677=44m37s rnd-1_family-19.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 1:2677=44m37s rnd-1_family-19.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 1:2677=44m37s rnd-1_family-19.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 1:2677=44m37s rnd-1_family-19.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 5:2673=44m37s rnd-1_family-38.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 11:2667=49m00s rnd-1_family-31.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 13:2665=45m49s rnd-1_family-55.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 13:2665=43m02s rnd-1_family-55.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 14:2664=42m14s rnd-1_family-68.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 14:2664=42m01s rnd-1_family-68.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 14:2664=42m00s rnd-1_family-68.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 16:2662=41m40s rnd-1_family-161.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 18:2660=41m03s rnd-1_family-87.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 22:2656=39m00s rnd-1_family-156.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 23:2655=37m20s rnd-1_family-8.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 24:2654=36m50s rnd-1_family-81.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 25:2653=36m20s rnd-1_family-63.fasta sh: 1: cannot open /dev/tty: No such device or address
0% 26:2652=35m57s rnd-1_family-40.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 27:2651=35m45s rnd-1_family-32.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 30:2648=35m18s rnd-1_family-197.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 33:2645=34m03s rnd-1_family-45.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 34:2644=33m12s rnd-1_family-130.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 35:2643=32m50s rnd-1_family-1.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 36:2642=32m36s rnd-1_family-128.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 36:2642=32m36s rnd-1_family-128.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 37:2641=32m36s rnd-1_family-153.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 37:2641=32m36s rnd-1_family-153.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 39:2639=32m36s rnd-1_family-115.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 39:2639=32m36s rnd-1_family-115.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 41:2637=32m36s rnd-1_family-91.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 44:2634=32m36s rnd-1_family-121.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 50:2628=31m57s rnd-1_family-182.fasta sh: 1: cannot open /dev/tty: No such device or address
1% 51:2627=31m30s rnd-1_family-58.fasta sh: 1: cannot open /dev/tty: No such device or address
Do you have any clue about what is going on?
Thanks for your support!
Hi all,
This is a naive question, as I'm not personally a user of EarlGrey, but am interpreting output produced by a collaborator.
What is the meaning of the 'score' column of the GFF output? Is it Kimura distance?
I apologise if this information was included in the documentation but I missed it.
Thank you in advance,
Luke
Hi,
Thank you for making earlGrey. I tried to use it through gitpod and wanted to include RepBase.
I uploaded RMRBSeqs.embl
and RMRBSeqs.embl
in /workspace/conda/envs/earlGrey/share/RepeatMasker/Libraries
and run perl ./configure
(earlGrey) gitpod /workspace/conda/envs/earlGrey/share/RepeatMasker $ perl ./configure
Enter Selection: 5
Building FASTA version of RepeatMasker.lib ........................................
Building RMBlast frozen libraries..
The program is installed with a the following repeat libraries:
File: /workspace/conda/envs/earlGrey/share/RepeatMasker/Libraries/RepeatMaskerLib.h5
FamDB Generator: famdb.py v0.4.2
FamDB Format Version: 0.5
FamDB Creation Date: 2023-01-08 10:42:05.645898
Database: Dfam withRBRM
Version: 3.7
Date: 2023-01-11
Dfam - A database of transposable element (TE) sequence alignments and HMMs.
RBRM - RepBase RepeatMasker Edition - version 20181026
Total consensus sequences: 64595
Total HMMs: 19730
Further documentation on the program may be found here:
/workspace/conda/envs/earlGrey/share/RepeatMasker/repeatmasker.help
This changed RMBlast to [ Configured, Default ] and left the others unconfigured.
I then ran the command
nohup /usr/bin/time -v earlGrey -g /workspace/ne_1019.fa -s v1019 -o ./repeatAnnotation -r insecta -t 4 > 1.log 2>2.err &
I noticed in logfile it reported
/workspace/EarlGrey/earlGrey: line 83: famdb.py: command not found
Is this a bug, or is it OK to ignore? Am I configuring and running it correctly?
Progress:3571426/3571437...
Progress:3571427/3571437...
Progress:3571428/3571437...
Progress:3571429/3571437...
Progress:3571430/3571437...
Progress:3571431/3571437...
Progress:3571432/3571437...
Progress:3571433/3571437...
Progress:3571434/3571437...
Progress:3571435/3571437...
Progress:3571436/3571437...Step 5: Merging GFF records by labels...
Step 6: Writing stat file..Removing tmp files...
Done
Traceback (most recent call last):
File "/home/dell/EarlGrey/scripts/repeatCraft/repeatcraft.py", line 187, in
rcStatm.rcstat(rclabelp=outputnamelabel,rmergep=outputnamemerge,outfile= statfname, ltrgroup = True)
File "/home/dell/EarlGrey/scripts/repeatCraft/helper/rcStatm.py", line 54, in rcstat
if rowRaw.get(col[2]):
IndexError: list index out of range
\ ^__^
\ (oo)\_______
(__)\ )\/
||----w |
|| ||
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind, colnames,
dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Warning messages:
1: package ‘GenomicRanges’ was built under R version 4.1.2
2: package ‘S4Vectors’ was built under R version 4.1.2
3: package ‘IRanges’ was built under R version 4.1.2
Warning message:
package ‘ape’ was built under R version 4.1.2
[1] "/home/dell/miniconda3/envs/earlGrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/dell/EarlGrey/scripts/filteringOverlappingRepeats.R"
[5] "--args"
[6] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.rmerge.gff.sorted"
[7] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.rmerge.gff.filtered"
Error: package or namespace load failed for ‘tidyverse’:
package ‘rlang’ was installed before R 4.0.0: please re-install it
Execution halted
cp: cannot stat '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed': No such file or directory
Traceback (most recent call last):
File "/home/dell/EarlGrey/scripts/backSwap.py", line 14, in
table = pd.read_csv(input, names = ['scaf', 'start', 'end', 'repeat', 'score', 'strand'], delim_whitespace = True, header = None)
File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 948, in init
self._make_engine(self.engine)
File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 2010, in init
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.cinit
File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] No such file or directory: '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed'
mv: cannot stat '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed.2': No such file or directory
Error: package or namespace load failed for ‘tidyverse’:
package ‘rlang’ was installed before R 4.0.0: please re-install it
Execution halted
\ ^__^
\ (oo)\_______
(__)\ )\/
||----w |
|| ||
\ ^__^
\ (oo)\_______
(__)\ )\/
||----w |
|| ||
Error: package or namespace load failed for ‘tidyverse’:
package ‘rlang’ was installed before R 4.0.0: please re-install it
Execution halted
Error: package or namespace load failed for ‘tidyverse’:
package ‘rlang’ was installed before R 4.0.0: please re-install it
Execution halted
\ ^__^
\ (oo)\_______
(__)\ )\/
||----w |
|| ||
Error: Unable to open file /home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed. Exiting.
Error: The requested file (/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed) could not be opened. Error message: (No such file or directory). Exiting!
\ ^__^
\ (oo)\_______
(__)\ )\/
||----w |
|| ||
cp: cannot stat '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed': No such file or directory
cp: cannot stat '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.gff': No such file or directory
\ ^__^
\ (oo)\_______
(__)\ )\/
||----w |
|| ||
\ ^__^
\ (oo)\_______
(__)\ )\/
||----w |
|| ||
Could you please tell me how to solve it? thank you!
Hello, I wonder how you would expect EarlGrey to perform with low-coverage datasets, and how it might compare with RepeatExplorer2? Thank you.
Hi,
One of the features I found useful with EDTA was the script for combining consensus repeat libraries across multi-sample datasets. Is there a way to combine Earl Grey results from assemblies of different individuals, or better yet use a Pangenome graph or VCF file as the input?
I produced a fairly simple wrapper script for annotating a VCF with a multi-sample consensus repeat library. Just using RepeatMasker to annotate the inserted sequences in a "pangenome" VCF (i.e. variants called with assembly vs assembly alignments). https://github.com/swomics/VCF_TE_annotate. Maybe this could be tweaked to become a module?
Cheers,
Sam
Hi Toby,
How long does earlGrey usually take to complete? The genome assembly is about 650MB with 17 chrs. The EarlGrey-v3.0 job (with '-t 8') has been running for over 9 days. Is this normal? Thank you.
Hi Toby!
I've been analysing the results from running the latest version of earlGrey and ran into an error when using bedtools coverage with the final gff made by Earl Grey (found in X_summaryFiles).
The error from running bedtools coverage -a species.seqlen.bed -b species.repeats.gff is:
Error: Invalid record in file species.repeats.gff. Record is
LR990257.1 RepeatMasker Unknown 66015 65838 500 - . TSTART=2;TEND=63;ID=RND-1_FAMILY-358;SHORTTE=T
I believe this is because bedtools expects GFFs to always have a start coordinate smaller than or equal to the end coordinate, which is the convention for GFFs. It would be great if this could be fixed so that the GFFs are consistent.
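As a stopgap until the output is fixed, a minimal sketch of a post-processing step is shown here. It assumes the only problem is that start and end are swapped on some records, and that the file is tab-separated with nine columns (the record above is shown with spaces only because of how it was pasted). The function name `normalise_gff_coords` is hypothetical and not part of Earl Grey.

```python
import io


def normalise_gff_coords(lines):
    """Yield GFF lines with start <= end, swapping columns 4 and 5 if needed."""
    for line in lines:
        stripped = line.rstrip("\n")
        # Pass comments and blank lines through untouched.
        if not stripped or stripped.startswith("#"):
            yield stripped
            continue
        fields = stripped.split("\t")
        # A GFF record has nine tab-separated columns; leave anything else alone.
        if len(fields) == 9:
            start, end = int(fields[3]), int(fields[4])
            if start > end:
                fields[3], fields[4] = str(end), str(start)
        yield "\t".join(fields)


# The offending record from the error message, rebuilt with tabs.
example = (
    "LR990257.1\tRepeatMasker\tUnknown\t66015\t65838\t500\t-\t.\t"
    "TSTART=2;TEND=63;ID=RND-1_FAMILY-358;SHORTTE=T\n"
)
fixed = list(normalise_gff_coords(io.StringIO(example)))
print(fixed[0])
```

The strand column is left as it is, since orientation is already recorded there and only the coordinate order needs to change.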
Best wishes,
Charlotte
disregard, just made it to the end of the installation description :-)
Hi Toby,
I sometimes have issues with the backSwap.py step. There are no obvious error messages, but I'm not getting the file with properly swapped chr names, which causes issues in the subsequent step. I was wondering if you think this awk one-liner is doing the correct thing when swapping the chr names back:
awk 'BEGIN{FS="\t";OFS="\t"}NR==FNR{a[$2]=$1;next}($1 in a){$1=a[$1]}1' {dict} {species}.rmerge.gff.filtered > {species}.rmerge.gff.filtered.2
If so, is it possible to replace backSwap.py with something similar?
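For comparison, here is a rough Python equivalent of what the awk command does, which may make it easier to check against backSwap.py. It is only a sketch: it assumes the dictionary file is tab-separated with the original name in column 1 and the temporary name in column 2 (the order implied by `a[$2]=$1`), and the function name `swap_names_back` and the temporary name `seq1` are made up for illustration.

```python
def swap_names_back(dict_lines, gff_lines):
    """Replace temporary sequence names in column 1 with the original names.

    dict_lines: lines of "original_name<TAB>temporary_name"
    gff_lines: tab-separated annotation lines whose first column is the
               temporary name
    """
    # Build temporary -> original lookup, mirroring a[$2]=$1 in the awk version.
    mapping = {}
    for line in dict_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            mapping[fields[1]] = fields[0]

    restored = []
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        # Names missing from the dictionary are left unchanged, as in awk.
        fields[0] = mapping.get(fields[0], fields[0])
        restored.append("\t".join(fields))
    return restored


# Toy input: "seq1" is a hypothetical temporary name for LR990257.1.
dict_lines = ["LR990257.1\tseq1\n"]
gff_lines = [
    "seq1\tRepeatMasker\tUnknown\t100\t200\t500\t+\t.\tID=example\n",
    "unmapped\tRepeatMasker\tUnknown\t5\t50\t300\t-\t.\tID=other\n",
]
restored = swap_names_back(dict_lines, gff_lines)
print("\n".join(restored))
```

If the two approaches give different output on the same input, the column order of the dictionary file is the first thing worth checking.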
Thanks!
Hi,
I am new to bioinformatics, and was struggling to perform repeat prediction and annotation. I came across the bioRxiv paper for EarlGrey, and felt that this tool could help me. I therefore first installed all the required packages, including RepeatMasker and others, using conda. I then obtained the repeat libraries, placed them in /usr/local/RepeatMasker/Libraries/, and followed the installation instructions for when sudo privileges are not available. While installing, I got the following error message after many packages had been installed:
ERROR conda.core.link:_execute(733): An error occurred while installing package 'bioconda::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_0'.
Rolling back transaction: done
class: LinkError
message:
post-link script failed for package bioconda::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_0
location of failed script: /mnt/HD1/miniconda3/envs/earlGrey/bin/.bioconductor-genomeinfodbdata-post-link.sh
==> script messages <==
==> script output <==
stdout: /mnt/HD1/miniconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/mnt/HD1/miniconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/mnt/HD1/miniconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
ERROR: post-link.sh was unable to download any of the following URLs with the md5sum 74c82f26111062a9ceb3c5331088cd56:
https://bioconductor.org/packages/3.14/data/annotation/src/contrib/GenomeInfoDbData_1.2.7.tar.gz
https://bioarchive.galaxyproject.org/GenomeInfoDbData_1.2.7.tar.gz
https://depot.galaxyproject.org/software/bioconductor-genomeinfodbdata/bioconductor-genomeinfodbdata_1.2.7_src_all.tar.gz
stderr: % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 416 100 416 0 0 773 0 --:--:-- --:--:-- --:--:-- 773
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 153 100 153 0 0 129 0 0:00:01 0:00:01 --:--:-- 129
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 153 100 153 0 0 155 0 --:--:-- --:--:-- --:--:-- 155
md5sum: WARNING: 1 computed checksum did NOT match
return code: 1
kwargs:
{}
Traceback (most recent call last):
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1129, in call
return func(*args, **kwargs)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda_env/cli/main.py", line 80, in do_call
exit_code = getattr(module, func_name)(args, parser)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/notices/core.py", line 72, in wrapper
return_value = func(*args, **kwargs)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda_env/cli/main_create.py", line 156, in execute
result[installer_type] = installer.install(prefix, pkg_specs, args, env)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda_env/installers/conda.py", line 59, in install
unlink_link_transaction.execute()
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/core/link.py", line 284, in execute
self._execute(tuple(concat(interleave(self.prefix_action_groups.values()))))
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/core/link.py", line 747, in _execute
raise CondaMultiError(tuple(concatv(
conda.CondaMultiErrorclass: LinkError
message:
post-link script failed for package bioconda::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_0
location of failed script: /mnt/HD1/miniconda3/envs/earlGrey/bin/.bioconductor-genomeinfodbdata-post-link.sh
==> script messages <==
==> script output <==
stdout: /mnt/HD1/miniconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/mnt/HD1/miniconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/mnt/HD1/miniconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
ERROR: post-link.sh was unable to download any of the following URLs with the md5sum 74c82f26111062a9ceb3c5331088cd56:
https://bioconductor.org/packages/3.14/data/annotation/src/contrib/GenomeInfoDbData_1.2.7.tar.gz
https://bioarchive.galaxyproject.org/GenomeInfoDbData_1.2.7.tar.gz
https://depot.galaxyproject.org/software/bioconductor-genomeinfodbdata/bioconductor-genomeinfodbdata_1.2.7_src_all.tar.gz
stderr: % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 416 100 416 0 0 773 0 --:--:-- --:--:-- --:--:-- 773
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 153 100 153 0 0 129 0 0:00:01 0:00:01 --:--:-- 129
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 153 100 153 0 0 155 0 --:--:-- --:--:-- --:--:-- 155
md5sum: WARNING: 1 computed checksum did NOT match
return code: 1
kwargs:
{}
: <exception str() failed>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/HD1/miniconda3/bin/conda-env", line 9, in
sys.exit(main())
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda_env/cli/main.py", line 91, in main
return conda_exception_handler(do_call, args, parser)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1429, in conda_exception_handler
return_value = exception_handler(func, *args, **kwargs)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1132, in call
return self.handle_exception(exc_val, exc_tb)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1161, in handle_exception
return self.handle_application_exception(exc_val, exc_tb)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1175, in handle_application_exception
self._print_conda_exception(exc_val, exc_tb)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1179, in _print_conda_exception
print_conda_exception(exc_val, exc_tb)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/exceptions.py", line 1106, in print_conda_exception
stderrlog.error("\n%r\n", exc_val)
File "/mnt/HD1/miniconda3/lib/python3.9/logging/init.py", line 1475, in error
self._log(ERROR, msg, args, **kwargs)
File "/mnt/HD1/miniconda3/lib/python3.9/logging/init.py", line 1589, in _log
self.handle(record)
File "/mnt/HD1/miniconda3/lib/python3.9/logging/init.py", line 1598, in handle
if (not self.disabled) and self.filter(record):
File "/mnt/HD1/miniconda3/lib/python3.9/logging/init.py", line 806, in filter
result = f.filter(record)
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/gateways/logging.py", line 50, in filter
record.msg = record.msg % new_args
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/init.py", line 107, in repr
errs.append(e.repr())
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/init.py", line 64, in repr
return '%s: %s' % (self.class.name, str(self))
File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/conda/init.py", line 68, in str
return str(self.message % self._kwargs)
ValueError: unsupported format character 'T' (0x54) at index 1032
Setting path variables in script
Path variables set
Extracting zip archives
Extracted required archives
Remember to activate the earl grey conda environment before running earlGrey
earlGrey is ready to use. To execute from any directory, add earlGrey to path by pasting the code (minus the square brackets) below...
[export PATH=$PATH:$(realpath .)]
My server is based on Ubuntu, and I have never run RepeatMasker or other related tools until now, so I am not sure where the error is or how to resolve it. I would be sincerely grateful for your help. Also, kindly let me know if I need to provide any further information to help you resolve this issue.
thank you so much in advance,
with best regards
Amit
Hello, I'm running into what I think is an installation and/or compatibility issue. The pipeline continues up to the stage that generates *_mergedRepeats but fails to generate the GFF files, with the message
Can't locate CrossmatchSearchEngine.pm in @INC (you may need to install the CrossmatchSearchEngine module) (@INC contains: /home/amanda/miniconda3/envs/earlGrey/bin/../ /home/amanda/miniconda3/envs/earlGrey/bin /home/amanda/miniconda3/envs/earlGrey/lib/site_perl/5.26.2/x86_64-linux-thread-multi /home/amanda/miniconda3/envs/earlGrey/lib/site_perl/5.26.2 /home/amanda/miniconda3/envs/earlGrey/lib/5.26.2/x86_64-linux-thread-multi /home/amanda/miniconda3/envs/earlGrey/lib/5.26.2 .) at /home/amanda/miniconda3/envs/earlGrey/bin/rmOutToGFF3.pl line 76.
Later on, the pipeline continues regardless up to the "Resolving Overlapping Repeats" step, where it also reports:
Error in library(GenomicRanges) :
there is no package called ‘GenomicRanges’
Execution halted
I'm attaching the last 300 lines of the log file:
EarlGrey.log
I'm not sure how to solve it. I'm using RepeatMasker v4.1.2-p1 and RepeatModeler v2.0.3, both installed with conda as part of the earlGrey environment before running the configure file.
Thanks
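A quick way to narrow down the "Can't locate CrossmatchSearchEngine.pm in @INC" message is to check whether the module file exists anywhere inside the conda environment. The sketch below is only a diagnostic aid, not a fix confirmed by the maintainers: the helper name `find_module_dirs` is hypothetical, and it assumes the environment prefix is available in the `CONDA_PREFIX` variable.

```python
import os
from pathlib import Path


def find_module_dirs(prefix, module_file="CrossmatchSearchEngine.pm"):
    """Return the sorted list of directories under `prefix` that contain `module_file`.

    `find_module_dirs` is a hypothetical helper written for this example.
    """
    return sorted({str(path.parent) for path in Path(prefix).rglob(module_file)})


if __name__ == "__main__":
    # Assumption: the earlGrey environment is active, so CONDA_PREFIX points at it.
    env_prefix = os.environ.get("CONDA_PREFIX", ".")
    hits = find_module_dirs(env_prefix)
    if hits:
        for directory in hits:
            print("module found in:", directory)
    else:
        print("module not found under", env_prefix)
```

If a directory is reported, adding it to the `PERL5LIB` environment variable before rerunning is one possible workaround, since Perl searches that variable in addition to the `@INC` paths listed in the error. If nothing is found, the RepeatMasker installation in the environment is probably incomplete.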
Hello,
Thanks for your work on this tool. We were finally able to get it up and running after many dependency issues. One question I have as I experiment with earlGrey: RepeatModeler2 has the LTRStruct parameter, which we typically enable on every run. It doesn't seem to be possible to enable this parameter within the EarlGrey commands.
Am I correct in assuming that this is disabled for a reason? Would there be any benefit to turning it on?
Thanks,
Sam
I have a working conda installation of RepeatModeler and RepeatMasker, so I wanted to install EarlGrey in the same conda environment (called "earlGrey"), following the instructions under "If you already have RepeatMasker and RepeatModeler".
I get the following error on the ./configure step:
ERROR conda.core.link:_execute(730): An error occurred while installing package 'bioconda::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_0'.
Rolling back transaction: done
class: LinkError
message:
post-link script failed for package bioconda::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_0
location of failed script: /home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/bin/.bioconductor-genomeinfodbdata-post-link.sh
==> script messages <==
==> script output <==
stdout: /home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
ERROR: post-link.sh was unable to download any of the following URLs with the md5sum 74c82f26111062a9ceb3c5331088cd56:
https://bioconductor.org/packages/3.14/data/annotation/src/contrib/GenomeInfoDbData_1.2.7.tar.gz
https://bioarchive.galaxyproject.org/GenomeInfoDbData_1.2.7.tar.gz
https://depot.galaxyproject.org/software/bioconductor-genomeinfodbdata/bioconductor-genomeinfodbdata_1.2.7_src_all.tar.gz
stderr: % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 416 100 416 0 0 870 0 --:--:-- --:--:-- --:--:-- 870
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 153 100 153 0 0 135 0 0:00:01 0:00:01 --:--:-- 135
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 153 100 153 0 0 141 0 0:00:01 0:00:01 --:--:-- 142
md5sum: WARNING: 1 computed checksum did NOT match
return code: 1
kwargs:
{}
Traceback (most recent call last):
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1082, in call
return func(*args, **kwargs)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda_env/cli/main.py", line 80, in do_call
exit_code = getattr(module, func_name)(args, parser)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda_env/cli/main_create.py", line 142, in execute
result[installer_type] = installer.install(prefix, pkg_specs, args, env)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda_env/installers/conda.py", line 59, in install
unlink_link_transaction.execute()
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/core/link.py", line 281, in execute
self._execute(tuple(concat(interleave(itervalues(self.prefix_action_groups)))))
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/core/link.py", line 744, in _execute
raise CondaMultiError(tuple(concatv(
conda.CondaMultiErrorclass: LinkError
message:
post-link script failed for package bioconda::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_0
location of failed script: /home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/bin/.bioconductor-genomeinfodbdata-post-link.sh
==> script messages <==
==> script output <==
stdout: /home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
/home/ldapusers/janneke.aylward/anaconda3/envs/earlGrey/share/bioconductor-genomeinfodbdata-1.2.7-0/GenomeInfoDbData_1.2.7.tar.gz: FAILED
ERROR: post-link.sh was unable to download any of the following URLs with the md5sum 74c82f26111062a9ceb3c5331088cd56:
https://bioconductor.org/packages/3.14/data/annotation/src/contrib/GenomeInfoDbData_1.2.7.tar.gz
https://bioarchive.galaxyproject.org/GenomeInfoDbData_1.2.7.tar.gz
https://depot.galaxyproject.org/software/bioconductor-genomeinfodbdata/bioconductor-genomeinfodbdata_1.2.7_src_all.tar.gz
stderr: % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 416 100 416 0 0 870 0 --:--:-- --:--:-- --:--:-- 870
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 153 100 153 0 0 135 0 0:00:01 0:00:01 --:--:-- 135
md5sum: WARNING: 1 computed checksum did NOT match
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 153 100 153 0 0 141 0 0:00:01 0:00:01 --:--:-- 142
md5sum: WARNING: 1 computed checksum did NOT match
return code: 1
kwargs:
{}
: <exception str() failed>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ldapusers/janneke.aylward/anaconda3/bin/conda-env", line 7, in
sys.exit(main())
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda_env/cli/main.py", line 91, in main
return conda_exception_handler(do_call, args, parser)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1374, in conda_exception_handler
return_value = exception_handler(func, *args, **kwargs)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1085, in call
return self.handle_exception(exc_val, exc_tb)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1116, in handle_exception
return self.handle_application_exception(exc_val, exc_tb)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1132, in handle_application_exception
self._print_conda_exception(exc_val, exc_tb)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1136, in _print_conda_exception
print_conda_exception(exc_val, exc_tb)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/exceptions.py", line 1059, in print_conda_exception
stderrlog.error("\n%r\n", exc_val)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/logging/init.py", line 1463, in error
self._log(ERROR, msg, args, **kwargs)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/logging/init.py", line 1577, in _log
self.handle(record)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/logging/init.py", line 1586, in handle
if (not self.disabled) and self.filter(record):
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/logging/init.py", line 807, in filter
result = f.filter(record)
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/gateways/logging.py", line 61, in filter
record.msg = record.msg % new_args
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/init.py", line 132, in repr
errs.append(e.repr())
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/init.py", line 71, in repr
return '%s: %s' % (self.class.name, text_type(self))
File "/home/ldapusers/janneke.aylward/anaconda3/lib/python3.8/site-packages/conda/init.py", line 90, in str
return text_type(self.message % self._kwargs)
ValueError: unsupported format character 'T' (0x54) at index 1120
Setting path variables in script
Path variables set
Extracting zip archives
Extracted required archives
Remember to activate the earl grey conda environment before running earlGrey
earlGrey is ready to use. To execute from any directory, add earlGrey to path by pasting the code (minus the square brackets) below...
[export PATH=$PATH:$(realpath .)]
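A note on reading the log above: the final `ValueError: unsupported format character 'T'` looks like a secondary failure raised while conda was printing the first error, as the line "During handling of the above exception, another exception occurred" suggests. The underlying problem is the failed md5sum check on the GenomeInfoDbData download. The sketch below reproduces the secondary error in isolation. The message string is a shortened stand-in for the curl progress header, and `format_safely` is a hypothetical workaround, not conda's code.

```python
# Shortened stand-in for the curl progress header embedded in the error message.
message = "stderr: % Total % Received % Xferd"
kwargs = {}  # the log above shows "kwargs: {}"


def format_unsafely(msg, kw):
    # Same pattern as the failing line in the traceback: self.message % self._kwargs
    return msg % kw


def format_safely(msg, kw):
    # Hypothetical workaround: escape literal percent signs before formatting.
    return msg.replace("%", "%%") % kw


try:
    format_unsafely(message, kwargs)
    error_text = None
except ValueError as exc:
    # "% T" is parsed as a format specifier with an unknown conversion type.
    error_text = str(exc)

print(error_text)
print(format_safely(message, kwargs))
```

In other words, the `ValueError` can be ignored when diagnosing the install; the part to act on is the checksum mismatch reported by the post-link script.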
Hi, I am trying to run EarlGrey but the software dies after the RepeatClassifier step.
I get the following error in the "Straining TEs and Refining de novo Consensus Sequences" step:
ImportError: cannot import name '_aligners' from partially initialized module 'Bio.Align' (most likely due to a circular import) (/services/tools/earlgrey/20221213/lib/python3.6/site-packages/Bio/Align/__init__.py)
Any idea what might be going on?
Dear TobyBaril,
According to step 14: "Following defragmentation, Earl Grey removes overlapping TE annotations using a custom R script employing GenomicRanges (Lawrence et al., 2013), which ignores strand information and retains the longest TE of overlapping pairs."
I wonder whether this leads to an underrepresentation of nested TEs?
Thanks!
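To make the question concrete, here is a minimal sketch of a "keep the longest of overlapping annotations" rule applied to a host element with a nested insertion. This is not Earl Grey's R script. It is a simplified greedy stand-in written only to illustrate the concern, and the coordinates and labels are invented. The actual script may trim or split overlapping annotations rather than drop them.

```python
from typing import List, Tuple

# (start, end, label) with 1-based inclusive coordinates; strand is ignored.
Interval = Tuple[int, int, str]


def keep_longest(intervals: List[Interval]) -> List[Interval]:
    """Visit annotations from longest to shortest and keep each one
    only if it does not overlap an annotation that was already kept."""
    kept: List[Interval] = []
    by_length = sorted(intervals, key=lambda iv: iv[1] - iv[0], reverse=True)
    for start, end, label in by_length:
        if all(end < k_start or start > k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Invented example: a DNA transposon nested inside a longer LTR element.
annotations = [
    (100, 5000, "LTR_host"),
    (2000, 2300, "DNA_nested"),
    (6000, 6500, "LINE"),
]
print(keep_longest(annotations))
```

Under this simplified rule, the nested element disappears from the output while the host and the non-overlapping LINE remain, which is the kind of underrepresentation being asked about.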
Hi, thanks for providing this helpful pipeline! I installed earlgrey3.2 with conda. My run died at the TEstrainer step with the following error message:
<<< Straining TEs and Refining de novo Consensus Sequences >>>
Splitting run 1
Initial trf check for 1
Initial blast and preparation for MSA 1
Primary alignment run 1
Trimming run 1
Finished extension
cat: 'TS_mAcoRus-families.fa_2547/run_*/complete_mAcoRus-families.fa': No such file or directory
Splitting for simple/satellite packages
Running TRF
Running SA-SSR
Finding SSRs:
Running mreps
Trimming and sorting based on mreps, TRF, SA-SSR
Error in dplyr::mutate():
ℹ In argument: period = as.double(width(ssr)).
Caused by error:
! unable to find an inherited method for function ‘width’ for signature ‘"logical"’
Backtrace:
▆
<fn>
(<list>
, <stndrdGn>
, <env>
)└─rlang::abort(message, class = error_class, parent = parent, call = error_call)
Execution halted
Removing temporary files
Reclassifying repeats
cp: cannot stat 'TS_mAcoRus-families.fa_2547/trf/mAcoRus-families.fa.nonsatellite': No such file or directory
No database indicated or it is an empty file.
/home/huayun/mdwilson/huayun/miniconda3/envs/earlgrey/bin/RepeatClassifier - 2.0.5
NAME
RepeatClassifier - Classify Repeat Models
SYNOPSIS
RepeatClassifier [-options] -consensi
[-stockholm ]
DESCRIPTION
The options are:
-h(elp)
Detailed help
CONFIGURATION OVERRIDES
-ninja_dir
The path to the installation of the Ninja phylogenetic analysis
package.
-ucsctools_dir <string>
The path to the installation directory of the UCSC TwoBit Tools
(twoBitToFa, faToTwoBit, twoBitInfo etc).
-repeatmasker_dir <string>
The path to the installation of RepeatMasker (RM 4.1.4 or higher)
-trf_dir <string>
The full path to TRF program. TRF must be named "trf". ( 4.0.9 or
higher )
-cdhit_dir <string>
The path to the installation of the CD-Hit sequence clustering
package.
-rmblast_dir <string>
The path to the installation of the RMBLAST (2.13.0 or higher)
-recon_dir <string>
The path to the installation of the RECON de-novo repeatfinding
program.
-genometools_dir <string>
The path to the installation of the GenomeTools package.
-ltr_retriever_dir <string>
The path to the installation of the LTR_Retriever (v2.9.0 and
higher) structural LTR analysis package.
-rscout_dir <string>
The path to the installation of the RepeatScout ( 1.0.6 or higher )
de-novo repeatfinding program.
-mafft_dir <string>
The path to the installation of the MAFFT multiple alignment
program.
SEE ALSO
RepeatModeler
COPYRIGHT
Copyright 2005-2019 Institute for Systems Biology
AUTHOR
Robert Hubley [email protected]
/hpf/largeprojects/mdwilson/huayun/collabs/acomys_genomes/repeats/earlgrey/test/mAcoRus_EarlGrey/mAcoRus_strainer
Compiling library
cat: TS_mAcoRus-families.fa_2547/trf/mAcoRus-families.fa.satellites: No such file or directory
ERROR: TEstrainer failed to produce a strain file, please check the log file for more information
Any insights would be helpful!
Thanks!
Hi Tobias,
I see you used trimAl to clean the multiple sequence alignments. Based on your code, trimAl removes columns in which more than 40% of the sequences have a gap (-gt 0.6), while keeping at least 60% of the original alignment columns (-cons 60).
In my opinion, that won't be sufficient to clean the MSA and find the correct TE boundaries.
subprocess.run(SOFTWARE + 'trimal/source/trimal -in {} -gt 0.6 -cons 60 -fasta -out {}'.format('muscle/' + ALIGNED, 'muscle/' + FILEPREFIX + '_trimal.fa'), shell=True)
Yours sincerely
Jiangzhao
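For readers unfamiliar with the two flags in the command above: as far as I understand the trimAl documentation, `-gt` is 1 minus the fraction of sequences allowed to have a gap in a column, and `-cons` is the minimum percentage of original columns that must be kept. The sketch below is a simplified approximation of that behaviour, not trimAl's implementation. In particular, the rule used here for adding columns back to satisfy `-cons` (highest occupancy first) is an assumption.

```python
import math
from typing import List


def trim_columns(alignment: List[str], gap_threshold: float = 0.6,
                 min_conserved_pct: float = 60) -> List[str]:
    """Drop gappy columns from an alignment of equal-length strings.

    gap_threshold mimics -gt: keep columns where at least this fraction
    of sequences has a residue. min_conserved_pct mimics -cons: never
    keep fewer than this percentage of the original columns.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    # Fraction of sequences with a residue (not a gap) in each column.
    occupancy = [sum(seq[i] != "-" for seq in alignment) / n_seqs
                 for i in range(n_cols)]
    keep = {i for i, occ in enumerate(occupancy) if occ >= gap_threshold}

    # Assumed add-back rule: restore the best rejected columns until the floor is met.
    min_cols = math.ceil(n_cols * min_conserved_pct / 100)
    if len(keep) < min_cols:
        rejected = sorted((i for i in range(n_cols) if i not in keep),
                          key=lambda i: occupancy[i], reverse=True)
        keep.update(rejected[:min_cols - len(keep)])

    columns = sorted(keep)
    return ["".join(seq[i] for i in columns) for seq in alignment]


# Invented toy alignment with ragged, gappy edges.
toy = ["--ACGTAC--", "--ACGTACG-", "T-ACGTAC--", "--ACG-AC--", "--ACGTAC-A"]
print(trim_columns(toy))
```

With these settings the sparse edge columns are removed, but a column is judged only on its own gap content. That is consistent with the point raised above that gap and conservation thresholds alone may not pin down true TE boundaries.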
Heya,
Great tool! I just updated to the newer version and have some suggestions for the docs. I can't use Docker on our system, so I had to install without sudo. Changes I'd suggest:
I'm aware you're probably quite busy with the latest updates, so I'd be happy to PR some changes if that's easier. Still the most painless TE annotator I've encountered to date :)
Cheers
Hi,
Thanks for developing EarlGrey, it is very well explained and I hope I can complete my analyses with it! Here is my problem:
I installed EarlGrey following the installation guide step by step (RepeatMasker, RepeatModeler2, etc.). The pipeline runs smoothly until the "Straining TEs and Refining de novo Consensus Sequences" step. At this point, the programme builds a new DB and seems to add all the contigs correctly, but then it runs into "Permission denied" errors and crashes.
Changing folder and subfolder permissions doesn't seem to change anything. If this problem has to do with TEstrainer, I could not find any help elsewhere. Any help would be much appreciated!
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1640 sequences in 17.0914 seconds.
Splitting run 1
Initial trf check for 1
0% 0:2556=0s rnd-1_family-78.fasta
/bin/bash: TS-families.fa_8111/run_1/raw/rnd-1_family-66.fasta: Permission denied
0% 1:2555=0s rnd-1_family-21.fasta
/bin/bash: TS-families.fa_8111/run_1/raw/rnd-1_family-84.fasta: Permission denied
0% 2:2554=0s rnd-1_family-91.fasta
[...]
99% 2555:0=0s rnd-5_family-2924.fasta
/bin/bash: TS_ipomoeaTriloba-families.fa_7746/run_1/raw/rnd-5_family-2924.fasta: Permission denied
100% 2556:0=0s rnd-5_family-2924.fasta
Initial blast and preparation for MSA 1
0% 0:2556=0s rnd-1_family-78.fasta
I am trying to follow the TE identification protocol from Baril & Hayward 2022. I am using a custom library, so I want to use the BEE script (TEstrainer) from Earl Grey after completing the RepeatModeler and RepeatMasker steps. However, there are differences between the TEstrainer bundled with Earl Grey and the standalone TEstrainer. Would it be OK to use TEstrainer outside Earl Grey to complete the BEE step? Thanks
I had some trouble getting earlgrey installed via conda. Here is what ended up working for me: I had to create an env as below and then install earlgrey. Running the conda install command as per your GitHub instructions didn't work for me.
conda create -n earlgrey_env -c conda-forge python=3.8.15 mamba=1.2
conda activate earlgrey_env
mamba install -c bioconda earlgrey
I tested it with the NC_045808_EarlWorkshop.fasta from gitpod and it worked.
Hi Professor, thank you for developing this pipeline for TE annotation. Recently, I used it to annotate an arthropod genome of 2.7 Gb. After running for more than 27 hours, it stopped. The last 50 lines of the log file are shown below:
RepeatScout/RECON discovery complete: 4091 families found
Program Time: 27:15:15 (hh:mm:ss) Elapsed Time
Working directory: /public/home/rp1016swf/02.EarlG/spe_EarlGrey/spe_RepeatModeler/RM_28569.SatApr80935442023
may be deleted unless there were problems with the run.
The results have been saved to:
/public/home/rp1016swf/02.EarlG/spe_EarlGrey/spe_Database/spe-families.fa - Consensus sequences for each family identified.
/public/home/rp1016swf/02.EarlG/spe_EarlGrey/spe_Database/spe-families.stk - Seed alignments for each family identified.
/public/home/rp1016swf/02.EarlG/spe_EarlGrey/spe_Database/spe-rmod.log - Execution log. Useful for reproducing results.
The RepeatModeler stockholm file is formatted so that it can
easily be submitted to the Dfam database. Please consider contributing
curated families to this open database and be a part of this growing
community resource. For more information contact [email protected].
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< Straining TEs and Refining de novo Consensus Sequences >>>
Refining genome not found
Usage: [-l Repeat library] [-g Genome ] [-t Threads (default 4) ] [-f Flank (default 1000) ] [-r Numver of iterations of BEET to run (deafult 10)] [-d Out directory, if not specified wil be created ] [-h Print this help] [-M Ammount of memory TEstrainer needs to keep free]
cp: cannot stat ‘/public/home/rp1016swf/02.EarlG/spe_EarlGrey/spe_strainer/TS*/spe-families.fa.strained’: No such file or directory
ERROR: TEstrainer failed to produce a strain file, please check the log file for more information
I entered the directory "/public/home/rp1016swf/02.EarlG/spe_EarlGrey/spe_strainer" and it is empty. However, the pipeline ran successfully and produced all result files when I tested it with a small genome file. I don't know what is wrong or how I should solve this problem.
Looking forward to your reply.
Hi,
I like this tool. I am currently using panTE, something a former colleague created a few years ago. It is very exhaustive but suffers from some of the problems mentioned in your preprint, mainly overlapping features. It is also no longer being developed.
I just ran into a problem with the GFF file produced by EarlGrey: it does not work when imported into Geneious.
First, Geneious complains about the 'NA' in column 8 (phase of feature). This should be either 0, 1 or 2 for CDS features, or '.' for everything else.
After fixing that, Geneious imports the file fine, but seems to connect all features that have the same (non-unique) "ID" tag.
AlKewell_ctg_01 RepeatMasker Unknown 859 1137 1636 - . Tstart=903;Tend=1159;ID=RND-1_FAMILY-96;shortTE=F
AlKewell_ctg_01 RepeatMasker Unknown 6213 6877 1979 - . Tstart=771;Tend=1440;ID=RND-4_FAMILY-443;shortTE=F
AlKewell_ctg_01 RepeatMasker Unknown 7789 9093 8533 - . Tstart=868;Tend=2242;ID=RND-1_FAMILY-32;shortTE=F
AlKewell_ctg_01 RepeatMasker Unknown 9095 9331 1007 - . Tstart=718;Tend=953;ID=RND-1_FAMILY-103;shortTE=F
AlKewell_ctg_01 RepeatMasker Unknown 9536 10383 1843 - . Tstart=91;Tend=729;ID=RND-1_FAMILY-103;shortTE=F
AlKewell_ctg_01 RepeatMasker LTR/Gypsy 10481 12295 10599 - . Tstart=6031;Tend=7846;ID=RND-4_FAMILY-173;shortTE=F
AlKewell_ctg_01 RepeatMasker LTR/Gypsy 12513 16028 21611 - . Tstart=2497;Tend=6035;ID=RND-4_FAMILY-173;shortTE=F
AlKewell_ctg_01 RepeatMasker DNA/hAT-Ac 16234 16507 1361 - . Tstart=6895;Tend=7199;ID=RND-1_FAMILY-1;shortTE=F
AlKewell_ctg_01 RepeatMasker LTR/Gypsy 17463 19109 10105 - . Tstart=860;Tend=2505;ID=RND-4_FAMILY-173;shortTE=F
AlKewell_ctg_01 RepeatMasker LTR/Gypsy 19937 20871 3210 - . Tstart=755;Tend=1649;ID=RND-1_FAMILY-9;shortTE=F
AlKewell_ctg_01 RepeatMasker LTR/Gypsy 21089 21755 1529 + . Tstart=7174;Tend=7847;ID=RND-4_FAMILY-173;shortTE=F
As an improvement, I'd love to see a GFF file that is ready to run through NCBI's table2asn converter to produce annotations that can be submitted.
Thanks for this work!
Cheers,
Johannes
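As a stopgap for the two problems described above, a small post-processing step can rewrite the GFF before importing it. The sketch below is not part of Earl Grey. It assumes tab-separated, nine-column lines like the ones shown, replaces `NA` in column 8 with `.`, and makes each `ID` unique by appending a running counter per family. Moving the original family name into a `Name` attribute is my own choice for illustration, and `fix_gff` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, List


def fix_gff(lines: Iterable[str]) -> List[str]:
    """Return GFF lines with a valid phase column and unique ID attributes."""
    seen = Counter()  # number of features seen so far for each family ID
    fixed = []
    for raw in lines:
        line = raw.rstrip("\n")
        cols = line.split("\t")
        # Pass comments, blank lines, and unexpected layouts through untouched.
        if line.startswith("#") or len(cols) != 9:
            fixed.append(line)
            continue
        if cols[7] == "NA":
            cols[7] = "."  # phase is only meaningful for CDS features
        attrs = []
        for attr in cols[8].split(";"):
            if attr.startswith("ID="):
                family = attr[3:]
                seen[family] += 1
                attrs.append(f"ID={family}_{seen[family]}")
                attrs.append(f"Name={family}")  # assumption: keep family as Name
            else:
                attrs.append(attr)
        cols[8] = ";".join(attrs)
        fixed.append("\t".join(cols))
    return fixed


# Two features from the same family, as in the excerpt above, with the NA phase.
example = [
    "AlKewell_ctg_01\tRepeatMasker\tUnknown\t9095\t9331\t1007\t-\tNA\t"
    "Tstart=718;Tend=953;ID=RND-1_FAMILY-103;shortTE=F",
    "AlKewell_ctg_01\tRepeatMasker\tUnknown\t9536\t10383\t1843\t-\tNA\t"
    "Tstart=91;Tend=729;ID=RND-1_FAMILY-103;shortTE=F",
]
for out_line in fix_gff(example):
    print(out_line)
```

With unique IDs, a viewer that groups features by `ID` should no longer join separate copies of the same family, while the family remains recoverable from `Name`.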