aTEA (active TE annotation) is an analysis pipeline that identifies active TE transcripts (both autonomously expressed TEs and TE-gene chimeric transcripts) from long-read RNA-seq data (e.g., PacBio Iso-Seq or ONT cDNA-RNA sequencing).
Linux system
Python 3.8.17 or later
Perl 5.32
Isoseq (https://github.com/pacificbiosciences/isoseq/)
BamTools (v2.5.1) (https://github.com/pezmaster31/bamtools)
minimap2 (v2.24) (https://github.com/lh3/minimap2)
TALON (v5.0) (https://github.com/mortazavilab/TALON)
CD-HIT (v4.8.1) (https://github.com/weizhongli/cdhit)
gffread (v0.12.7) (https://github.com/gpertea/gffread)
Reference genome: GRCz11.fa
(https://ftp.ensembl.org/pub/release-103/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz)
Gene annotation: Danio_rerio.GRCz11.103.chr.gtf
(https://ftp.ensembl.org/pub/release-103/gtf/danio_rerio/Danio_rerio.GRCz11.103.chr.gtf)
TE annotation: GRCz11.TE.fa.out
(https://figshare.com/articles/dataset/zebrafish_TE_annotation_RepeatMasker_/24991395)
BED12: GRCz11.bed12
(generated from the gene annotation with paftools.js, which is distributed with minimap2: paftools.js gff2bed Danio_rerio.GRCz11.103.chr.gtf > GRCz11.bed12)
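The preparation of the reference files can be sketched as a small script. This is only an illustration under stated assumptions: wget, gunzip and paftools.js are on your PATH, the URLs are copied unchanged from the list above (adjust them if the server stores a file under a different name), and the helpers run and prepare_reference as well as the DRY_RUN variable are hypothetical names that are not part of aTEA. The TE annotation is hosted on figshare and is not fetched here. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: fetch and prepare the zebrafish reference files listed above.
# Assumption: wget, gunzip and paftools.js are on PATH. The names run,
# prepare_reference and DRY_RUN are hypothetical, not part of aTEA.

ENSEMBL=https://ftp.ensembl.org/pub/release-103
GENOME_URL="$ENSEMBL/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz"
GTF_URL="$ENSEMBL/gtf/danio_rerio/Danio_rerio.GRCz11.103.chr.gtf"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# The BED12 step uses a redirection, so it is wrapped in sh -c.
prepare_reference() {
    run wget -O GRCz11.fa.gz "$GENOME_URL" &&
    run gunzip -f GRCz11.fa.gz &&
    run wget -O Danio_rerio.GRCz11.103.chr.gtf "$GTF_URL" &&
    run sh -c 'paftools.js gff2bed Danio_rerio.GRCz11.103.chr.gtf > GRCz11.bed12'
}
```

Calling prepare_reference prints the four commands; calling it as DRY_RUN=0 prepare_reference executes them in the current folder and stops at the first failure.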
aTEA does not need to be installed on your local computer. To run it:
- Download the package aTEA-0.1.tar.gz to your local server
- Unpack it using the command
tar -zxvf aTEA-0.1.tar.gz
- Put all required genomic data into the same folder
- Make sure the required tools have been added to your PATH
- Run
bash aTEA.sh
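Before the first run, the PATH requirement above can be verified with a short helper. This is a minimal sketch: check_tools is a hypothetical function, not part of aTEA, and the executable names you pass to it are your responsibility, since some of the listed tools install several commands under different names.

```shell
# Sketch: report which of the given commands cannot be found on PATH.
# check_tools is a hypothetical helper, not part of aTEA.
check_tools() {
    missing=0
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_tools python3 perl bamtools minimap2 gffread paftools.js prints one line per command that is not found and returns a non-zero status if any is missing. Add the executables of IsoSeq, TALON and CD-HIT under the names used by your installation.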
We provide demo data for testing: a CCS BAM file containing 20,000 CCS reads, together with the other files needed for the analysis. The demo run produces transcript identification results and TE classification results, and should finish in about 10 minutes on a standard Linux computer with 10 cores.
- CCS bam file: demo.ccs.bam
- Zebrafish reference genome: GRCz11.fa
- Zebrafish genome annotation: Danio_rerio.GRCz11.103.chr.gtf
- Zebrafish TE annotation: GRCz11.TE.fa.out
- Isoseq primer sequences: primers.fasta
- BED12 file: GRCz11.bed12
- TALON config file: config_file_demo
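A quick check that all of the input files listed above are in place can save a failed run. The following is a sketch only: check_inputs is a hypothetical helper, not part of aTEA, and it assumes the files sit directly in the folder you pass to it (the current folder by default).

```shell
# Sketch: confirm that the demo input files exist and are not empty.
# check_inputs is a hypothetical helper, not part of aTEA.
check_inputs() {
    dir=${1:-.}
    rc=0
    for f in demo.ccs.bam GRCz11.fa Danio_rerio.GRCz11.103.chr.gtf \
             GRCz11.TE.fa.out primers.fasta GRCz11.bed12 config_file_demo; do
        # -s is true only for files that exist and have a size above zero.
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            rc=1
        fi
    done
    return "$rc"
}
```

Running check_inputs in the unpacked folder prints nothing and returns 0 when every file is present.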
- Transcript identification: final_talon.gtf (transcript identification results in GTF format) and final_talon.transcripts.fa (sequence file)
- Transcript quantification: *.combined.filterlow (read count and TPM for each transcript)
- Transcript classification: Demo.TE.annotation.list2 (TE-alone transcripts with TE annotation); Demo.TE-Gene.annotation.list2 (TE-gene transcripts with TE annotation) and Demo.Gene.annotation.list2 (TE-free gene transcripts)
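To get a first overview of the classification, the three list files can be tallied. This sketch rests on an assumption that is not confirmed by the description above: that each list file holds one record per line with no header line. If the files are laid out differently, the numbers count lines rather than transcripts. count_classes is a hypothetical helper, not part of aTEA.

```shell
# Sketch: count non-empty lines in each classification list.
# Assumption: one record per line, no header (not confirmed by aTEA).
# count_classes is a hypothetical helper, not part of aTEA.
count_classes() {
    dir=${1:-.}
    for class in TE TE-Gene Gene; do
        file="$dir/Demo.$class.annotation.list2"
        if [ -f "$file" ]; then
            # grep -c . prints the number of lines with at least one character.
            printf '%s\t%s\n' "$class" "$(grep -c . "$file")"
        else
            printf '%s\tNA\n' "$class"
        fi
    done
}
```

Calling count_classes in the output folder prints one tab-separated line per class, with NA for any list file that was not produced.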
- Generate CCS reads from subreads bam:
ccs your.subreads.bam your.ccs.bam
- Prepare genomic data for your own species, including genome sequence, gene annotation, TE annotation, etc.
- Replace the demo file names with your own: in aTEA.sh, replace the prefix demo with the prefix of your own files (your in the example above). If you have multiple samples/replicates to run, simply list all their names in the loop, e.g.,
sample1 sample2 sample3 ..
- Prepare the config file for TALON: replace the path to the labeled SAM file in the file config_file_demo, e.g.,
/your/home/workdir/labeled/your_sample_labeled.sam
- Run
bash aTEA.sh
or submit the job to a computing cluster.
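With several samples, the TALON config file can be generated instead of edited by hand. The sketch below assumes the four-column, comma-separated layout that TALON documents for its config file (dataset name, sample description, platform, path to the SAM file) and the labeled/<sample>_labeled.sam path pattern from the example above. make_talon_config is a hypothetical helper, not part of aTEA; it reuses the sample name as the description, and the platform string is whatever you pass in. Compare the output with config_file_demo before using it.

```shell
# Sketch: write one TALON config line per sample.
# Assumption: columns are name, description, platform, SAM path, and the
# labeled SAM files follow <workdir>/labeled/<sample>_labeled.sam.
# make_talon_config is a hypothetical helper, not part of aTEA.
make_talon_config() {
    workdir=$1
    platform=$2
    shift 2
    for sample in "$@"; do
        printf '%s,%s,%s,%s/labeled/%s_labeled.sam\n' \
            "$sample" "$sample" "$platform" "$workdir" "$sample"
    done
}
```

For example, make_talon_config /your/home/workdir PacBio sample1 sample2 > config_file writes two lines, one per sample, to a file named config_file (a name chosen here for illustration).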