jiekailab / scte
License: MIT License
Hi there,
I've been trying to run scTE on a dataset of human PBMCs. I ran CellRanger count on the FASTQs, got a bam file, created a human genome index with scTE_build using the same genes.gtf file that was used in the CellRanger reference package, then ran scTE on that bam file.
The results from scTE said that there were 1172 cells detected expressing at least 200 genes, and the results from running CellRanger count also indicated that there were an estimated 3500 genes expressed per cell. However, the csv file had almost no reads for the genes. Only 4 genes were found to have read counts of over 100 within some cells, which doesn't seem like it would be enough to separate the cells into clusters.
Attached below are some screenshots illustrating what I ran, the CellRanger count output, as well as a screenshot of my csv file. Just wondering if you had any input on this issue. Thank you in advance!
Hi, thanks for scTE!
I want to use the scTE_build command to generate the annotation file, but my TE file is the .out file generated by RepeatMasker. I noticed that the TE annotation needs to be in BED format, but I am not sure which columns of the .out file are needed. In addition, is single-cell data analysis with non-model animals feasible?
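For anyone with the same question, here is a minimal sketch of converting RepeatMasker .out lines to a 6-column BED file. It assumes the layout scTE_build accepts is chrom, start, end, repeat name, a numeric score, and strand, as in the awk one-liner posted in another issue on this page; the choice of the Smith-Waterman score for the fifth column and the function name `rmsk_out_to_bed` are my own assumptions, not something taken from the scTE code.

```python
def rmsk_out_to_bed(lines):
    """Yield 6-column BED rows (tab-separated strings) from RepeatMasker .out lines."""
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header lines: data rows start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        score = fields[0]            # Smith-Waterman score (assumed choice for column 5)
        chrom = fields[4]            # query sequence name
        start = int(fields[5]) - 1   # .out positions are 1-based; BED starts are 0-based
        end = int(fields[6])
        strand = "-" if fields[8] == "C" else "+"   # "C" marks the complement strand
        name = fields[9]             # matching repeat name
        yield "\t".join([chrom, str(start), str(end), name, score, strand])
```

The generator can be fed an open file handle, for example `for row in rmsk_out_to_bed(open("genome.fa.out")): print(row)`, where `genome.fa.out` is a placeholder file name.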
Hi,
I ran scTE with the following options, and it threw an IndexError. I am pasting the details below. Could you please help?
Python version is 3.9. mm39 index was successfully created with scTE_build, I had included the option in your code. A bam file generated with cellranger v7.0 was used as input.
================
$ scTE -i input.bam -o out -x mm39.exclusive.idx -p 80 -CB CB -UMI UB
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
INFO : Parameter list:
Sample = out
Reference annotation index = mm39.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 80
INFO : Loading the genome annotation index... 2022-08-16 16:32:08
INFO : Loaded 'mm39.exclusive.idx' binary file with 4018326 items
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'M', 'X', 'Y']
INFO : Finished loading the genome annotation index... 2022-08-16 16:32:54
INFO : Processing BAM/SAM files ...2022-08-16 16:32:54
INFO : Input SAM/BAM file appears to be valid
CB UB good
INFO : Done BAM/SAM files processing ...2022-08-16 16:59:20
INFO : Splitting ...2022-08-16 16:59:20
INFO : Executing multiple thread path with 80 threads
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/scTE/base.py", line 366, in splitChr
CRs[t[3]] += 1
IndexError: list index out of range
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/bin/scTE", line 4, in <module>
__import__('pkg_resources').run_script('scTE==1.0', 'scTE')
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/site-packages/pkg_resources/__init__.py", line 665, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1463, in run_script
exec(code, namespace, namespace)
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE", line 169, in <module>
main()
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE", line 134, in main
pool.map(partial_work, chr_list)
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
IndexError: list index out of range
================
Thanks,
Sam
Hi there,
I've been trying to run scTE on the provided test BAM file, however using the parameters -CB CR -UMI UR gives the error message that the bam file has no cell barcodes information and that I should set CB and UMI to False. Prior to running scTE, I've loaded the virtual environment from conda to ensure I have all the required packages installed.
The command that I'm entering is
$ scTE -i test.bam -o test -x mm10.exclusive.idx --hdf5 False -CB CR -UMI UR
With the error message being ERROR : The input file /scratch/st-mtokuy01-1/awang00/scTE/Data/test.bam has no cell barcodes information, plese make sure the aligner have add the cell barcode key, or set CB to False
Any advice on what I'm doing wrong would be appreciated. Thanks
EDIT: I was missing samtools; once I loaded it, scTE was able to recognize the CR:Z and UR:Z tags. However, just like #48, I'm receiving the message INFO : Detect 0 cells expressed at least 200 genes, results output to test.csv.
Hello,
Could you publish package releases that match the version numbers on GitHub, so that we can download specific versions? The latest version available is v1.0, but the package we obtained with "git clone" differs from the tar file.
Please let us know.
Thanks
Jay
Hi - thank you for creating such a unique/useful tool!
After testing the pipeline on one of the "out of the box" supported genomes (mm10) successfully, we are trying to use it to measure TE expression in single cells for the African turquoise killifish. We are using the NCBI genome annotation files (gtf) and ran repeatmasker to get the TE bed file. Everything looks ok based on the sample files provided here on Github.
We have the most recent version of the package, and we use this command to build the index:
scTE_build -te 2015_Genome_scTE.bed -gene GCF_001465895.1_Nfu_20140520_genomic_CLEAN_MT_exon.filtered.gtf -o Nfu_20140520 -g other
Although the run starts ok, we get a warning almost immediately that does not stop the run:
/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE_build:110: DeprecationWarning: 'U' mode is deprecated
o = open(tefile,'rU')
However, after a few minutes, the program aborts with the following error:
Traceback (most recent call last):
File "/usr/local/bin/scTE_build", line 4, in <module>
__import__('pkg_resources').run_script('scTE==1.0', 'scTE_build')
File "/usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 651, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1448, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE_build", line 468, in <module>
main()
File "/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE_build", line 461, in main
genomeIndex(args.genome,args.mode,tefile,genefile, args.out,'No path','No path')
File "/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE_build", line 127, in genomeIndex
gls.load_list(clean)
File "/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/scTE/miniglbase/genelist.py", line 1472, in load_list
list_to_load[0]
IndexError: list index out of range
Would you be able to assist in figuring out what the problem is? We are very excited to use the package on the African turquoise killifish data but won't be able to until we can have an index...
Thank you so much for your help in advance!
Hi,
Thanks for scTE.
I am trying to run scTE on 4 samples using the following command :
#!/bin/bash
scTE -i *.bam -o out -x /home/urangasw/Softwares/scTE/out.inclusive.idx --min_genes 100 --min_counts 400 -p 16
I use the following parameters to submit the job:
sbatch --ntasks=1 --cpus-per-task=32 --mem=90000mb --partition=long1 --time=48:00:00 --qos=fastlane temp.sh
It has been more than a day now and the program seems to be stuck at:
INFO : Parameter list:
Sample = out
Reference annotation index = /home/urangasw/Softwares/scTE/out.inclusive.idx
Minimum number of genes required = 100
Minimum number of counts required = 400
Number of threads = 16
INFO : Loading the genome annotation index... 2021-12-26 11:45:22
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Finished loading the genome annotation index... 2021-12-26 11:46:20
INFO : Processing BAM/SAM files ...2021-12-26 11:46:20
INFO : Input SAM/BAM file appears to be valid
INFO : Using parabam2bed as more than 1 input BAM
['1', '10', '10_GL383545V1_ALT', '10_GL383546V1_ALT', '10_KI270824V1_ALT', '10_KI270825V1_ALT', '10_KN196480V1_FIX', '10_KN538365V1_FIX', '10_KN538366V1_FIX', '10_KN538367V1_FIX', '10_KQ090020V1_ALT', '$
sed: couldn't write 49 items to stdout: Broken pipe
sed: couldn't write 49 items to stdout: Broken pipe
sed: couldn't write 54 items to stdout: Broken pipe
awk: cmd. line:1: (FILENAME=- FNR=347939132) fatal: print to "standard output" failed (Broken pipe)
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
sed: couldn't write 49 items to stdout: Broken pipe
sed: couldn't write 49 items to stdout: Broken pipe
sed: couldn't write 54 items to stdout: Broken pipe
awk: cmd. line:1: (FILENAME=- FNR=538652353) fatal: print to "standard output" failed (Broken pipe)
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
INFO : Done BAM/SAM files processing ...2021-12-26 15:14:43
INFO : Splitting ...2021-12-26 15:14:44
INFO : Executing multiple thread path with 16 threads
UR CR
UR CR
UR CR
UR CR
INFO : Finished processing sample files 2021-12-26 18:12:14
INFO : Fetching from the annotation index... 2021-12-26 18:12:14
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
Each of the 4 BAM files is approximately 40 GB in size. Please let me know what I should change to run the program successfully. Thanks in advance!
Hi,
When I tried to run the command: $ scTEATAC_build -g mm10.te.bed -o mm10.te.atac, I met an error that says: -bash: scTE_scatacseq: command not found. It seems that scatacseq is not included in your repository. Could you please give a guide about how to run scTE on the scATAC data in more details?
Thank you in advance.
Best wishes,
Apollo
Thank you for this great tool!
For motif enrichment analyses within TEs, would you recommend using the consensus sequence from Dfam? Or is there another method you use to obtain the TE sequence, especially since expression is not on a locus level? Thank you!
Hi,
I am trying to use this very nice package, but I encounter an error, apparently due to the BAM file I use.
I am using cellranger/5.0.1 to generate the BAM file
and get this error while running scTE:
ERROR : The input file /scratch/Maize_Primary_Align/outs/possorted_genome_bam.bam has no cell barcodes information, plese make sure the aligner have add the cell barcode key, or set CB to False
I have made sure that my options were set to --hdf5 True -CB CB -UMI UB.
I noticed that my BAM file has extra fields compared to the example on the scTE GitHub page,
in particular 4 additional tags between RG (RG:Z:Maize_Primary_Align:0:1:H5GLJDRX2:2) and RE (RE:A:E).
They are:
TX:Z:Zm00001d027288_T001,+617,91M | GX:Z:Zm00001d027288 | GN:Z:Zm00001d027288 | fx:Z:Zm00001d027288
Do you think that could be the cause of the error I am getting?
Otherwise I have all the expected tags, including CB and UB, so the cell barcodes are in the BAM file.
What would be the best way to remove these additional tags?
Thank you to all users in advance,
B
scTE fails to find cell barcode information in bam files I generated using the cell ranger pipeline:
$ scTE -i possorted_genome_bam.bam -o out_rep2 -x /software/scTE/mm10.exclusive.idx --hdf5 True -CB CB -UMI UB
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
INFO : Parameter list:
Sample = out_rep2
Reference annotation index = /software/scTE/mm10.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 1
INFO : Loading the genome annotation index... 2021-04-02 18:14:28
INFO : Loaded '/software/scTE/mm10.exclusive.idx' binary file with 3900779 items
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'M', 'X', 'Y']
INFO : Finished loading the genome annotation index... 2021-04-02 18:15:01
INFO : Processing BAM/SAM files ...2021-04-02 18:15:01
ERROR : The input file possorted_genome_bam.bam has no cell barcodes information, plese make sure the aligner have add the cell barcode key, or set CB to False
The BAM files do have CB and UB tags:
$ samtools view possorted_genome_bam.bam | head
A00521:52:HHVH7DMXX:1:2126:5737:27273 16 chr1 3000239 255 91M * 0 0 TTTCATCCAGGTTTTCCTGGTTTTTTTTTAGTATAGCCTTTCATAGTAGAATCTGATGATGTTTTTGATATCCTCATGTTCTGTTGTTATG FFFF:FFFFFFFF:FFFFFFFFFFFFF,FFFFF,FFFFF:FFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF NH:i:1 HI:i:1 AS:i:85 nM:i:2 RG:Z:Rachel_control_rep2:0:1:HHVH7DMXX:1 RE:A:I xf:i:0 CR:Z:ATTATCCCAGTATGCT CY:Z:FFFFFFFFFFFFFFFF CB:Z:ATTATCCCAGTATGCT-1 UR:Z:AGGTCCACTT UY:Z:FFFFFFFFFF UB:Z:AGGTCCACTT
A00521:52:HHVH7DMXX:1:2126:5936:27398 16 chr1 3000239 255 91M * 0 0 TTTCATCCAGGTTTTCCTGGTTTTTTTTTAGTATAGCCTTTCATAGTAGAATCTGATGATGTTTTTGATATCCTCATGTTCTGTTGTTATG FF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:85 nM:i:2 RG:Z:Rachel_control_rep2:0:1:HHVH7DMXX:1 RE:A:I xf:i:0 CR:Z:ATTATCCCAGTATGCT CY:Z:FFFFFFFFFFFFFFFF CB:Z:ATTATCCCAGTATGCT-1 UR:Z:AGGTCCACTT UY:Z:FFFFFFFFFF UB:Z:AGGTCCACTT
A00521:52:HHVH7DMXX:1:1470:6668:19617 16 chr1 3000373 255 91M * 0 0 TATGCCCTCTAGTTAGTCTGGCTAAGGGTTTATCTATCTTGTTGACTTTCTCAAAGAACCAGCTACTAGTTTGGTTGATTCTTTGAATATT FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:87 nM:i:1 RG:Z:Rachel_control_rep2:0:1:HHVH7DMXX:1 RE:A:I xf:i:0 CR:Z:ACTTACTCACAGGCCT CY:Z:FFFFFFFFFFFFFFFF CB:Z:ACTTACTCACAGGCCT-1 UR:Z:TGGTGTTGGT UY:Z:FFFFFFFFFF UB:Z:TGGTGTTGGT
A00521:52:HHVH7DMXX:2:1410:7952:4460 16 chr1 3009349 1 1S65M25S * 0 0 GTTTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGTGAACTAACCCATGTACTCTGCGTTGATACCAC FFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:3 HI:i:1 AS:i:59 nM:i:2 ts:i:25 RG:Z:Rachel_control_rep2:0:1:HHVH7DMXX:2 RE:A:I xf:i:0 CR:Z:TAGACCATCCGATATG CY:Z:FFF:FFFFFFFFFFFF CB:Z:TAGACCATCCGATATG-1 UR:Z:AACTATAACG UY:Z:FFFFFFFFFF UB:Z:AACTATAACG
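As a quick independent check of how many records actually carry the barcode tags, something like the sketch below can be run on the text output of `samtools view`. It only parses the optional TAG:TYPE:VALUE columns of SAM records; it does not reproduce scTE's own barcode check, and the function names are hypothetical.

```python
from collections import Counter


def optional_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of one SAM record as {tag: value}."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:            # the first 11 SAM columns are mandatory
        parts = field.split(":", 2)      # the value itself may contain ':'
        if len(parts) == 3:
            tags[parts[0]] = parts[2]
    return tags


def count_tags(sam_lines, wanted=("CB", "UB", "CR", "UR")):
    """Count alignment records and how many of them carry each wanted tag."""
    total = 0
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):         # skip SAM header lines
            continue
        total += 1
        present = optional_tags(line)
        counts.update(tag for tag in wanted if tag in present)
    return total, counts
```

For example, the lines could come from `samtools view possorted_genome_bam.bam | head -n 100000` read through `sys.stdin`. If CB and UB are present on most records here but scTE still reports no barcodes, the cause is more likely in the environment (for instance a missing samtools on PATH, as reported elsewhere on this page) than in the BAM file.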
When I use scTE, I get this message: sh: samtools: command not found
and I get an empty matrix. I can use samtools on my cluster normally.
Hi!
Thank you for the pipeline; it works really well with my 10X mouse single-cell RNA-seq data.
I have a question concerning the downstream analysis. The index you provide for the mouse TEs contains meta-genes based on the name of the TE, such as "IAPEz.int" or "IAPLTR1a−Mm".
I would like to know if you have a solution to show the overall expression of an entire family or class of TEs, such as LTR or ERV.
I guess I would need to build a custom index using the class and family columns of the BED file rather than the name. Have you considered adding that feature?
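One alternative to rebuilding the index is to aggregate the per-name counts after the fact. The sketch below sums counts for one cell by group, given a name-to-family (or name-to-class) mapping; the mapping shown is a small illustrative example that would in practice be built from the name, class, and family columns of a RepeatMasker table, and the TE names must be spelled exactly as they appear in the scTE output. This is post-hoc aggregation, not a feature of scTE itself.

```python
from collections import defaultdict


def aggregate_counts(counts, name_to_group):
    """Sum per-feature counts into groups; features without a mapping keep their own name."""
    grouped = defaultdict(int)
    for name, value in counts.items():
        grouped[name_to_group.get(name, name)] += value
    return dict(grouped)


# Hypothetical mapping from TE name to family (illustrative entries only).
te_family = {"IAPEz-int": "ERVK", "IAPLTR1a_Mm": "ERVK", "L1Md_F": "L1"}

# Hypothetical counts for a single cell; genes are left untouched.
cell_counts = {"IAPEz-int": 3, "IAPLTR1a_Mm": 2, "L1Md_F": 5, "Actb": 40}
family_counts = aggregate_counts(cell_counts, te_family)
```

The same idea applies column-wise to a full cell-by-feature matrix: features that map to the same group are summed within each cell.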
Thanks again!
Hi,
$python ./bin/scTEATAC -i CD_ICA.bam -x hg38.te.atac.idx -o out_atac
INFO : Arguments:
INFO : out: out_atac
INFO : index: hg38.te.atac.idx
INFO : Minimum number of counts required = 1000
INFO : Number of threads = 1
INFO : Loading the genome annotation index... 2022-10-07 17:43:47
INFO : Loaded 'hg38.te.atac.idx' binary file with 4005310 items
INFO : Finished loading the genome annotation index... 2022-10-07 17:44:27
INFO : Processing BAM/SAM files ...2022-10-07 17:44:27
*****WARNING: Query GTGTCAAAGGCAGTAC:150385822#0 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
(many warning)
*****WARNING: Query AGGACGATCGGTAGGA:188109048#0 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
INFO : Done BAM/SAM files processing ...2022-10-07 17:55:22
INFO : Splitting ...2022-10-07 17:55:22
INFO : Executing single thread path
INFO : Finished processing sample files 2022-10-07 17:55:37
INFO : Fetching from the annotation index... 2022-10-07 17:55:37
Traceback (most recent call last):
File "./bin/scTEATAC2", line 216, in <module>
main()
File "./bin/scTEATAC2", line 185, in main
align(chr=chrom, filename=outname, all_annot=None, glannot=glannot, whitelist=whitelist, CB=args.CB)
TypeError: align() got an unexpected keyword argument 'CB'
================
We changed some code in scTEATAC to work around the following error:
'''
$python ./bin/scTEATAC -i CD_ICA.bam -x hg38.te.atac.idx -o out_atac
usage: scTE_scatacseq [-h] [--ondisk] [--min_counts INT] [-CB [{True,False}]] [-UMI [{True,False}]]
[--ignoreDuplicates [{True,False}]] [--keeptmp [{True,False}]] [-p INT] [--hdf5 [{True,False}]] -i INPUT
[INPUT ...] -o [OUT] -x ANNOGLB [ANNOGLB ...] -g [genome]
scTE_scatacseq: error: the following arguments are required: -g/--genome
'''
Delete 'genome=args.genome' in line 144.
Delete lines 93 and 94.
Thanks
Xia
Hi.
In your publication: "Analysis of Alzheimer’s disease scRNA-seq data. The MARS-seq scRNA-seq raw data were download from GSE9896971. The raw fastq file were modified using custom scripts to embed the cell barcode and UMI in the same read, as in the 10x scRNA-seq format." What are the custom scripts? My raw data are from STRT-seq; the barcode is in read 1 and the UMI is in read 2. Do they need to be converted to the 10x scRNA-seq format?
Hi,
I have run scTE on a custom Zea mays (corn) genome, using a BAM file from 10x Genomics Cell Ranger, which gave me about 5000 good cells.
Everything went smoothly but:
1- I cannot see any TE in the out.h5ad file.
2- Only about 15% of the genes I detected using CellRanger are present in out.h5ad, and only the genes with a gene name. Plant genomes are poorly annotated and only a few genes have a name; sometimes two separate genes even share the same name. Only the gene ID should be used in plant genomes.
Is there a way to ensure that only the gene ID is used and not the gene name?
Examples of the last column of the Zea mays GTF file from Ensembl Plants:
gene with name:
gene_id "Zm00001d048603"; gene_name "GRAS-transcription factor 83"; gene_source "gramene"; gene_biotype "protein_coding";
gene without name:
gene_id "Zm00001d027230"; gene_source "gramene"; gene_biotype "protein_coding";
Should I modify the GTF manually?
3- Could you specify which 6 columns of the TE BED file need to be included?
In the UCSC website cited in the tutorial the definition of a bed file is "BED lines have three required fields and nine additional optional fields;
1- chrom - The name of the chromosome
2- chromStart - The starting position of the feature in the chromosome
3- chromEnd - The ending position of the feature in the chromosome"
What are the three other columns necessary for scTE?
Does one of the TE columns need a gene_name "TExxxx"?
As an example, the Xenopus file in the tutorial (https://hgdownload.soe.ucsc.edu/goldenPath/xenTro9/database/rmsk.txt.gz) has about 17 columns.
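Regarding point 2, one workaround is to rewrite the GTF before building the index so that every record has a gene_name equal to its gene_id. The sketch below assumes (based only on the behaviour described above, not on the scTE source) that scTE_build keys genes by the gene_name attribute; `ensure_gene_name` is a hypothetical helper.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "[^"]*"')


def ensure_gene_name(gtf_line, overwrite=False):
    """Return the GTF line with a gene_name attribute, defaulting to the gene_id value.

    With overwrite=True an existing gene_name is replaced by the gene_id as well,
    which avoids two separate genes sharing one name.
    """
    if gtf_line.startswith("#"):
        return gtf_line                      # comment or header line
    match = GENE_ID.search(gtf_line)
    if match is None:
        return gtf_line                      # nothing to base a name on
    gene_id = match.group(1)
    newline = "\n" if gtf_line.endswith("\n") else ""
    body = gtf_line.rstrip()
    if GENE_NAME.search(body):
        if overwrite:
            # A function replacement avoids backslash handling in the gene ID.
            body = GENE_NAME.sub(lambda m: 'gene_name "%s"' % gene_id, body)
    else:
        if not body.endswith(";"):
            body += ";"
        body += ' gene_name "%s";' % gene_id
    return body + newline
```

Applying this to every line of the GTF with `overwrite=True` would make all genes appear under their IDs, at the cost of losing the few descriptive names.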
Finally, I converted out.h5ad to a Seurat object using the SeuratDisk package:
Convert("out.h5ad", dest = "h5seurat", overwrite = TRUE)
pbmc3k <- LoadH5Seurat("out.h5seurat")
Is that a good way to do it?
Is there an easier way to get the expression matrix from the h5ad?
Thank you very much in advance for your help,
Bruno
Sample = out
Reference annotation index = mm10.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 1
INFO : Loading the genome annotation index... 2022-11-17 10:34:45
INFO : Loaded 'mm10.exclusive.idx' binary file with 3900779 items
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'M', 'X', 'Y']
INFO : Finished loading the genome annotation index... 2022-11-17 10:35:33
INFO : Processing BAM/SAM files ...2022-11-17 10:35:33
INFO : Input SAM/BAM file appears to be valid
CR UR good
sh: 1: gzip: Exec format error
INFO : Done BAM/SAM files processing ...2022-11-17 10:35:39
INFO : Splitting ...2022-11-17 10:35:39
INFO : Executing single thread path
INFO : Finished processing sample files 2022-11-17 10:35:39
INFO : Fetching from the annotation index... 2022-11-17 10:35:39
sh: 1: gzip: Exec format error
/usr/bin/gunzip: 57: exec: gzip: Exec format error
INFO : Done fetching... 2022-11-17 10:35:39
INFO : Calculating expression... 2022-11-17 10:35:39
INFO : Detect 0 cells expressed at least 200 genes, results output to out.csv
INFO : Finished calculating expression 2022-11-17 10:35:39
INFO : Done with 0d 0h 0m 53s
I used test.bam to run it, but the results show 0 cells detected.
Does this mean it is still not working?
It appears the code base has a hidden dependency on samtools. Every run I attempted produced the following error:
bam has no cell barcodes information, plese make sure the aligner have add the cell barcode key, or set CB to False
until I installed samtools on my system. Only then was I able to pass the CB/UMI check. It looks like samtools is being called from here, among other places.
You may want to add this to your installation instructions so that users are aware.
I also had a question regarding the recommended settings for running STARsolo. In the readme, you recommend -CB CR -UMI UR for STARsolo but recommend -CB CB -UMI UB for Cell Ranger. Is there any reason you recommend raw/uncorrected tags for STARsolo but corrected tags for Cell Ranger? In general, which tags are best to use?
Hi, thank you for the nice pipeline; I liked your article a lot!
Could you share an example of STARsolo mapping of scATAC-seq data that produces the 'CR:Z' or 'UR:Z' tags in the BAM file?
I have tried the following on the 10Kpbmc sc-atac-seq data (10X example dataset) with STAR 2.7.8a:
STAR --genomeDir $genomedir
--readFilesIn atac_pbmc_10k_v1_S1_L001_R3_001.fastq.gz,atac_pbmc_10k_v1_S1_L002_R3_001.fastq.gz \
atac_pbmc_10k_v1_S1_L001_R1_001.fastq.gz,/atac_pbmc_10k_v1_S1_L002_R1_001.fastq.gz\
--runRNGseed 42 --runThreadN 12 --readFilesCommand zcat \
--outFilterMultimapNmax 100 --winAnchorMultimapNmax 100 --outSAMmultNmax 1 --outSAMtype BAM SortedByCoordinate --twopassMode Basic --outWigType wiggle --outWigNorm RPM\
--soloType CB_UMI_Simple \
--soloCBwhitelist 737K-august-2016.txt \
--soloBarcodeReadLength 0
This is what I could understand from the STARsolo documentation, but it is wrong because the BAM file has empty values for 'CR:Z' or 'UR:Z'.
samtools view Aligned.sortedByCoord.out.bam | head -1
A00519:269:H7FM2DRXX:2:2137:17978:8860 0 chr1 3004633 255 1S48M * 0 0 GCCTAGAATATTATGCCCAACAAAACTATCTTTCAGAAATGAAGGAGAA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:33 nM:i:7
Thanks in advance!
Jihed
Hi,
I used the test.bam in the folder to run
scTE -i test.bam -o out -x mm10.exclusive.idx --hdf5 True -CB CR -UMI UR
Sample = out
Reference annotation index = mm10.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 1
INFO : Loading the genome annotation index... 2022-11-01 12:23:40
INFO : Loaded 'mm10.exclusive.idx' binary file with 3900779 items
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'M', 'X', 'Y']
INFO : Finished loading the genome annotation index... 2022-11-01 12:24:30
INFO : Processing BAM/SAM files ...2022-11-01 12:24:30
INFO : Input SAM/BAM file appears to be valid
CR UR good
sh: 1: gzip: Exec format error
INFO : Done BAM/SAM files processing ...2022-11-01 12:24:36
INFO : Splitting ...2022-11-01 12:24:36
INFO : Executing single thread path
INFO : Finished processing sample files 2022-11-01 12:24:36
INFO : Fetching from the annotation index... 2022-11-01 12:24:36
sh: 1: gzip: Exec format error
/usr/bin/gunzip: 57: exec: gzip: Exec format error
INFO : Done fetching... 2022-11-01 12:24:36
Hi Jiekai Lab,
Very nice work!
One question: must the chromosome names in the BAM file (1, 2, 3, 4, ..., X, Y) be consistent with the chromosome names (chr1, chr2, ..., chrX, chrY) in the gene annotation index (hg38.exclusive.idx)?
My BAM files use the chromosome names 1, 2, 3, 4, ..., X, Y, whereas the annotation index (hg38.exclusive.idx) generated by scTE uses chr1, chr2, ..., chrX, chrY. I ran my BAM files against hg38.exclusive.idx and got a result, but I don't know whether a result obtained this way is correct.
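One way to check this independently is to compare the contig names on both sides after removing the "chr" prefix. The logs posted on this page show the loaded index listing chromosomes as '1', '10', ..., 'M', 'X', 'Y', which suggests prefix-free names are used internally, but that is an inference from the logs rather than something confirmed in the code. The helper names below are hypothetical.

```python
def strip_chr(name):
    """Drop a leading 'chr' so that 'chr1' and '1' compare equal."""
    return name[3:] if name.startswith("chr") else name


def compare_contigs(bam_contigs, annotation_contigs):
    """Report which normalized contig names are shared or unique to each side."""
    bam = {strip_chr(c) for c in bam_contigs}
    anno = {strip_chr(c) for c in annotation_contigs}
    return {
        "shared": sorted(bam & anno),
        "bam_only": sorted(bam - anno),
        "annotation_only": sorted(anno - bam),
    }
```

The BAM contig names can be taken from the @SQ lines of `samtools view -H`. The mitochondrial contig is a common leftover: "MT" in the BAM and "chrM" in the annotation do not match after prefix stripping and would need an explicit rename.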
Thank you very much.
Kui Duan
Brilliant package and super easy to use!
I am trying to add some custom transgenes (TagBFP2, EGFP, TdTomato, etc.) as I have successfully done in the past with CellRanger and StarSolo, but I can't seem to get it to work. (They don't appear, and the TE results differ with the custom reference.)
Specifically, I use:
awk 'BEGIN{FS=OFS="\t"}{print $6,$7,$8,$11,$3,$10}' mus.rmsk.txt > mm10rmsk.bed
That comes out like this:
chr1 3000000 3002128 L1_Mus3 105 -
chr1 3003152 3003994 L1Md_F 268 -
chr1 3003993 3004054 L1_Mus3 279 -
Then I run with tmp.gtf being the input I use for making my StarSolo, kbtools, or CellRanger refs with transgenes as additional chromosomes (I'm assuming this is the problem?):
scTE_build -te mm10rmsk.bed -gene tmp.gtf -o custome
[Gtf tail]
BFP2 BFP2 exon 1 89 . + 0 gene_id "BFP2"; transcript_id "BFP2.1"; gene_name "BFP2";
BFP2 BFP2 transcript 1 89 . + 0 gene_id "BFP2"; transcript_id "BFP2.1"; gene_name "BFP2";
mTom mTom exon 1 670 . + 0 gene_id "mTom"; transcript_id "mTom.1"; gene_name "mTom";
mTom mTom transcript 1 670 . + 0 gene_id "mTom"; transcript_id "mTom.1"; gene_name "mTom";
mGFP mGFP exon 1 207 . + 0 gene_id "mGFP"; transcript_id "mGFP.1"; gene_name "mGFP";
mGFP mGFP transcript 1 207 . + 0 gene_id "mGFP"; transcript_id "mGFP.1"; gene_name "mGFP";
And then:
scTE -p 20 -i /star_out/PBS/Aligned.sortedByCoord.out.bam -o PBScustTE -x /custome.exclusive.idx --hdf5 True -CB CB -UMI UB
The custom transgenes are in the Bam.
But in Scanpy, BFP2, mTom, and mGFP don't appear in var_names alongside the genes and TEs, and the TEs are slightly different from what I see when using your mm10 index.
Any suggestions?
Thanks in advance!
The Readanno call was changed in scTE:
allelement, chr_list, all_annot, glannot = Readanno(filename=outname, annoglb=args.annoglb[0]) #genome=args.genome
but the Readanno call in scTEATAC was not updated, as you can see at
Line 144 in d9a300e
Hi there,
Thanks for providing such a wonderful pipeline.
I tried the test.bam file in the Data folder but got an error. Please see the details below.
The two command lines I used:
scTE_build -g mm10
scTE -i test.bam --min_genes 1 -o out.test -x mm10.exclusive.idx --hdf5 True -CB CR -UMI UR
The log shows an error at 'Calculating expression':
INFO : Parameter list:
Sample = out.test
Reference annotation index = mm10.exclusive.idx
Minimum number of genes required = 1
Minimum number of counts required = None
Number of threads = 1
INFO : Loading the genome annotation index... 2021-07-14 13:43:59
INFO : Loaded 'mm10.exclusive.idx' binary file with 3900779 items
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'M', 'X', 'Y']
INFO : Finished loading the genome annotation index... 2021-07-14 13:52:37
INFO : Processing BAM/SAM files ...2021-07-14 13:52:37
INFO : Input SAM/BAM file appears to be valid
CR UR good
INFO : Done BAM/SAM files processing ...2021-07-14 13:52:52
INFO : Splitting ...2021-07-14 13:52:52
INFO : Executing single thread path
INFO : Finished processing sample files 2021-07-14 13:52:52
INFO : Fetching from the annotation index... 2021-07-14 13:52:52
INFO : Done fetching... 2021-07-14 13:52:54
INFO : Calculating expression... 2021-07-14 13:52:54
Traceback (most recent call last):
File "/home/setup/zhu/biotools/miniconda3/bin/scTE", line 4, in <module>
__import__('pkg_resources').run_script('scTE==1.0', 'scTE')
File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script
exec(code, namespace, namespace)
File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/scTE-1.0-py3.7.egg/EGG-INFO/scripts/scTE", line 169, in <module>
main()
File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/scTE-1.0-py3.7.egg/EGG-INFO/scripts/scTE", line 155, in main
len_res, genenumber, filename = Countexpression(filename=args.out, allelement=allelement, genenumber=args.genenumber, cellnumber=args.cellnumber, hdf5=args.hdf5)
File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/scTE-1.0-py3.7.egg/scTE/base.py", line 505, in Countexpression
adata = ad.AnnData(np.asarray(data),var = var,obs = obs)
File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/anndata/_core/anndata.py", line 321, in __init__
filemode=filemode,
File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/anndata/_core/anndata.py", line 462, in _init_as_actual
_check_2d_shape(X)
File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/anndata/_core/anndata.py", line 97, in _check_2d_shape
f"X needs to be 2-dimensional, not {len(X.shape)}-dimensional."
ValueError: X needs to be 2-dimensional, not 1-dimensional.
Do you have any suggestions on this error?
Thank you very much!
Hi ! I have another question.
In Figure 1e of your article, I understand that you performed the downstream analysis based only on TE expression.
Does this mean that you filtered the count table produced by scTE to keep only the rows with TE names (excluding genes) before performing the downstream analysis (Scanpy/Seurat)?
Thanks for your answer!
Jihed
Hello!
I want to analyze single-cell RNA sequencing datasets from zebrafish, but I don't know how to set the parameters or modify the code for the zebrafish genome. Can you tell me how to change the code to analyze zebrafish datasets?
Thank you!
I filtered the bam files using this awk command:
samtools view possorted_genome_bam.bam -h | awk '/^@/ || /CB:/ && /UB:/' | samtools view -h -b > possorted_genome_bam.filtered.bam
then I do:
scTE -i possorted_genome_bam.filtered.bam -o out -x /home/lenail/scTE/hg38.exclusive.idx --hdf5 True -CB CB -UMI UB --thread 4
but I get this error:
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
INFO : Parameter list:
Sample = /net/bmc-lab5/data/kellis/users/lenail/PFC_aging/scTE/D19-4296/out
Reference annotation index = /home/lenail/scTE/hg38.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 4
INFO : Loading the genome annotation index... 2022-08-25 21:40:48
INFO : Loaded '/home/lenail/scTE/hg38.exclusive.idx' binary file with 4778929 items
INFO : Finished loading the genome annotation index... 2022-08-25 21:41:42
INFO : Processing BAM/SAM files ...2022-08-25 21:41:42
INFO : Input SAM/BAM file appears to be valid
INFO : Done BAM/SAM files processing ...2022-08-25 23:58:40
INFO : Splitting ...2022-08-25 23:58:40
INFO : Executing multiple thread path with 4 threads
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9', 'M', 'X', 'Y']
CB UB good
gzip: out_scTEtmp/o1/out.bed.gz: invalid compressed data--format violated
[the same gzip message is printed 23 times in total]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/lenail/.conda/envs/py39/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/lenail/.conda/envs/py39/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/lenail/.conda/envs/py39/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/scTE/base.py", line 366, in splitChr
CRs[t[3]] += 1
IndexError: list index out of range
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/lenail/.conda/envs/py39/bin/scTE", line 4, in <module>
__import__('pkg_resources').run_script('scTE==1.0', 'scTE')
File "/home/lenail/.conda/envs/py39/lib/python3.9/site-packages/pkg_resources/__init__.py", line 672, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/lenail/.conda/envs/py39/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1472, in run_script
exec(code, namespace, namespace)
File "/home/lenail/.conda/envs/py39/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE", line 169, in <module>
main()
File "/home/lenail/.conda/envs/py39/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE", line 134, in main
pool.map(partial_work, chr_list)
File "/home/lenail/.conda/envs/py39/lib/python3.9/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/lenail/.conda/envs/py39/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
IndexError: list index out of range
Any ideas?
When I tried to find differentially expressed genes and TEs for different cell clusters using an expression matrix containing both genes and TEs, I noticed that I got fewer differentially expressed genes than when using an expression matrix containing genes only (there is no difference for TEs). I assume that the TE counts are much higher than the gene counts, so some differentially expressed genes with low counts can no longer be detected after normalization. Is my assumption reasonable? How should I normalize the data properly to solve this problem?
checkCBUMI may not work as expected in
Line 109 in d9a300e
Some records in a BAM file generated by CellRanger do not have a CB:Z tag, so the result in testCR.txt will not equal 100.
Could you improve the checkCBUMI algorithm?
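A tolerance-based check is one way to handle records without CB:Z. The sketch below is not scTE's checkCBUMI; it only illustrates the idea of measuring what fraction of alignment records carry both tags instead of requiring all of them to. The function name tag_fraction and the demo records are made up for illustration, and the input is assumed to be SAM text lines (for example from samtools view).

```python
def tag_fraction(sam_lines, cb_tag="CB", umi_tag="UB"):
    """Return the fraction of alignment records carrying both tags."""
    total = 0
    with_both = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Optional tags start at the 12th column and look like TAG:TYPE:VALUE.
        tags = {field[:2] for field in fields[11:]}
        total += 1
        if cb_tag in tags and umi_tag in tags:
            with_both += 1
    return with_both / total if total else 0.0


# Made-up records: the second one has no corrected barcode or UMI.
core = ["0", "1", "100", "255", "4M", "*", "0", "0", "ACGT", "FFFF"]
records = [
    "\t".join(["r1"] + core + ["CB:Z:AAAC-1", "UB:Z:GGTT"]),
    "\t".join(["r2"] + core + ["NH:i:1"]),
    "\t".join(["r3"] + core + ["CB:Z:TTTG-1", "UB:Z:CCAA"]),
]
fraction = tag_fraction(records)
print(round(fraction, 3))
```

A caller could then accept the tag pair when the fraction exceeds some chosen threshold, rather than expecting exactly 100%.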
I have built a custom reference for hg19, but when I use the scTE command I can only choose -g hg38. Does this make a difference to my outputs?
Hi Jiekai Lab - thank you for developing such a great resource!
I keep running into an issue when creating custom genomes for scRNA analysis. Even when I run the example data with:
scTE_build -te ./Data/TE.bed -gene ./Data/Gene.gtf -o test.idx
I get the error: "Counting genome other not supported"
Could you please help me? I've ruled out installation issues as I don't have any other problems in reproducing the other parts of your pipeline (e.g., with scATAC or built in genomes)
Thank you
Ivan Ferreira
Sauka-Spengler Lab
Hi @jphe ,
I am trying to use scTE but get the following error:
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/compat/__init__.py", line 65, in <module>
from typing import Literal
ImportError: cannot import name 'Literal'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/scTE", line 4, in <module>
__import__('pkg_resources').run_script('scTE==1.0', 'scTE')
File "/home/asaera/.local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 650, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/asaera/.local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1446, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.6/dist-packages/scTE-1.0-py3.6.egg/EGG-INFO/scripts/scTE", line 13, in <module>
from scTE.base import *
File "/usr/local/lib/python3.6/dist-packages/scTE-1.0-py3.6.egg/scTE/base.py", line 16, in <module>
import anndata as ad
File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/__init__.py", line 7, in <module>
from ._core.anndata import AnnData
File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/_core/anndata.py", line 27, in <module>
from .raw import Raw
File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/_core/raw.py", line 11, in <module>
from .aligned_mapping import AxisArrays, AxisArraysView
File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/_core/aligned_mapping.py", line 11, in <module>
from ..utils import deprecated, ensure_df_homogeneous
File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/utils.py", line 11, in <module>
from ._core.sparse_dataset import SparseDataset
File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/_core/sparse_dataset.py", line 23, in <module>
from ..compat import _read_attr
File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/compat/__init__.py", line 68, in <module>
from typing_extensions import Literal
File "/usr/local/lib/python3.6/dist-packages/typing_extensions-4.2.0-py3.6.egg/typing_extensions.py", line 159, in <module>
class _FinalForm(typing._SpecialForm, _root=True):
AttributeError: module 'typing' has no attribute '_SpecialForm'
The code for scTE is as follows
mainD='/media/path/to/folder/'
inF="${mainD}STARsolo/sample1/"
inBAM="${inF}Aligned.sortedByCoord.out.bam"
outDir="${mainD}scTE/sample1/"
mkdir -p $outDir
scTE -i $inBAM -o $outDir -x "${mainD}hg38.exclusive.idx" -CB CR -UMI UR
idx file was generated as follows
cd ${mainD}
scTE_build -g hg38
This is a 10X chromium V3 experiment and BAM file was generated with STARsolo, I can provide the code if you need it.
I ran multiple tests and I always get the same error as above. I tried parsing the BAM file to move the cell barcode and UMI from CR/UR to CB/UB, I tried using out for -o as in the manual, and I tried both --hdf5 True and --hdf5 False, but I always get the same error.
Any idea or advice would be highly appreciated.
Thanks,
Hi,
Thank you for the resource! I am having trouble getting started using it. Mainly, I can't seem to understand how hg38.exclusive.idx was generated. I would greatly appreciate the help.
Hi,
Thanks for this software, it's great and intuitive to use.
I was trying to analyze the data from Deng et al., using the first sample as an example. I built the mm10 index with
scTE_build -g mm10 -o /scratch/reference/scTE/mm10
Then I tried to analyze one sample. The FASTQ files were obtained from GEO and aligned using STAR
STAR --genomeDir /scratch/Databases/Mm10/STAR_genome/ \
--readFilesIn /scratch/singleCell/Mouse/GSE45719_Deng_2014/Fastq//SRR805173.fastq.gz \
--readFilesCommand gunzip -c --outFileNamePrefix /scratch/Deng/bam/SRR805173/SRR805173 \
--outSAMtype BAM SortedByCoordinate --outSAMattributes XS --runThreadN 16 \
--outFilterMultimapNmax 100
Then I tried to run scTE
scTE -i bam/SRR805173/SRR805173Aligned.sortedByCoord.out.bam -o counts_ -g mm10 \
-x /scratch/reference/scTE/mm10.exclusive.idx -p 16 --expect-cells 300 -CB False -UMI False
However, it has been stuck for hours after printing INFO: fetching from the annotation index 2021-03-17 16:05:27
Do you have any idea what could cause this?
Best
Hi, thanks for providing such a useful tool! I want to use it to analyze my single-cell data. In the example code 3.diffexp.py, line 44 uses the file 'TE_genes_id.mm10.txt'. I couldn't find this file in the main text or the supplementary material.
How can I get the equivalent file for hg38 so that I can continue with the next step of the analysis?
Hi, @jphe
It's a great tool! I encountered the same problem as #3, but with Macaca mulatta.
The gene annotation file was downloaded from http://ftp.ensembl.org/pub/release-104/gtf/macaca_mulatta/Macaca_mulatta.Mmul_10.104.gtf.gz,
and the RepeatMasker file was downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/rheMac10/database/rmsk.txt.gz.
Then I converted the RepeatMasker file into a six-column BED file with the command awk 'BEGIN{FS=OFS="\t"}{print $6,$7,$8,$11,$3,$10}' rmsk.txt > mmul10rmsk.bed and made sure the chromosome names were consistent with the gene annotation file.
Lastly, I built the index with scTE_build -te mmul10rmsk.bed -gene Macaca_mulatta.Mmul_10.104.gtf -o Mmul_10scTE.idx.
However, I get the error: Counting genome other not supported.
Any tips are appreciated !
Thank you for your generous help!
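For readers who prefer Python to awk, the conversion described above can be sketched as follows. It mirrors the same column selection ($6,$7,$8,$11,$3,$10) and optionally strips a leading "chr" so the names match an Ensembl-style GTF. The function name rmsk_row_to_bed and the demo row are illustrative assumptions, and this sketch does not address the "Counting genome other not supported" error itself.

```python
def rmsk_row_to_bed(line, strip_chr=True):
    """Convert one tab-separated UCSC rmsk.txt row to a six-column BED line.

    Same columns as the awk command: $6,$7,$8,$11,$3,$10.
    """
    f = line.rstrip("\n").split("\t")
    chrom = f[5]
    if strip_chr and chrom.startswith("chr"):
        # Assumption: the GTF uses Ensembl-style names without "chr".
        chrom = chrom[3:]
    return "\t".join([chrom, f[6], f[7], f[10], f[2], f[9]])


# Made-up row laid out like a UCSC rmsk table dump (17 columns).
demo_row = "\t".join([
    "585", "1504", "13", "4", "13", "chr1", "10000", "10468",
    "-248945954", "+", "L1MA", "LINE", "L1", "1", "463", "0", "1",
])
bed_line = rmsk_row_to_bed(demo_row)
print(bed_line)
```

To convert a whole file, the function can be applied line by line to an opened rmsk.txt and the results written to the BED file.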
Hi, the ReadMe file says "If you want to use your customs reference, you can use the -gene -te options:". We understood this as being able to use your code on other genomes than the mouse and the human. We tried this command to build the index:
scTE_build -te /path/to/hsal_v8.5_filtered_unique_ids.bed -gene /path/to/hsal_v8.5_genes_update16.gtf -o /path/to/scTE_build_1.idx
and we got the following error message:
scTE_build: error: the following arguments are required: -g/--genome
In the ReadMe file example the -g argument is not supplied for building a custom index. Why is it required? Any tips are appreciated. Thank you.
Hello, JiekaiLab team
Thank you for this great software for TE research. We see that scTE can be used to quantify TE reads from bulk RNA-seq data with the setting -CB False -UMI False. My question is whether the scTE output values need to be corrected with the EDASeq software.
Thank you very much!
Kui Duan
Hello,
I'm trying to run this on my scATAC-seq sample and I'm running into this issue:
TypeError: Readanno() got an unexpected keyword argument 'genome'
I built the hg38 index using the same method you used for your mm10 genome in the README.
scTEATAC -i filtered.bam -x hg38.te.atac.idx -p 16 -CB -g hg38
Any ideas why this is an issue?
Thanks,
Chang
Hello,
I read your research paper on TE quantification in single-cell data. Very witty!
It prompted me to try out your pipeline, but it crashed with an error very close to the end. I looked at the source code on GitHub, and I think I found the source of the error.
At line 155, where the Countexpression function is called, the filename argument is args.out, which is the output path specified by the user. Throughout the rest of the pipeline, outname (the basename of args.out) is used, and all temporary files are saved to the working directory using outname as the prefix for the tmp folder.
Thus, when Countexpression is called with args.out, it looks for the temporary files under incorrect paths.
From base.py:
def Countexpression(filename, allelement, genenumber, cellnumber, hdf5):
    gene_seen = allelement
    whitelist={}
    o = gzip.open('%s_scTEtmp/o4/%s.bed.gz'%(filename, filename), 'rt')
Temporarily changing args.out to outname in the Countexpression call solved the problem for me, and the pipeline ran smoothly. All output files were saved to the directory from which the script was run.
This error only occurs if args.out is given as a path, and it is not noticeable until the very end of the pipeline. I would suggest changing args.out to outname in the Countexpression call on line 155 to prevent this unexpected behaviour.
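The path mismatch described above can be reproduced in isolation. This is a minimal sketch that reuses only the format string quoted from base.py; the helper tmp_bed_path and the example output path are hypothetical and not part of scTE.

```python
import os


def tmp_bed_path(name):
    """Build the temporary file path the same way the quoted line does."""
    return '%s_scTEtmp/o4/%s.bed.gz' % (name, name)


args_out = "/data/run1/out"            # hypothetical value passed to -o
outname = os.path.basename(args_out)   # prefix used by the rest of the pipeline

path_from_outname = tmp_bed_path(outname)
path_from_args_out = tmp_bed_path(args_out)

print(path_from_outname)    # where the temporary files are written
print(path_from_args_out)   # where the lookup goes when args.out is passed
```

With a plain name such as out, both calls give the same path, which is why the problem only shows up when -o is a path.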
Hi, I'm trying to analyze single cell data and I expect more than 10,000 cells per sample. However, for these samples, they each say 10,000 cells detected in the output. Is there a cutoff at 10,000 or a way to get around this?
Thank you.
there should be a folder named samplename_tmp, and you can see the files under that folder
Originally posted by @jphe in #4 (comment)
Hi,
I'm wondering why I don't have the o3 and o4 folders under the tmp folder.
https://user-images.githubusercontent.com/52441289/114651476-a0599980-9d16-11eb-93e5-f2b8fe60ae09.png
Hi all, I am trying scTE on some scRNA-seq data of mine (hg38). I have BAM files generated with STARSolo and following the instructions I'm quantifying TE like this:
scTE -i ${SAMPLE}Aligned.sortedByCoord.out.bam -o ${SAMPLE}_TE -x hg38.exclusive.idx --hdf5 True -CB CR -UMI UR -p 8
All samples are being processed but in some log files I'm finding this message:
[…]
INFO : Loading the genome annotation index... 2021-11-18 11:13:28
INFO : Loaded '/beegfs/scratch/ric.cosr/cittaro.davide/Ref/scTE/hg38/hg38.exclusive.idx' binary file with 4779764 items
INFO : Finished loading the genome annotation index... 2021-11-18 11:14:06
INFO : Processing BAM/SAM files ...2021-11-18 11:14:06
INFO : Input SAM/BAM file appears to be valid
sed: couldn't write 50 items to stdout: Broken pipe
sed: couldn't write 53 items to stdout: Broken pipe
sed: couldn't write 58 items to stdout: Broken pipe
awk: cmd. line:1: (FILENAME=- FNR=131567431) fatal: print to "standard output" failed (Broken pipe)
The forked process
samtools view -@ 8 HCT116_FOLFIRI_LTAligned.sortedByCoord.out.bam | awk '{OFS="?"}{for(i=12;i<=NF;i++)if($i~/CR:Z:/)n=i}{for(i=12;i<=NF;i++)if($i~/UR:Z:/)m=i}{print $3,$4,$4+100,$n,$m}' | sed -r 's/CR:Z://g' | sed -r 's/UR:Z://g'| sed -r 's/^chr//g' | awk '!x[$4$5]++' | gzip -c > HCT116_FOLFIRI_LT_TE_scTEtmp/o1/HCT116_FOLFIRI_LT_TE.bed.gz
is still running (apparently) but, compared to other processes launched at the same moment, it seems to be stuck generating the content of the o1 folder. Any hint?
Very nice work! I have two BAM files from CellRanger (possorted_genome_bam.bam). How can I merge those two BAM files with scTE? Also, because I ran Seurat before scTE, how can I transfer the Seurat analysis results, such as the cell annotation and cell embedding, to scTE? Thanks.
Hi.
I am confused about the KeyError: 'M'. I have used scTE with the same BAM file before, but now it reports this error. I tried a few times and reinstalled from git, but the bug still exists.
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
INFO : Parameter list:
Sample = out
Reference annotation index = /home/data/xxxx/xxxxxx/scte/macFas5.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 1
INFO : Loading the genome annotation index... 2022-03-08 12:10:31
INFO : Loaded '/home/data/xxxxx/xxxxx/scte/macFas5.exclusive.idx' binary file with 4428688 items
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '3', '4', '5', '6', '7', '8', '9', 'X']
INFO : Finished loading the genome annotation index... 2022-03-08 12:11:28
INFO : Processing BAM/SAM files ...2022-03-08 12:11:28
INFO : Input SAM/BAM file appears to be valid
CR UR good
INFO : Done BAM/SAM files processing ...2022-03-08 13:26:22
INFO : Splitting ...2022-03-08 13:26:22
INFO : Executing single thread path
Traceback (most recent call last):
File "/home/data/xxxx/miniconda3/envs/xxxxxxx/bin/scTE", line 4, in <module>
__import__('pkg_resources').run_script('scTE==1.0', 'scTE')
File "/home/data/xx/miniconda3/envs/xxxxxx/lib/python3.8/site-packages/pkg_resources/__init__.py", line 651, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/data/xxx/miniconda3/envs/xxxxxx/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1448, in run_script
exec(code, namespace, namespace)
File "/home/data/xxx/miniconda3/envs/xxxxx/lib/python3.8/site-packages/scTE-1.0-py3.8.egg/EGG-INFO/scripts/scTE", line 169, in <module>
main()
File "/home/data/xxxxx/miniconda3/envs/xxxxxx/lib/python3.8/site-packages/scTE-1.0-py3.8.egg/EGG-INFO/scripts/scTE", line 129, in main
whitelist = splitAllChrs(chr_list, filename=outname, genenumber=args.genenumber, countnumber=args.countnumber, UMI=args.UMI)
File "/home/data/xxxxxxxx/miniconda3/envs/xxxxxx/lib/python3.8/site-packages/scTE-1.0-py3.8.egg/scTE/base.py", line 260, in splitAllChrs
if line in uniques[chrom]:
KeyError: 'M'
How can I fix it?
Thanks and best wishes,
Chris
Hello,
Thanks for scTE!
I ran it on my mouse scRNA-seq data; however, I haven't been able to read the output file... it is too big!
I have a .csv of 60 GB for each of my samples and I can't read them even on a cluster with 500 GB of RAM.
As I just want to analyze the expression of transposable elements (specifically the ERVs), I wondered if there is a way to obtain the counts for the TEs only and/or to filter out the genes and TEs with 0 counts, in order to get a smaller output file?
(When I tried to run it with --hdf5 True, it didn't work because it ran out of memory even with --mem=200G.)
thanks a lot for your help!
Javiera.
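One way to shrink such a file without loading it into memory is to stream it with Python's csv module. The sketch below is not an scTE feature. It assumes the CSV has one row per cell with the barcode in the first column and one column per gene or TE, and that a set of gene names (for example taken from the GTF) is available to tell genes from TEs. The function names and the demo table are made up for illustration.

```python
import csv
import io


def nonzero_te_columns(lines, gene_names):
    """First pass: indices of non-gene columns with at least one non-zero count."""
    reader = csv.reader(lines)
    header = next(reader)
    # Column 0 is assumed to hold the cell barcode.
    candidates = [i for i, name in enumerate(header)
                  if i > 0 and name not in gene_names]
    seen_nonzero = set()
    for row in reader:
        for i in candidates:
            if i not in seen_nonzero and float(row[i]) != 0:
                seen_nonzero.add(i)
    return [i for i in candidates if i in seen_nonzero]


def write_filtered(lines, out, keep):
    """Second pass: copy the barcode column plus the kept columns."""
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(lines):
        writer.writerow([row[0]] + [row[i] for i in keep])


# Tiny in-memory stand-in for a large CSV (made-up values).
demo = ("barcodes,Gapdh,MERVL-int,L1Md_A,IAPEz-int\n"
        "AAAC,5,0,2,0\n"
        "TTTG,3,1,0,0\n")
genes = {"Gapdh"}  # hypothetical gene list

keep = nonzero_te_columns(io.StringIO(demo), genes)
out = io.StringIO()
write_filtered(io.StringIO(demo), out, keep)
filtered = out.getvalue()
print(filtered)
```

For a real file, the two passes would each open the CSV from disk, so only one row is held in memory at a time.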
Hello,
Thanks for scTE. I ran into a problem when using it, and I guess it is related to the "-CB" and "-UMI" parameters.
the error is: scTE: error: argument -CB: invalid choice: 'CB' (choose from 'True', 'False')
My commands are:
scTE -i /Volumes/Backup\ Plus/F19FTSSCWLJ0261_10X/bam/F03_PBS_MG.possorted_genome_bam.bam -o /Volumes/Backup\ Plus/F19FTSSCWLJ0261_10X/bam/F03_PBS_MG.possorted_genome_bam_scTE -x mm10.exclusive.idx --hdf5 True -CB CB -UMI UB -g mm10
and part of my bam files are:
E00490:510:HYC2TCCXY:5:2105:25022:45646 256 1 3044966 0 30S121M * 0 0 AAGCAGTGGTATCAACGCAGAGTACATGGGAGAGAAAAACAAACCTGGGTATGCCTCGTAGTTAAAACATTCCTGGGAACATCTTGACCATAAGATAAAGGGGACTGTGAAGACATAGCAGGGCTATATTATCTAAGTCAACACCATCTGG AAFFFJFJJJJJJJJJJJJJJJJJJJJJJJJJJFAFJJFFJFFFJJJFFAFJFFJ7FFFJJJJJJFJJJJJFJJJJFJJJJJJJJFJJJJF7FJFJF-JJFJJ-<A<AFJJJJJJJFJJFJJJJJJ7AFFJ<FJAAAAAF<JJJJJFJJJJ NH:i:6 HI:i:3 AS:i:119 nM:i:0 NM:i:0 CR:Z:GCGCCAACATTGAGCT CY:Z:AAAFFFJJJJJJJJJJ CB:Z:GCGCCAACATTGAGCT-1 UR:Z:ATAGCAAGCA UY:Z:JJJJJJJFFJ UB:Z:ATAGCAAGCA BC:Z:CCTTTGTC QT:Z:AAFFFJJJ TR:Z:TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCATTACAAGAGGGCGTTTAATTTCGGGGGTTCAAACCGACATTCATCCCACACAAGAGATAATGGGAGTCGGCCACGGGGGCCGCAAGACAGGG TQ:Z:JJJJJFJJJJJJJJJJJJJJJJJJJJJJJ<--<-A--<----<)---AF--A---7<)))))7----7--))7--7-7<F<))---7-<-7<F--7--))7)<))))-)--)-)-))<A----<F RG:Z:F03_PBS_MG:MissingLibrary:1:HYC2TCCXY:5
I saw in your Readme.txt file that the values of the "-CB" and "-UMI" parameters are "CB" and "CR", but my terminal says I can only choose "False" or "True".
I'm looking forward to your reply, Thanks a lot!
Xiaoyu
Hi,
I have a problem running the command 'scTEATAC'. When running
scTEATAC -i input.bam -o outfile -x mm10.te.atac.idx
I receive the error
scTE_scatacseq: error: the following arguments are required: -g/--genome
However, when running the command scTEATAC -i input.bam -o outfile -x mm10.te.atac.idx -g mm10
another error occurs:
TypeError: Readanno() got an unexpected keyword argument 'genome'
Could you please provide an example on how to run scTE on the output from Cellranger-atac (10x scATAC-seq data)? The example provided in the README does not work for me.
Thanks and best wishes,
Malte
Hello! I am trying to use scTE on BAM files generated from STARsolo. The tool gets to the point where it says that the BAM files look good but does not progress. I have ensured that I have installed the latest version = 1.0 from the JiekaiLab git and have tried adjusting the -p from 1 to 8 with no success. Below is the output from the tool, any advice on how to fix this would be appreciated!
/wynton/home/greenelab/mkinisu/scripts/scTE_virtualenv/bin/python
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
INFO : Parameter list:
Sample = /wynton/scratch/mkinisu/Weili_HIV_STAR/Weili_HIV/WKWG04_C1_S9_L004
Reference annotation index = /wynton/home/greenelab/mkinisu/ref/scTE/hg38.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 8
INFO : Loading the genome annotation index... 2022-03-10 15:51:28
INFO : Loaded '/wynton/home/greenelab/mkinisu/ref/scTE/hg38.exclusive.idx' binary file with 4750078 items
INFO : Finished loading the genome annotation index... 2022-03-10 15:52:31
INFO : Processing BAM/SAM files ...2022-03-10 15:52:31
INFO : Input SAM/BAM file appears to be valid