jiekailab / scte

License: MIT License

Shell 0.01% Python 0.52% Jupyter Notebook 99.47%

scte's Issues

Not getting significant read counts

Hi there,

I've been trying to run scTE on a dataset of human PBMCs. I ran CellRanger count on the FASTQs, got a bam file, created a human genome index with scTE_build using the same genes.gtf file that was used in the CellRanger reference package, then ran scTE on that bam file.

The results from scTE said that there were 1172 cells detected expressing at least 200 genes, and the results from running CellRanger count also indicated that there were an estimated 3500 genes expressed per cell. However, the csv file had almost no reads for the genes. Only 4 genes were found to have read counts of over 100 within some cells, which doesn't seem like it would be enough to separate the cells into clusters.

Attached below are some screenshots illustrating what I ran, the CellRanger count output, as well as a screenshot of my csv file. Just wondering if you had any input on this issue. Thank you in advance!

metrics_summary_new.xlsx

TE file and species issue

Hi, thanks for scTE!
I want to use the scTE_build command to generate the annotation file, but my TE file is the .out file generated by RepeatMasker. I understand that the TE annotation needs to be in BED format, but I am not sure which columns of the .out file are needed. In addition, is single-cell analysis of non-model animals feasible with scTE?
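A minimal sketch of one way to convert a RepeatMasker .out file to a six-column BED file. The target layout (chrom, start, end, repeat name, a numeric column, strand) is an assumption taken from the awk command quoted in the "custom transgenes" issue on this page, not from the scTE documentation. The function name is hypothetical, and the Smith-Waterman score is used only as a placeholder for the fifth column.

```python
def repeatmasker_out_to_bed(lines):
    """Yield 6-column BED rows: chrom, start, end, name, score, strand."""
    for line in lines:
        fields = line.split()
        # Header and blank lines are skipped: data rows start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        chrom = fields[4]                 # query sequence
        start = int(fields[5]) - 1        # .out is 1-based; BED is 0-based, half-open
        end = int(fields[6])
        strand = "-" if fields[8] == "C" else "+"   # "C" means complement
        name = fields[9]                  # matching repeat, e.g. L1_Mus3
        score = fields[0]                 # placeholder (assumption): SW score
        yield "\t".join([chrom, str(start), str(end), name, score, strand])
```

Usage would be something like writing each yielded row to a .bed file while iterating over the open .out file. The repeat class/family sits in the next column (fields[10]) if it is needed later.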

IndexError: list index out of range

Hi,

I ran scTE with the following options, and it threw an IndexError. I am pasting the details below. Could you please help?
The Python version is 3.9. The mm39 index was created successfully with scTE_build (I had included the option in your code). A BAM file generated with Cell Ranger v7.0 was used as input.

================
$ scTE -i input.bam -o out -x mm39.exclusive.idx -p 80 -CB CB -UMI UB
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
INFO : Parameter list:
Sample = out
Reference annotation index = mm39.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 80

INFO : Loading the genome annotation index... 2022-08-16 16:32:08
INFO : Loaded 'mm39.exclusive.idx' binary file with 4018326 items
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'M', 'X', 'Y']
INFO : Finished loading the genome annotation index... 2022-08-16 16:32:54

INFO : Processing BAM/SAM files ...2022-08-16 16:32:54
INFO : Input SAM/BAM file appears to be valid
CB UB good

INFO : Done BAM/SAM files processing ...2022-08-16 16:59:20

INFO : Splitting ...2022-08-16 16:59:20
INFO : Executing multiple thread path with 80 threads
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/scTE/base.py", line 366, in splitChr
CRs[t[3]] += 1
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/bin/scTE", line 4, in <module>
__import__('pkg_resources').run_script('scTE==1.0', 'scTE')
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/site-packages/pkg_resources/__init__.py", line 665, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1463, in run_script
exec(code, namespace, namespace)
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE", line 169, in <module>
main()
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE", line 134, in main
pool.map(partial_work, chr_list)
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/sam/softwares/anaconda2/envs/scTE_python3.9/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
IndexError: list index out of range

================

Thanks,
Sam
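The failing statement in the traceback, CRs[t[3]] += 1, raises IndexError whenever t has fewer than four elements. Assuming t comes from splitting one line of an intermediate file on tabs (this is a guess, the traceback does not show it), a small sketch like the following could help locate the offending lines. The function name and the min_fields default are hypothetical.

```python
def find_short_rows(lines, min_fields=4):
    """Return (line_number, field_count) for rows with too few tab-separated fields."""
    short = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < min_fields:
            short.append((number, len(fields)))
    return short
```

Running it over an open file handle of the suspected intermediate file would list any truncated or malformed rows, which could then be compared against the BAM records they came from.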

Unable to recognize barcode information

Hi there,

I've been trying to run scTE on the provided test BAM file, however using the parameters -CB CR -UMI UR gives the error message that the bam file has no cell barcodes information and that I should set CB and UMI to False. Prior to running scTE, I've loaded the virtual environment from conda to ensure I have all the required packages installed.

The command that I'm entering is
$ scTE -i test.bam -o test -x mm10.exclusive.idx --hdf5 False -CB CR -UMI UR

With the error message being ERROR : The input file /scratch/st-mtokuy01-1/awang00/scTE/Data/test.bam has no cell barcodes information, plese make sure the aligner have add the cell barcode key, or set CB to False

Any advice on what I'm doing wrong would be appreciated. Thanks

EDIT: I was missing samtools, so once I got that loaded in, then it was able to recognize the CR:Z and UR:Z tags. However, just like #48, I'm receiving the message INFO : Detect 0 cells expressed at least 200 genes, results output to test.csv.

New release version required

Hello,

Could you publish package releases that match the version numbers on GitHub, so that we can download specific versions? The latest release I can see is v1.0, but the package we obtained with "git clone" differs from the release tar file.

Please let us know.

Thanks
Jay

Issues building an index for a custom genome (Turquoise killifish)

Hi - thank you for creating such a unique/useful tool!
After testing the pipeline on one of the "out of the box" supported genomes (mm10) successfully, we are trying to use it to measure TE expression in single cells for the African turquoise killifish. We are using the NCBI genome annotation files (gtf) and ran repeatmasker to get the TE bed file. Everything looks ok based on the sample files provided here on Github.
We have the most recent version of the package, and we use this command to build the index:

scTE_build -te 2015_Genome_scTE.bed -gene GCF_001465895.1_Nfu_20140520_genomic_CLEAN_MT_exon.filtered.gtf -o Nfu_20140520 -g other

Although the run starts ok, we get a warning almost immediately that does not stop the run:

/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE_build:110: DeprecationWarning: 'U' mode is deprecated
  o = open(tefile,'rU')

However, after a few minutes, the program aborts with the following error:

Traceback (most recent call last):
  File "/usr/local/bin/scTE_build", line 4, in <module>
    __import__('pkg_resources').run_script('scTE==1.0', 'scTE_build')
  File "/usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 651, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1448, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE_build", line 468, in <module>
    main()
  File "/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE_build", line 461, in main
    genomeIndex(args.genome,args.mode,tefile,genefile, args.out,'No path','No path')
  File "/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE_build", line 127, in genomeIndex
    gls.load_list(clean)
  File "/usr/local/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/scTE/miniglbase/genelist.py", line 1472, in load_list
    list_to_load[0]
IndexError: list index out of range

Would you be able to assist in figuring out what the problem is? We are very excited to use the package on the African turquoise killifish data but won't be able to until we can have an index...
Thank you so much for your help in advance!

Stuck while quantifying!

Hi,

Thanks for scTE.

I am trying to run scTE on 4 samples using the following command :

#!/bin/bash

scTE -i *.bam -o out -x /home/urangasw/Softwares/scTE/out.inclusive.idx --min_genes 100 --min_counts 400 -p 16

I use the following parameters to submit the job:

sbatch --ntasks=1 --cpus-per-task=32 --mem=90000mb --partition=long1 --time=48:00:00 --qos=fastlane temp.sh
It's been more than a day now and the program seems to be stuck at:

INFO    : Parameter list:
Sample = out
Reference annotation index = /home/urangasw/Softwares/scTE/out.inclusive.idx
Minimum number of genes required = 100
Minimum number of counts required = 400
Number of threads = 16

INFO    : Loading the genome annotation index... 2021-12-26 11:45:22
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Finished loading the genome annotation index... 2021-12-26 11:46:20

INFO    : Processing BAM/SAM files ...2021-12-26 11:46:20
INFO    : Input SAM/BAM file appears to be valid
INFO    : Using parabam2bed as more than 1 input BAM
['1', '10', '10_GL383545V1_ALT', '10_GL383546V1_ALT', '10_KI270824V1_ALT', '10_KI270825V1_ALT', '10_KN196480V1_FIX', '10_KN538365V1_FIX', '10_KN538366V1_FIX', '10_KN538367V1_FIX', '10_KQ090020V1_ALT', '$
sed: couldn't write 49 items to stdout: Broken pipe
sed: couldn't write 49 items to stdout: Broken pipe
sed: couldn't write 54 items to stdout: Broken pipe
awk: cmd. line:1: (FILENAME=- FNR=347939132) fatal: print to "standard output" failed (Broken pipe)
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
sed: couldn't write 49 items to stdout: Broken pipe
sed: couldn't write 49 items to stdout: Broken pipe
sed: couldn't write 54 items to stdout: Broken pipe
awk: cmd. line:1: (FILENAME=- FNR=538652353) fatal: print to "standard output" failed (Broken pipe)
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
INFO    : Done BAM/SAM files processing ...2021-12-26 15:14:43

INFO    : Splitting ...2021-12-26 15:14:44
INFO    : Executing multiple thread path with 16 threads
UR CR
UR CR
UR CR
UR CR
INFO    : Finished processing sample files 2021-12-26 18:12:14

INFO    : Fetching from the annotation index... 2021-12-26 18:12:14
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items
INFO    : Loaded '/home/urangasw/Softwares/scTE/out.inclusive.idx' binary file with 5971706 items

Each of the 4 BAM files is approximately 40 GB in size. Please let me know what I should change to run the program successfully. Thanks in advance!

Error when running scATAC data

Hi,

When I ran the command $ scTEATAC_build -g mm10.te.bed -o mm10.te.atac, I got the error: -bash: scTE_scatacseq: command not found. It seems that scTE_scatacseq is not included in your repository. Could you please give a more detailed guide on how to run scTE on scATAC data?

Thank you in advance.

Best wishes,
Apollo

TE sequences for motif enrichment

Thank you for this great tool!

For motif enrichment analyses within TEs, would you recommend using the consensus sequence from Dfam? Or is there another method you use to obtain the TE sequence, especially since expression is not on a locus level? Thank you!

Error using bam file from CellRanger

Hi,
I am trying to use this very nice package, but I encounter an error that is apparently due to the BAM file I use.
I am using cellranger/5.0.1 to generate the BAM file,
and I get this error while running scTE:
ERROR : The input file /scratch/Maize_Primary_Align/outs/possorted_genome_bam.bam has no cell barcodes information, plese make sure the aligner have add the cell barcode key, or set CB to False

I have made sure that my options were set as --hdf5 True -CB CB -UMI UB

I noticed that my BAM file has extra information compared to the example on the scTE GitHub page,
in particular 4 additional fields between RG (RG:Z:Maize_Primary_Align:0:1:H5GLJDRX2:2) and RE (RE:A:E)
they are:
TX:Z:Zm00001d027288_T001,+617,91M | GX:Z:Zm00001d027288 | GN:Z:Zm00001d027288 | fx:Z:Zm00001d027288

Do you think that could be the cause of the error I am getting?

For the rest, I have all the information including CB and UB, so the cell barcodes are in the BAM file.

What would be the best way to remove this additional information?
Thank you to all users in advance,

Failure to parse cell barcodes from bam files

scTE fails to find cell barcode information in BAM files that I generated using the Cell Ranger pipeline:

$ scTE -i possorted_genome_bam.bam -o out_rep2 -x /software/scTE/mm10.exclusive.idx --hdf5 True -CB CB -UMI UB
  DEBUG   : Creating converter from 7 to 5
  DEBUG   : Creating converter from 5 to 7
  DEBUG   : Creating converter from 7 to 5
  DEBUG   : Creating converter from 5 to 7
  INFO    : Parameter list:
  Sample = out_rep2
  Reference annotation index = /software/scTE/mm10.exclusive.idx
  Minimum number of genes required = 200
  Minimum number of counts required = None
  Number of threads = 1
  
  INFO    : Loading the genome annotation index... 2021-04-02 18:14:28
  INFO    : Loaded '/software/scTE/mm10.exclusive.idx' binary file with 3900779 items
  ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'M', 'X', 'Y']
  INFO    : Finished loading the genome annotation index... 2021-04-02 18:15:01
  
  INFO    : Processing BAM/SAM files ...2021-04-02 18:15:01
  ERROR   : The input file possorted_genome_bam.bam has no cell barcodes information, plese make sure the aligner have add the cell barcode key, or set CB to False

The BAM files have CB and UB tags:

$ samtools view possorted_genome_bam.bam | head
	A00521:52:HHVH7DMXX:1:2126:5737:27273   16      chr1    3000239 255     91M     *       0       0       TTTCATCCAGGTTTTCCTGGTTTTTTTTTAGTATAGCCTTTCATAGTAGAATCTGATGATGTTTTTGATATCCTCATGTTCTGTTGTTATG     FFFF:FFFFFFFF:FFFFFFFFFFFFF,FFFFF,FFFFF:FFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF      NH:i:1  HI:i:1  AS:i:85 nM:i:2  RG:Z:Rachel_control_rep2:0:1:HHVH7DMXX:1        RE:A:I  xf:i:0  CR:Z:ATTATCCCAGTATGCT   CY:Z:FFFFFFFFFFFFFFFF   CB:Z:ATTATCCCAGTATGCT-1 UR:Z:AGGTCCACTT UY:Z:FFFFFFFFFF UB:Z:AGGTCCACTT
	A00521:52:HHVH7DMXX:1:2126:5936:27398   16      chr1    3000239 255     91M     *       0       0       TTTCATCCAGGTTTTCCTGGTTTTTTTTTAGTATAGCCTTTCATAGTAGAATCTGATGATGTTTTTGATATCCTCATGTTCTGTTGTTATG     FF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF      NH:i:1  HI:i:1  AS:i:85 nM:i:2  RG:Z:Rachel_control_rep2:0:1:HHVH7DMXX:1        RE:A:I  xf:i:0  CR:Z:ATTATCCCAGTATGCT   CY:Z:FFFFFFFFFFFFFFFF   CB:Z:ATTATCCCAGTATGCT-1 UR:Z:AGGTCCACTT UY:Z:FFFFFFFFFF UB:Z:AGGTCCACTT
	A00521:52:HHVH7DMXX:1:1470:6668:19617   16      chr1    3000373 255     91M     *       0       0       TATGCCCTCTAGTTAGTCTGGCTAAGGGTTTATCTATCTTGTTGACTTTCTCAAAGAACCAGCTACTAGTTTGGTTGATTCTTTGAATATT     FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF      NH:i:1  HI:i:1  AS:i:87 nM:i:1  RG:Z:Rachel_control_rep2:0:1:HHVH7DMXX:1        RE:A:I  xf:i:0  CR:Z:ACTTACTCACAGGCCT   CY:Z:FFFFFFFFFFFFFFFF   CB:Z:ACTTACTCACAGGCCT-1 UR:Z:TGGTGTTGGT UY:Z:FFFFFFFFFF UB:Z:TGGTGTTGGT
	A00521:52:HHVH7DMXX:2:1410:7952:4460    16      chr1    3009349 1       1S65M25S        *       0       0       GTTTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGTGAACTAACCCATGTACTCTGCGTTGATACCAC     FFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF      NH:i:3  HI:i:1  AS:i:59 nM:i:2  ts:i:25 RG:Z:Rachel_control_rep2:0:1:HHVH7DMXX:2        RE:A:I  xf:i:0  CR:Z:TAGACCATCCGATATG   CY:Z:FFF:FFFFFFFFFFFF   CB:Z:TAGACCATCCGATATG-1 UR:Z:AACTATAACG UY:Z:FFFFFFFFFF UB:Z:AACTATAACG
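For checking tags programmatically rather than by eye, here is a minimal sketch that parses the optional fields of one SAM text record (as printed by samtools view) and reports whether the barcode and UMI tags are present. It relies only on the SAM layout of 11 mandatory tab-separated columns followed by TAG:TYPE:VALUE fields. The function names are hypothetical, and this is not how scTE itself performs the check.

```python
def sam_tags(sam_line):
    """Return the optional fields of one SAM record as {tag: value}."""
    fields = sam_line.strip().split("\t")
    tags = {}
    for field in fields[11:]:              # the first 11 columns are mandatory
        tag, _type, value = field.split(":", 2)   # value may itself contain ":"
        tags[tag] = value
    return tags


def has_barcode_tags(sam_line, cb="CB", umi="UB"):
    """True if both the cell barcode and UMI tags are present on the record."""
    tags = sam_tags(sam_line)
    return cb in tags and umi in tags
```

Note that Cell Ranger only writes CB and UB on reads whose barcode and UMI passed correction, so a check like this is best applied to many records rather than the first one.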

sh: samtools: command not found

When I use scTE, I get this message: sh: samtools: command not found (original message: sh: samtools 未找到命令).
I also get an empty matrix, even though I can use samtools on my cluster normally.

TE families and classes

Hi!
Thank you for the pipeline, it works really well with my 10X mouse single cell RNA-seq data.

I have a question concerning the downstream analysis. The build you provide for the mouse TEs contains meta-genes based on the name of the TE, such as "IAPEz.int" or "IAPLTR1a-Mm".

I would like to know if you have a solution to show the overall expression of an entire family or class of TEs, such as LTR or ERV.

I guess I would need to build a custom index using the class and family columns of the bed file rather than the name column. Have you considered adding that feature?
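As an alternative to rebuilding the index, the per-name counts could be summed after quantification. The sketch below is only an illustration under assumptions: the counts are held as nested dicts (cell to TE name to count), and the name-to-family mapping is something you would build yourself from the name and class/family columns of the RepeatMasker annotation. Neither structure comes from scTE, and the function name is hypothetical.

```python
from collections import defaultdict


def aggregate_counts(counts, name_to_group):
    """Sum per-cell counts of TE names into groups (e.g. family or class)."""
    grouped = defaultdict(lambda: defaultdict(int))
    for cell, per_name in counts.items():
        for name, value in per_name.items():
            group = name_to_group.get(name)
            if group is None:
                continue        # genes and names without a mapping are skipped
            grouped[cell][group] += value
    return {cell: dict(groups) for cell, groups in grouped.items()}


# Illustrative mapping only; build the real one from your annotation file.
example_map = {"IAPEz-int": "ERVK", "IAPLTR1a_Mm": "ERVK", "L1Md_F": "L1"}
example_counts = {"cell1": {"IAPEz-int": 3, "IAPLTR1a_Mm": 2, "L1Md_F": 5, "Actb": 7}}
print(aggregate_counts(example_counts, example_map))
```

Summing after the fact keeps the original per-name matrix intact, so family-level and name-level views can be compared on the same cells.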

Thanks again!

scATAC:TypeError: align() got an unexpected keyword argument 'CB'

Hi,

I ran scTEATAC with the following options, and it threw a TypeError. I am pasting the details below. Could you please help?

$ python ./bin/scTEATAC -i CD_ICA.bam -x hg38.te.atac.idx -o out_atac
INFO : Arguments:
INFO : out: out_atac
INFO : index: hg38.te.atac.idx

INFO : Minimum number of counts required = 1000
INFO : Number of threads = 1
INFO : Loading the genome annotation index... 2022-10-07 17:43:47
INFO : Loaded 'hg38.te.atac.idx' binary file with 4005310 items
INFO : Finished loading the genome annotation index... 2022-10-07 17:44:27

INFO : Processing BAM/SAM files ...2022-10-07 17:44:27
*****WARNING: Query GTGTCAAAGGCAGTAC:150385822#0 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
(many similar warnings)
*****WARNING: Query AGGACGATCGGTAGGA:188109048#0 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
INFO : Done BAM/SAM files processing ...2022-10-07 17:55:22

INFO : Splitting ...2022-10-07 17:55:22
INFO : Executing single thread path
INFO : Finished processing sample files 2022-10-07 17:55:37

INFO : Fetching from the annotation index... 2022-10-07 17:55:37
Traceback (most recent call last):
File "./bin/scTEATAC2", line 216, in <module>
main()
File "./bin/scTEATAC2", line 185, in main
align(chr=chrom, filename=outname, all_annot=None, glannot=glannot, whitelist=whitelist, CB=args.CB)
TypeError: align() got an unexpected keyword argument 'CB'

================
We had changed some code in scTEATAC to get around the following error:
'''
$ python ./bin/scTEATAC -i CD_ICA.bam -x hg38.te.atac.idx -o out_atac
usage: scTE_scatacseq [-h] [--ondisk] [--min_counts INT] [-CB [{True,False}]] [-UMI [{True,False}]]
[--ignoreDuplicates [{True,False}]] [--keeptmp [{True,False}]] [-p INT] [--hdf5 [{True,False}]] -i INPUT
[INPUT ...] -o [OUT] -x ANNOGLB [ANNOGLB ...] -g [genome]
scTE_scatacseq: error: the following arguments are required: -g/--genome
'''
The changes were: delete 'genome=args.genome' in line 144
and delete lines 93 and 94

Thanks
Xia

How to modify the raw fastq files from MARS-seq to match the 10x scRNA-seq format?

Hi.
In your publication: "Analysis of Alzheimer's disease scRNA-seq data. The MARS-seq scRNA-seq raw data were download from GSE98969 (ref. 71). The raw fastq file were modified using custom scripts to embed the cell barcode and UMI in the same read, as in the 10x scRNA-seq format." What are the custom scripts? My raw data is from STRT-seq; the barcode is in read 1 and the UMI is in read 2. Do they need to be modified to match the 10x scRNA-seq format?

Issues post-processing out.h5ad file from custom genome

Hi,

I have run scTE on a custom Zea mays (corn) genome, using a BAM file from 10x Genomics Cell Ranger, which gave me about 5000 good cells.
Everything went smoothly, but:

1- I cannot see any TE in the out.h5ad file.

2- Only about 15% of the genes I detected using Cell Ranger are present in out.h5ad, and only the genes that have a gene name. Plant genomes are poorly annotated and only a few genes get a name; sometimes two separate genes even get the same name. Only the gene ID should be used in plant genomes.
Is there a way to ensure that only the gene ID is used and not the gene name?
Examples of the last column of the Zea mays GTF file from Ensembl Plants:
gene with name:
gene_id "Zm00001d048603"; gene_name "GRAS-transcription factor 83"; gene_source "gramene"; gene_biotype "protein_coding";
gene without name:
gene_id "Zm00001d027230"; gene_source "gramene"; gene_biotype "protein_coding";
Should I modify the GTF manually?
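If editing the GTF turns out to be the way to go, it does not have to be done by hand. Here is a minimal sketch that sets gene_name to the gene_id on every record, replacing an existing gene_name or appending one when it is missing. It assumes that scTE_build picks up gene_name from the attribute column, which is inferred from the behaviour described above and not confirmed. The function name is hypothetical.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)";')
GENE_NAME = re.compile(r'gene_name "[^"]*";')


def use_gene_id_as_name(gtf_line):
    """Return the GTF line with gene_name set to the value of gene_id."""
    if gtf_line.startswith("#"):
        return gtf_line                      # comment/header lines are untouched
    match = GENE_ID.search(gtf_line)
    if match is None:
        return gtf_line                      # no gene_id: leave the line as-is
    new_attr = f'gene_name "{match.group(1)}";'
    if GENE_NAME.search(gtf_line):
        # Replace the existing (possibly duplicated or free-text) gene name.
        return GENE_NAME.sub(lambda m: new_attr, gtf_line, count=1)
    body = gtf_line.rstrip("\n")
    newline = gtf_line[len(body):]           # preserve the original line ending
    return f"{body} {new_attr}{newline}"
```

Applied line by line to the input GTF and written to a new file, this leaves every other attribute in place, so the same file can still be used for the Cell Ranger reference.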

3- Could you specify which 6 columns of the TE BED file need to be included?
On the UCSC website cited in the tutorial, the definition of a BED file is "BED lines have three required fields and nine additional optional fields;
1- chrom - The name of the chromosome
2- chromStart - The starting position of the feature in the chromosome
3- chromEnd - The ending position of the feature in the chromosome"

What are the three other columns necessary for scTE?
Does a TE need a gene_name such as "TExxxx"?

As an example, the xenopus file in the tutorial (https://hgdownload.soe.ucsc.edu/goldenPath/xenTro9/database/rmsk.txt.gz) has about 17 columns.

Finally, I converted the out.h5ad file to a Seurat object using the SeuratDisk package:

Convert("out.h5ad", dest = "h5seurat", overwrite = TRUE)
pbmc3k <- LoadH5Seurat("out.h5seurat")

Is that a good way to do it?
Is there an easier way to get the expression matrix from the h5ad file?

Thank you very much in advance for your help,

Bruno

Detect 0 cells expressed at least 200 genes, results output to out.csv

Sample = out
Reference annotation index = mm10.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 1

INFO : Loading the genome annotation index... 2022-11-17 10:34:45
INFO : Loaded 'mm10.exclusive.idx' binary file with 3900779 items
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'M', 'X', 'Y']
INFO : Finished loading the genome annotation index... 2022-11-17 10:35:33

INFO : Processing BAM/SAM files ...2022-11-17 10:35:33
INFO : Input SAM/BAM file appears to be valid
CR UR good

sh: 1: gzip: Exec format error
INFO : Done BAM/SAM files processing ...2022-11-17 10:35:39

INFO : Splitting ...2022-11-17 10:35:39
INFO : Executing single thread path
INFO : Finished processing sample files 2022-11-17 10:35:39

INFO : Fetching from the annotation index... 2022-11-17 10:35:39
sh: 1: gzip: Exec format error
/usr/bin/gunzip: 57: exec: gzip: Exec format error
INFO : Done fetching... 2022-11-17 10:35:39

INFO : Calculating expression... 2022-11-17 10:35:39
INFO : Detect 0 cells expressed at least 200 genes, results output to out.csv
INFO : Finished calculating expression 2022-11-17 10:35:39
INFO : Done with 0d 0h 0m 53s

I used test.bam to run it, but the results showed that 0 cells were detected.
Does this mean it is still not working?

Hidden Dependency on SAMtools and Question Regarding STARsolo Runs

It appears the code base has a hidden dependency on samtools. Every run I attempted produced the following error:

bam has no cell barcodes information, plese make sure the aligner have add the cell barcode key, or set CB to False

until I installed samtools on my system. Only then was I able to pass the CB/UMI check. It looks like samtools is being called from here among other places.

You may want to add this to your installation instructions so that users are aware.
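Until the instructions mention it, a quick pre-flight check along these lines can catch the problem before a long run. It is only a sketch: the tool list is an assumption based on the commands that appear in error messages on this page (samtools, gzip), not on an official dependency list, and the function name is hypothetical.

```python
import shutil


def missing_tools(tools=("samtools", "gzip")):
    """Return the names of external tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

shutil.which only confirms that an executable with that name is on PATH; it does not verify that the binary actually runs on the current architecture.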

I also had a question regarding the recommended settings for running STARsolo. In the readme, you recommend -CB CR -UMI UR for STARsolo but recommend -CB CB -UMI UB for Cell Ranger. Is there any reason you recommend raw/uncorrected tags for STARsolo but corrected tags for Cell Ranger? In general which tags are best to use?

How to make STARsolo include the 'CR:Z' or 'UR:Z' tags

Hi, thank you for the nice pipeline. I liked your article a lot!

Could you share an example of STARsolo mapping of scATAC-seq data that produces the 'CR:Z' or 'UR:Z' tags in the BAM file?

I have tried the following on the 10k PBMC scATAC-seq data (10x example dataset) with STAR 2.7.8a:

STAR  --genomeDir $genomedir \
      --readFilesIn atac_pbmc_10k_v1_S1_L001_R3_001.fastq.gz,atac_pbmc_10k_v1_S1_L002_R3_001.fastq.gz \
      atac_pbmc_10k_v1_S1_L001_R1_001.fastq.gz,/atac_pbmc_10k_v1_S1_L002_R1_001.fastq.gz \
      --runRNGseed 42 --runThreadN 12 --readFilesCommand zcat \
      --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100 --outSAMmultNmax 1 --outSAMtype BAM SortedByCoordinate --twopassMode Basic --outWigType wiggle --outWigNorm RPM \
      --soloType CB_UMI_Simple \
      --soloCBwhitelist 737K-august-2016.txt \
      --soloBarcodeReadLength 0

This is what I understood from the STARsolo documentation, but it must be wrong because the BAM file has no values for
'CR:Z' or 'UR:Z'.

samtools view Aligned.sortedByCoord.out.bam | head -1
A00519:269:H7FM2DRXX:2:2137:17978:8860	0	chr1	3004633	255	1S48M	*	0	0	GCCTAGAATATTATGCCCAACAAAACTATCTTTCAGAAATGAAGGAGAA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF	NH:i:1	HI:i:1	AS:i:33	nM:i:7

Thanks in advance!

Jihed

ValueError: X needs to be 2-dimensional, not 1-dimensional.

Hi,

I used the test.bam in the folder to run
scTE -i test.bam -o out -x mm10.exclusive.idx --hdf5 True -CB CR -UMI UR

And I got the problem below. How can I solve it, please?
Thank you.

Sample = out
Reference annotation index = mm10.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 1

INFO : Loading the genome annotation index... 2022-11-01 12:23:40
INFO : Loaded 'mm10.exclusive.idx' binary file with 3900779 items
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'M', 'X', 'Y']
INFO : Finished loading the genome annotation index... 2022-11-01 12:24:30

INFO : Processing BAM/SAM files ...2022-11-01 12:24:30
INFO : Input SAM/BAM file appears to be valid
CR UR good

sh: 1: gzip: Exec format error
INFO : Done BAM/SAM files processing ...2022-11-01 12:24:36

INFO : Splitting ...2022-11-01 12:24:36
INFO : Executing single thread path
INFO : Finished processing sample files 2022-11-01 12:24:36

INFO : Fetching from the annotation index... 2022-11-01 12:24:36
sh: 1: gzip: Exec format error
/usr/bin/gunzip: 57: exec: gzip: Exec format error
INFO : Done fetching... 2022-11-01 12:24:36

INFO : Calculating expression... 2022-11-01 12:24:36
Traceback (most recent call last):
File "/home/zhaolab/anaconda3/bin/scTE", line 4, in <module>
__import__('pkg_resources').run_script('scTE==1.0', 'scTE')
File "/home/zhaolab/anaconda3/lib/python3.9/site-packages/pkg_resources/__init__.py", line 656, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/zhaolab/anaconda3/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1453, in run_script
exec(code, namespace, namespace)
File "/home/zhaolab/anaconda3/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE", line 169, in <module>
main()
File "/home/zhaolab/anaconda3/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE", line 155, in main
len_res, genenumber, filename = Countexpression(filename=args.out, allelement=allelement, genenumber=args.genenumber, cellnumber=args.cellnumber, hdf5=args.hdf5)
File "/home/zhaolab/anaconda3/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/scTE/base.py", line 505, in Countexpression
adata = ad.AnnData(np.asarray(data),var = var,obs = obs)
File "/home/zhaolab/anaconda3/lib/python3.9/site-packages/anndata/_core/anndata.py", line 307, in __init__
self._init_as_actual(
File "/home/zhaolab/anaconda3/lib/python3.9/site-packages/anndata/_core/anndata.py", line 462, in _init_as_actual
_check_2d_shape(X)
File "/home/zhaolab/anaconda3/lib/python3.9/site-packages/anndata/_core/anndata.py", line 96, in _check_2d_shape
raise ValueError(
ValueError: X needs to be 2-dimensional, not 1-dimensional.

Must chromosome names be consistent when using scTE?

Hi Jiekai Lab,
Very nice work!
One question: must the chromosome names in the BAM file (1, 2, 3, 4, ..., X, Y) be consistent with the chromosome names (chr1, chr2, ..., chrX, chrY) in the annotation index (hg38.exclusive.idx)?

My BAM files use the chromosome names 1, 2, 3, 4, ..., X, Y. However, the annotation index (hg38.exclusive.idx) generated by scTE uses chr1, chr2, ..., chrX, chrY. I ran scTE on my BAM files with hg38.exclusive.idx and got a result, but I don't know whether a result obtained this way is correct.

Thank you very much.

Kui Duan
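
For the naming question above, a minimal sketch of how the two naming schemes could be compared. The pipeline command quoted in another issue on this page contains `sed -r 's/^chr//g'`, and the chromosome lists printed in the logs ('1', '10', ..., 'X', 'Y') carry no prefix, which suggests that a leading `chr` is dropped before counting. The helper names below are hypothetical and are not part of scTE.

```python
def strip_chr(name: str) -> str:
    """Drop a leading 'chr', mirroring sed 's/^chr//' (hypothetical helper)."""
    return name[3:] if name.startswith("chr") else name


def shared_chromosomes(bam_names, index_names):
    """Return the chromosome names present in both sets after normalization."""
    bam = {strip_chr(n) for n in bam_names}
    idx = {strip_chr(n) for n in index_names}
    return sorted(bam & idx)


# BAM uses 1, 2, X; the annotation uses chr1, chr2, chrX, chrY
print(shared_chromosomes(["1", "2", "X"], ["chr1", "chr2", "chrX", "chrY"]))
```

Stripping the prefix does not reconcile every difference: Ensembl-style `MT` and UCSC-style `chrM` would still not match after this step.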

custom transgenes

Brilliant package and super easy to use!

I am trying to add some custom transgenes (TagBFP2, EGFP, TdTomato, etc.), as I have successfully done in the past with CellRanger and STARsolo, but I can't get it to work: the transgenes don't appear, and the TE results differ when I use the custom reference.

Specifically, I use:
awk 'BEGIN{FS=OFS="\t"}{print $6,$7,$8,$11,$3,$10}' mus.rmsk.txt > mm10rmsk.bed

That comes out like this:
chr1 3000000 3002128 L1_Mus3 105 -
chr1 3003152 3003994 L1Md_F 268 -
chr1 3003993 3004054 L1_Mus3 279 -

Then I run scTE_build, with tmp.gtf being the GTF I use to build my STARsolo, kbtools, or CellRanger references with the transgenes as additional chromosomes (I assume this is the problem?):
scTE_build -te mm10rmsk.bed -gene tmp.gtf -o custome

[Gtf tail]
BFP2 BFP2 exon 1 89 . + 0 gene_id "BFP2"; transcript_id "BFP2.1"; gene_name "BFP2";
BFP2 BFP2 transcript 1 89 . + 0 gene_id "BFP2"; transcript_id "BFP2.1"; gene_name "BFP2";
mTom mTom exon 1 670 . + 0 gene_id "mTom"; transcript_id "mTom.1"; gene_name "mTom";
mTom mTom transcript 1 670 . + 0 gene_id "mTom"; transcript_id "mTom.1"; gene_name "mTom";
mGFP mGFP exon 1 207 . + 0 gene_id "mGFP"; transcript_id "mGFP.1"; gene_name "mGFP";
mGFP mGFP transcript 1 207 . + 0 gene_id "mGFP"; transcript_id "mGFP.1"; gene_name "mGFP";

And then:
scTE -p 20 -i /star_out/PBS/Aligned.sortedByCoord.out.bam -o PBScustTE -x /custome.exclusive.idx --hdf5 True -CB CB -UMI UB

The custom transgenes are in the BAM.

But in Scanpy, BFP2, mTom, and mGFP don't appear in var_names alongside the genes and TEs, and the TEs differ slightly from what I see when using your mm10 index.

Any suggestions?

Thanks in advance!

[Bug] Parameter of Readanno in scTEATAC is not correct

The call to Readanno was changed in scTE:

allelement, chr_list, all_annot, glannot = Readanno(filename=outname, annoglb=args.annoglb[0]) #genome=args.genome

but the call in scTEATAC was not updated, as you can see in

scTE/bin/scTEATAC

Line 144 in d9a300e

 allelement, chr_list, all_annot, glannot = Readanno(filename=outname, annoglb=args.annoglb[0], genome=args.genome) 
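
A minimal sketch of why the stale call fails, assuming Readanno now accepts only `filename` and `annoglb`, as the scTE call above implies. The function body is a stand-in and not the real implementation.

```python
def Readanno(filename, annoglb):
    """Stand-in with the signature implied by the scTE call (no 'genome' parameter)."""
    return filename, annoglb


try:
    # Mirrors the scTEATAC-style call that still passes genome=...
    Readanno(filename="out", annoglb="mm10.te.atac.idx", genome="mm10")
except TypeError as err:
    message = str(err)
    print(message)
```

Under that assumption, dropping the `genome=` keyword from the scTEATAC call would make the two call sites agree.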

error when run on test.bam

Hi there,
Thanks for providing such a wonderful pipeline.
I tried it on the test.bam file in the Data folder but got an error. Please see the details below.
The two command lines I used:
scTE_build -g mm10
scTE -i test.bam --min_genes 1 -o out.test -x mm10.exclusive.idx --hdf5 True -CB CR -UMI UR
The log shows an error at 'Calculating expression':

INFO : Parameter list:
Sample = out.test
Reference annotation index = mm10.exclusive.idx
Minimum number of genes required = 1
Minimum number of counts required = None
Number of threads = 1

INFO : Loading the genome annotation index... 2021-07-14 13:43:59
INFO : Loaded 'mm10.exclusive.idx' binary file with 3900779 items
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'M', 'X', 'Y']
INFO : Finished loading the genome annotation index... 2021-07-14 13:52:37
INFO : Processing BAM/SAM files ...2021-07-14 13:52:37
INFO : Input SAM/BAM file appears to be valid
CR UR good
INFO : Done BAM/SAM files processing ...2021-07-14 13:52:52
INFO : Splitting ...2021-07-14 13:52:52
INFO : Executing single thread path
INFO : Finished processing sample files 2021-07-14 13:52:52
INFO : Fetching from the annotation index... 2021-07-14 13:52:52
INFO : Done fetching... 2021-07-14 13:52:54
INFO : Calculating expression... 2021-07-14 13:52:54
Traceback (most recent call last):
  File "/home/setup/zhu/biotools/miniconda3/bin/scTE", line 4, in <module>
    __import__('pkg_resources').run_script('scTE==1.0', 'scTE')
  File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script
    exec(code, namespace, namespace)
  File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/scTE-1.0-py3.7.egg/EGG-INFO/scripts/scTE", line 169, in <module>
    main()
  File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/scTE-1.0-py3.7.egg/EGG-INFO/scripts/scTE", line 155, in main
    len_res, genenumber, filename = Countexpression(filename=args.out, allelement=allelement, genenumber=args.genenumber, cellnumber=args.cellnumber, hdf5=args.hdf5)
  File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/scTE-1.0-py3.7.egg/scTE/base.py", line 505, in Countexpression
    adata = ad.AnnData(np.asarray(data),var = var,obs = obs)
  File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/anndata/_core/anndata.py", line 321, in __init__
    filemode=filemode,
  File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/anndata/_core/anndata.py", line 462, in _init_as_actual
    _check_2d_shape(X)
  File "/home/setup/zhu/biotools/miniconda3/lib/python3.7/site-packages/anndata/_core/anndata.py", line 97, in _check_2d_shape
    f"X needs to be 2-dimensional, not {len(X.shape)}-dimensional."
ValueError: X needs to be 2-dimensional, not 1-dimensional.

Do you have any suggestions on this error?
Thank you very much!

only TE analysis

Hi ! I have another question.

From Figure 1e of your article, I understood that you performed a downstream analysis based only on TE expression.
Does this mean that, before the downstream analysis (Scanpy/Seurat), you filtered the count table produced by scTE to keep only the rows with TE names and exclude the genes?

Thanks for your answer!

Jihed

How to use scTE to analyze the single cell datasets of zebrafish?

Hello!
I want to analyze single-cell RNA sequencing datasets from zebrafish, but I don't know how to set the parameters or modify the code for the zebrafish genome. Can you tell me what to change in order to analyze zebrafish datasets?
Thank you!

gzip: out_scTEtmp/o1/out.bed.gz: invalid compressed data--format violated

I filtered the bam files using this awk command:

samtools view possorted_genome_bam.bam -h | awk '/^@/ || /CB:/ && /UB:/' | samtools view -h -b > possorted_genome_bam.filtered.bam
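
The awk filter above matches `CB:` and `UB:` anywhere in the line. A stricter variant can be sketched in Python, assuming SAM text on input and looking only at the optional fields (column 12 onward). The function names are hypothetical.

```python
def has_tag(fields, tag):
    """True if any optional SAM field (column 12 onward) starts with e.g. 'CB:Z:'."""
    return any(f.startswith(tag + ":Z:") for f in fields[11:])


def keep_barcoded(sam_lines, cb="CB", umi="UB"):
    """Yield header lines plus alignments that carry both the barcode and the UMI tag."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged, like /^@/ in awk
            continue
        fields = line.rstrip("\n").split("\t")
        if has_tag(fields, cb) and has_tag(fields, umi):
            yield line
```

Such a generator could be fed from `samtools view -h` and piped back into `samtools view -b`, in the same way as the awk command.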

then I do:

scTE -i possorted_genome_bam.filtered.bam -o out -x /home/lenail/scTE/hg38.exclusive.idx --hdf5 True -CB CB -UMI UB --thread 4

but I get this error:

DEBUG   : Creating converter from 7 to 5
DEBUG   : Creating converter from 5 to 7
DEBUG   : Creating converter from 7 to 5
DEBUG   : Creating converter from 5 to 7
INFO    : Parameter list:
Sample = /net/bmc-lab5/data/kellis/users/lenail/PFC_aging/scTE/D19-4296/out
Reference annotation index = /home/lenail/scTE/hg38.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 4

INFO    : Loading the genome annotation index... 2022-08-25 21:40:48
INFO    : Loaded '/home/lenail/scTE/hg38.exclusive.idx' binary file with 4778929 items
INFO    : Finished loading the genome annotation index... 2022-08-25 21:41:42

INFO    : Processing BAM/SAM files ...2022-08-25 21:41:42
INFO    : Input SAM/BAM file appears to be valid
INFO    : Done BAM/SAM files processing ...2022-08-25 23:58:40

INFO    : Splitting ...2022-08-25 23:58:40
INFO    : Executing multiple thread path with 4 threads
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9', 'M', 'X', 'Y']
CB UB good


gzip: out_scTEtmp/o1/out.bed.gz: invalid compressed data--format violated

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/lenail/.conda/envs/py39/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/lenail/.conda/envs/py39/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/lenail/.conda/envs/py39/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/scTE/base.py", line 366, in splitChr
    CRs[t[3]] += 1
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lenail/.conda/envs/py39/bin/scTE", line 4, in <module>
    __import__('pkg_resources').run_script('scTE==1.0', 'scTE')
  File "/home/lenail/.conda/envs/py39/lib/python3.9/site-packages/pkg_resources/__init__.py", line 672, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/lenail/.conda/envs/py39/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1472, in run_script
    exec(code, namespace, namespace)
  File "/home/lenail/.conda/envs/py39/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE", line 169, in <module>
    main()
  File "/home/lenail/.conda/envs/py39/lib/python3.9/site-packages/scTE-1.0-py3.9.egg/EGG-INFO/scripts/scTE", line 134, in main
    pool.map(partial_work, chr_list)
  File "/home/lenail/.conda/envs/py39/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/lenail/.conda/envs/py39/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
IndexError: list index out of range

Any ideas?

ERROR on 10* single cell bam

Hi,
I'm not familiar with Python, so I have no idea how to deal with this error. When I run scTE on a BAM file from CellRanger, I get "IndexError: list index out of range" at line 263, in splitAllChrs. Do you know what I should do to solve this? Thanks!

Fewer differentially expressed genes found

When I tried to find differentially expressed genes and TEs for different cell clusters using an expression matrix containing both genes and TEs, I found fewer differentially expressed genes than when using a matrix containing genes only (there is no difference for the TEs). I assume that TE counts are much higher than gene counts, so some differentially expressed genes with low counts can no longer be detected after normalization. Is this assumption reasonable? How should I normalize the data properly to solve this problem?

checkCBUMI may not work as expected

checkCBUMI may not work as expected in

scTE/scTE/base.py

Line 109 in d9a300e

def checkCBUMI(filename,out,CB,UMI):

I found that some records in a BAM file generated by CellRanger do not have a CB:Z tag, so the result in testCR.txt is not equal to 100.

Could you improve the checkCBUMI algorithm?
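
One possible direction, sketched under stated assumptions: instead of expecting every sampled record to carry the tags, compute the fraction that does and compare it with a threshold. This is not how checkCBUMI is implemented. The function name is hypothetical, and the input is assumed to be SAM text lines.

```python
def tagged_fraction(sam_records, cb="CB", umi="UB"):
    """Return the fraction of alignment records carrying both tags (0.0 if there are none)."""
    total = tagged = 0
    for line in sam_records:
        if line.startswith("@"):
            continue  # ignore header lines
        total += 1
        # optional fields look like 'CB:Z:ACGT'; the first two characters are the tag name
        tags = {field[:2] for field in line.rstrip("\n").split("\t")[11:]}
        if cb in tags and umi in tags:
            tagged += 1
    return tagged / total if total else 0.0
```

A caller could then accept the file when the fraction exceeds some cutoff. Where to set that cutoff is a judgment call that the report above does not settle.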

reference hg19

I have built a custom reference for hg19, but when I use the scTE command I can only choose -g hg38. Does this affect my outputs?

Error Creating Custom reference for genome

Hi Jiekai Lab - thank you for developing such a great resource!

I keep running into an issue when creating custom genomes for scRNA analysis. Even when I run the example data with:
scTE_build -te ./Data/TE.bed -gene ./Data/Gene.gtf -o test.idx

I get the error: "Counting genome other not supported"

Could you please help me? I've ruled out installation issues as I don't have any other problems in reproducing the other parts of your pipeline (e.g., with scATAC or built in genomes)

Thank you

Ivan Ferreira
Sauka-Spengler Lab

ImportError: cannot import name 'Literal'

Hi @jphe ,

I am trying to use scTE but get the following error:

DEBUG   : Creating converter from 7 to 5
DEBUG   : Creating converter from 5 to 7
DEBUG   : Creating converter from 7 to 5
DEBUG   : Creating converter from 5 to 7
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/compat/__init__.py", line 65, in <module>
    from typing import Literal
ImportError: cannot import name 'Literal'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/scTE", line 4, in <module>
    __import__('pkg_resources').run_script('scTE==1.0', 'scTE')
  File "/home/asaera/.local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 650, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/asaera/.local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1446, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.6/dist-packages/scTE-1.0-py3.6.egg/EGG-INFO/scripts/scTE", line 13, in <module>
    from scTE.base import *
  File "/usr/local/lib/python3.6/dist-packages/scTE-1.0-py3.6.egg/scTE/base.py", line 16, in <module>
    import anndata as ad
  File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/__init__.py", line 7, in <module>
    from ._core.anndata import AnnData
  File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/_core/anndata.py", line 27, in <module>
    from .raw import Raw
  File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/_core/raw.py", line 11, in <module>
    from .aligned_mapping import AxisArrays, AxisArraysView
  File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/_core/aligned_mapping.py", line 11, in <module>
    from ..utils import deprecated, ensure_df_homogeneous
  File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/utils.py", line 11, in <module>
    from ._core.sparse_dataset import SparseDataset
  File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/_core/sparse_dataset.py", line 23, in <module>
    from ..compat import _read_attr
  File "/usr/local/lib/python3.6/dist-packages/anndata-0.8.0-py3.6.egg/anndata/compat/__init__.py", line 68, in <module>
    from typing_extensions import Literal
  File "/usr/local/lib/python3.6/dist-packages/typing_extensions-4.2.0-py3.6.egg/typing_extensions.py", line 159, in <module>
    class _FinalForm(typing._SpecialForm, _root=True):
AttributeError: module 'typing' has no attribute '_SpecialForm'

The code for scTE is as follows

mainD='/media/path/to/folder/'
inF="${mainD}STARsolo/sample1/"
inBAM="${inF}Aligned.sortedByCoord.out.bam"
outDir="${mainD}scTE/sample1/"
mkdir -p $outDir
scTE -i $inBAM -o $outDir -x "${mainD}hg38.exclusive.idx" -CB CR -UMI UR

idx file was generated as follows

cd ${mainD}
scTE_build -g hg38

This is a 10X chromium V3 experiment and BAM file was generated with STARsolo, I can provide the code if you need it.

I ran multiple tests and always get the same error as above. I tried rewriting the BAM file to move the cell barcode and UMI from CR/UR to CB/UB, I tried using out for -o as in the manual, and I tried both --hdf5 True and --hdf5 False, but the error is always the same.

Any idea or advice would be highly appreciated.

Thanks,

How to make hg38.exclusive.idx

Hi,

Thank you for the resource! I am having trouble getting started using it. Mainly, I can't seem to understand how hg38.exclusive.idx was generated. I would greatly appreciate the help.

scTE does not finish, stuck on Fetching from annotation file

Hi,
Thanks for this software, it's great and intuitive to use.
I was trying to analyze the data from Deng et al., using the first sample as an example. I built the mm10 index with

scTE_build -g mm10 -o /scratch/reference/scTE/mm10

Then I tried to analyze one sample. The FASTQ files were obtained from GEO and aligned using STAR

STAR --genomeDir /scratch/Databases/Mm10/STAR_genome/ \
  --readFilesIn /scratch/singleCell/Mouse/GSE45719_Deng_2014/Fastq//SRR805173.fastq.gz \
  --readFilesCommand gunzip -c --outFileNamePrefix /scratch/Deng/bam/SRR805173/SRR805173 \
   --outSAMtype BAM SortedByCoordinate --outSAMattributes XS --runThreadN 16 \
  --outFilterMultimapNmax 100

Then I tried to run scTE

scTE -i bam/SRR805173/SRR805173Aligned.sortedByCoord.out.bam -o counts_ -g mm10 \
  -x /scratch/reference/scTE/mm10.exclusive.idx -p 16 --expect-cells 300 -CB False -UMI False

However, it has now been stuck for hours after printing INFO: fetching from the annotation index 2021-03-17 16:05:27

Do you have any idea what could cause this?
Best

TE_genes_id.mm10.txt?

Hi, thanks for providing such a useful tool! I want to use it to analyze my single-cell data. Line 44 of the example code 3.diffexp.py uses the file 'TE_genes_id.mm10.txt', but I couldn't find this file in the main text or the supplementary material.
How can I get the equivalent file for hg38 so that I can continue with the next step of the analysis?

scTE building custom reference index ERROR!

Hi, @jphe
It's the best tool! I encountered the same problem as #3, but with Macaca mulatta.
gene annotation file was downloaded from http://ftp.ensembl.org/pub/release-104/gtf/macaca_mulatta/Macaca_mulatta.Mmul_10.104.gtf.gz,
and repeatmask file was downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/rheMac10/database/rmsk.txt.gz .

Then I converted the RepeatMasker file into a six-column BED file with awk 'BEGIN{FS=OFS="\t"}{print $6,$7,$8,$11,$3,$10}' rmsk.txt > mmul10rmsk.bed and made sure the chromosome names were consistent with the gene annotation file.
Lastly, I built the index with scTE_build -te mmul10rmsk.bed -gene Macaca_mulatta.Mmul_10.104.gtf -o Mmul_10scTE.idx.
However, I got the error: Counting genome other not supported.

Any tips are appreciated !
Thank you for your generous help!

Custom reference for non-human, non-mouse genome

Hi, the ReadMe file says "If you want to use your customs reference, you can use the -gene -te options:". We understood this to mean that your code can be used on genomes other than mouse and human. We tried this command to build the index:
scTE_build -te /path/to/hsal_v8.5_filtered_unique_ids.bed -gene /path/to/hsal_v8.5_genes_update16.gtf -o /path/to/scTE_build_1.idx
and we got the following error message:
scTE_build: error: the following arguments are required: -g/--genome
In the ReadMe file example the -g argument is not supplied for building a custom index. Why is it required? Any tips are appreciated. Thank you.

scTE analysis Bulk RNA-seq data

Hello, JiekaiLab team
Thank you for this great software for TE research. We see that scTE can be used to quantify TE reads from bulk RNA-seq data with the setting -CB False -UMI False. My question is whether the scTE output values need to be corrected with the EDASeq software.

Thank you very much!

Kui Duan

Error when running scTEATAC

Hello,

I'm trying to run this on my scATAC-seq sample and I'm running into this issue:
TypeError: Readanno() got an unexpected keyword argument 'genome'
I built the hg38 index the same way the README builds the mm10 index, then ran:
scTEATAC -i filtered.bam -x hg38.te.atac.idx -p 16 -CB -g hg38
Any idea why this happens?

Thanks,
Chang

args.out <---> filename

Hello,
I read your research paper on TE quantification in single-cell data. Very witty!

It prompted me to try out your pipeline, but it crashed with an error very close to the end. I looked at the source code on GitHub, and I think I have found the cause.

At line 155, where the Countexpression function is called, the filename argument is given args.out, which is the output path specified by the user. Throughout the rest of the pipeline, outname (the basename of args.out) is used, and all temporary files are saved to the working directory with outname as the prefix of the tmp folder.

Thus, when Countexpression is called with args.out, it looks for the temporary files under incorrect paths.
From base.py:

def Countexpression(filename, allelement, genenumber, cellnumber, hdf5):
    gene_seen = allelement

    whitelist={}
    o = gzip.open('%s_scTEtmp/o4/%s.bed.gz'%(filename, filename), 'rt')

Temporarily changing args.out to outname in the Countexpression call solved the problem for me, and the pipeline ran smoothly. All output files were saved to the directory from which the script had been run.

This error only occurs if args.out is given as a path, and it does not surface until the very end of the pipeline. I would suggest changing args.out to outname in the Countexpression call at line 155 to prevent this unexpected behaviour.
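
The path mismatch described above can be reproduced with the format string quoted from base.py. The `-o` value below is a hypothetical example, and `tmp_bed_path` is a helper introduced only for illustration.

```python
import os

template = "%s_scTEtmp/o4/%s.bed.gz"  # format string quoted from base.py


def tmp_bed_path(filename):
    """Build the temporary BED path the way the quoted gzip.open call does."""
    return template % (filename, filename)


out = "results/sample1/out"      # hypothetical value passed with -o
outname = os.path.basename(out)  # the basename, as used elsewhere in the pipeline

print(tmp_bed_path(out))      # results/sample1/out_scTEtmp/o4/results/sample1/out.bed.gz
print(tmp_bed_path(outname))  # out_scTEtmp/o4/out.bed.gz
```

When `-o` is a bare name, the two results coincide. That matches the report's observation that the error only appears when a path is given.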

limit of 10000 cells?

Hi, I'm trying to analyze single cell data and I expect more than 10,000 cells per sample. However, for these samples, they each say 10,000 cells detected in the output. Is there a cutoff at 10,000 or a way to get around this?
Thank you.

incomplete tmp folder

there should be a folder named samplename_tmp, and you can see the files under that folder

Originally posted by @jphe in #4 (comment)

Hi,

I'm wondering why I don't have the o3 and o4 folders under the tmp folder.
https://user-images.githubusercontent.com/52441289/114651476-a0599980-9d16-11eb-93e5-f2b8fe60ae09.png

broken pipe while processing

Hi all, I am trying scTE on some scRNA-seq data of mine (hg38). I have BAM files generated with STARSolo and following the instructions I'm quantifying TE like this:

scTE -i ${SAMPLE}Aligned.sortedByCoord.out.bam -o ${SAMPLE}_TE -x hg38.exclusive.idx --hdf5 True -CB CR -UMI UR -p 8

All samples are being processed but in some log files I'm finding this message:

[…]
INFO    : Loading the genome annotation index... 2021-11-18 11:13:28
INFO    : Loaded '/beegfs/scratch/ric.cosr/cittaro.davide/Ref/scTE/hg38/hg38.exclusive.idx' binary file with 4779764 items
INFO    : Finished loading the genome annotation index... 2021-11-18 11:14:06 

INFO    : Processing BAM/SAM files ...2021-11-18 11:14:06
INFO    : Input SAM/BAM file appears to be valid
sed: couldn't write 50 items to stdout: Broken pipe
sed: couldn't write 53 items to stdout: Broken pipe
sed: couldn't write 58 items to stdout: Broken pipe
awk: cmd. line:1: (FILENAME=- FNR=131567431) fatal: print to "standard output" failed (Broken pipe)

The forked process

samtools view -@ 8 HCT116_FOLFIRI_LTAligned.sortedByCoord.out.bam | awk '{OFS="?"}{for(i=12;i<=NF;i++)if($i~/CR:Z:/)n=i}{for(i=12;i<=NF;i++)if($i~/UR:Z:/)m=i}{print $3,$4,$4+100,$n,$m}' | sed -r 's/CR:Z://g' | sed -r 's/UR:Z://g'| sed -r 's/^chr//g' | awk '!x[$4$5]++' | gzip -c > HCT116_FOLFIRI_LT_TE_scTEtmp/o1/HCT116_FOLFIRI_LT_TE.bed.gz

is still running (apparently) but, compared to other processes launched in the same moment, it seems I'm stuck in generating the content of o1 folder. Any hint?
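
For readers trying to follow what the forked command does, here is a rough Python rendering of its steps. It assumes the fields reach the final awk step separated as intended; the `?` in the process listing may stand for a non-printable separator, which is a guess. The function name is hypothetical, and the skip for records without tags is an addition of this sketch that the shell command does not contain.

```python
def bam_records_to_bed(sam_lines, cb="CR", umi="UR"):
    """Yield (chrom, start, start + 100, barcode, umi) for the first read of each barcode+UMI."""
    seen = set()
    for line in sam_lines:
        f = line.rstrip("\n").split("\t")
        # optional fields look like 'CR:Z:ACGT': tag name is x[:2], value is x[5:]
        tags = {x[:2]: x[5:] for x in f[11:]}
        if cb not in tags or umi not in tags:
            continue  # assumption: skip untagged reads (not part of the shell command)
        key = (tags[cb], tags[umi])
        if key in seen:
            continue  # like awk '!x[$4$5]++': keep only the first read per barcode+UMI
        seen.add(key)
        chrom = f[2][3:] if f[2].startswith("chr") else f[2]  # like sed 's/^chr//'
        start = int(f[3])
        yield (chrom, start, start + 100, tags[cb], tags[umi])
```

The de-duplication key uses only the barcode and UMI, not the position, so one record per barcode and UMI pair is kept across the whole file.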

Work with Seurat and bam files merge

Very nice work! I have two BAM files from CellRanger (possorted_genome_bam.bam). How can I merge these two BAM files for scTE? Also, because I ran Seurat before scTE, how can I transfer the Seurat analysis results, such as the cell annotation and cell embedding, to the scTE output? Thanks.

KeyError: 'M'

Hi.

I am confused by a KeyError: 'M'. I have used scTE with the same BAM file before, but now it reports this error. I tried a few times and reinstalled from git, but the bug still exists.

DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
INFO : Parameter list:
Sample = out
Reference annotation index = /home/data/xxxx/xxxxxx/scte/macFas5.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 1

INFO : Loading the genome annotation index... 2022-03-08 12:10:31
INFO : Loaded '/home/data/xxxxx/xxxxx/scte/macFas5.exclusive.idx' binary file with 4428688 items
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '3', '4', '5', '6', '7', '8', '9', 'X']
INFO : Finished loading the genome annotation index... 2022-03-08 12:11:28

INFO : Processing BAM/SAM files ...2022-03-08 12:11:28
INFO : Input SAM/BAM file appears to be valid
CR UR good

INFO : Done BAM/SAM files processing ...2022-03-08 13:26:22

INFO : Splitting ...2022-03-08 13:26:22
INFO : Executing single thread path
Traceback (most recent call last):
  File "/home/data/xxxx/miniconda3/envs/xxxxxxx/bin/scTE", line 4, in <module>
    __import__('pkg_resources').run_script('scTE==1.0', 'scTE')
  File "/home/data/xx/miniconda3/envs/xxxxxx/lib/python3.8/site-packages/pkg_resources/__init__.py", line 651, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/data/xxx/miniconda3/envs/xxxxxx/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1448, in run_script
    exec(code, namespace, namespace)
  File "/home/data/xxx/miniconda3/envs/xxxxx/lib/python3.8/site-packages/scTE-1.0-py3.8.egg/EGG-INFO/scripts/scTE", line 169, in <module>
    main()
  File "/home/data/xxxxx/miniconda3/envs/xxxxxx/lib/python3.8/site-packages/scTE-1.0-py3.8.egg/EGG-INFO/scripts/scTE", line 129, in main
    whitelist = splitAllChrs(chr_list, filename=outname, genenumber=args.genenumber, countnumber=args.countnumber, UMI=args.UMI)
  File "/home/data/xxxxxxxx/miniconda3/envs/xxxxxx/lib/python3.8/site-packages/scTE-1.0-py3.8.egg/scTE/base.py", line 260, in splitAllChrs
    if line in uniques[chrom]:
KeyError: 'M'

How can I fix it?

Thanks and best wishes,
Chris
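
One reading of the log above: the chromosome list printed for the macFas5 index contains no 'M', while the lookup fails on exactly that key, which would be consistent with the BAM containing reads on a chromosome the index lacks. The sketch below illustrates that situation with toy data. The variable names are hypothetical and this is not scTE's code.

```python
index_chroms = ["1", "2", "X"]              # shortened stand-in for the printed list
uniques = {c: set() for c in index_chroms}  # hypothetical per-chromosome lookup table

reads = [("1", "r1"), ("M", "r2"), ("X", "r3")]  # toy (chromosome, read) pairs

# A direct uniques["M"] lookup would raise KeyError: 'M'.
# Checking membership first avoids that.
kept = [(chrom, read) for chrom, read in reads if chrom in uniques]
skipped = [chrom for chrom, _ in reads if chrom not in uniques]

print(kept)
print(skipped)
```

If that reading is right, two possible workarounds are to remove such reads from the BAM beforehand or to build the index from an annotation that includes the chromosome.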

Impossible to read output file

Hello,
Thanks for scTE!
I ran it on my mouse scRNA-seq data, but I haven't been able to read the output file because it is too big.
I have a 60 GB .csv for each of my samples, and I can't read them even on a cluster with 500 GB of RAM.
As I only want to analyze the expression of transposable elements (specifically the ERVs), is there a way to obtain the counts for the TEs only, and/or to filter out the genes and TEs with 0 counts, so that the output file is smaller?

(When I tried to run it with --hdf5 True, it failed because it ran out of memory even with --mem=200G.)

thanks a lot for your help!
Javiera.
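
A large CSV does not have to be loaded in one piece to be reduced. The sketch below streams the file row by row and keeps only selected columns. It assumes one row per cell, with the first column holding the barcode and the remaining columns named after genes and TEs; if the table is laid out the other way round, the same idea would apply to rows. The function name, column names, and counts are hypothetical toy values.

```python
import csv
import io


def filter_columns(src, dst, wanted):
    """Copy a CSV from src to dst row by row, keeping column 0 and the columns in `wanted`."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    keep = [0] + [i for i, name in enumerate(header) if i > 0 and name in wanted]
    writer.writerow([header[i] for i in keep])
    for row in reader:  # only one row is held in memory at a time
        writer.writerow([row[i] for i in keep])


# Toy table standing in for the real output file
src = io.StringIO("barcodes,Gapdh,MERVL-int,L1Md_F\nAAAC,5,2,0\nAAAG,1,0,3\n")
dst = io.StringIO()
filter_columns(src, dst, wanted={"MERVL-int", "L1Md_F"})
print(dst.getvalue())
```

With real files, `src` and `dst` would be handles from `open(...)`. The set of TE names could be taken from the name column of the TE BED file used to build the index.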

scTE: error: argument -CB: invalid choice: 'CB' (choose from 'True', 'False')

Hello,

Thanks for scTE. I ran into a problem when using it, and I suspect it is caused by the "-CB" and "-UMI" parameters.

the error is: scTE: error: argument -CB: invalid choice: 'CB' (choose from 'True', 'False')

My commands are:
scTE -i /Volumes/Backup\ Plus/F19FTSSCWLJ0261_10X/bam/F03_PBS_MG.possorted_genome_bam.bam -o /Volumes/Backup\ Plus/F19FTSSCWLJ0261_10X/bam/F03_PBS_MG.possorted_genome_bam_scTE -x mm10.exclusive.idx --hdf5 True -CB CB -UMI UB -g mm10

and part of my bam files are:
E00490:510:HYC2TCCXY:5:2105:25022:45646 256 1 3044966 0 30S121M * 0 0 AAGCAGTGGTATCAACGCAGAGTACATGGGAGAGAAAAACAAACCTGGGTATGCCTCGTAGTTAAAACATTCCTGGGAACATCTTGACCATAAGATAAAGGGGACTGTGAAGACATAGCAGGGCTATATTATCTAAGTCAACACCATCTGG AAFFFJFJJJJJJJJJJJJJJJJJJJJJJJJJJFAFJJFFJFFFJJJFFAFJFFJ7FFFJJJJJJFJJJJJFJJJJFJJJJJJJJFJJJJF7FJFJF-JJFJJ-<A<AFJJJJJJJFJJFJJJJJJ7AFFJ<FJAAAAAF<JJJJJFJJJJ NH:i:6 HI:i:3 AS:i:119 nM:i:0 NM:i:0 CR:Z:GCGCCAACATTGAGCT CY:Z:AAAFFFJJJJJJJJJJ CB:Z:GCGCCAACATTGAGCT-1 UR:Z:ATAGCAAGCA UY:Z:JJJJJJJFFJ UB:Z:ATAGCAAGCA BC:Z:CCTTTGTC QT:Z:AAFFFJJJ TR:Z:TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCATTACAAGAGGGCGTTTAATTTCGGGGGTTCAAACCGACATTCATCCCACACAAGAGATAATGGGAGTCGGCCACGGGGGCCGCAAGACAGGG TQ:Z:JJJJJFJJJJJJJJJJJJJJJJJJJJJJJ<--<-A--<----<)---AF--A---7<)))))7----7--))7--7-7<F<))---7-<-7<F--7--))7)<))))-)--)-)-))<A----<F RG:Z:F03_PBS_MG:MissingLibrary:1:HYC2TCCXY:5

I saw in your Readme.txt file that the values of the "-CB" and "-UMI" parameters are "CB" and "CR", but my terminal says that only "False" or "True" can be chosen.

I'm looking forward to your reply, Thanks a lot!

Xiaoyu
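
The wording of the error above matches what Python's argparse prints for an option declared with a fixed set of `choices`. One possible explanation, offered as an assumption, is that the installed scTE declares `-CB` that way while the README describes a version that accepts tag names. The two parsers below are hypothetical stand-ins that reproduce the difference.

```python
import argparse

# Stand-in for a parser that only accepts True/False
strict = argparse.ArgumentParser(prog="demo")
strict.add_argument("-CB", choices=["True", "False"], default="True")

# Stand-in for a parser that accepts any tag name
free = argparse.ArgumentParser(prog="demo")
free.add_argument("-CB", default="CR")

print(free.parse_args(["-CB", "CB"]).CB)  # any tag name is accepted

try:
    strict.parse_args(["-CB", "CB"])  # argparse reports "invalid choice" and exits
except SystemExit as exc:
    exit_code = exc.code
```

Comparing the output of `scTE -h` with the README would show which of the two forms the installed version expects.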

Error when running scTEATAC: 'genome required'

Hi,

I have a problem running the command 'scTEATAC'. When running
scTEATAC -i input.bam -o outfile -x mm10.te.atac.idx, I receive the error

scTE_scatacseq: error: the following arguments are required: -g/--genome

However, when running the command scTEATAC -i input.bam -o outfile -x mm10.te.atac.idx -g mm10 another error occurs:

TypeError: Readanno() got an unexpected keyword argument 'genome'

Could you please provide an example on how to run scTE on the output from Cellranger-atac (10x scATAC-seq data)? The example provided in the README does not work for me.

Thanks and best wishes,
Malte

scTE freezes after beginning BAM file processing

Hello! I am trying to use scTE on BAM files generated from STARsolo. The tool gets to the point where it says that the BAM files look good but does not progress. I have ensured that I have installed the latest version = 1.0 from the JiekaiLab git and have tried adjusting the -p from 1 to 8 with no success. Below is the output from the tool, any advice on how to fix this would be appreciated!

/wynton/home/greenelab/mkinisu/scripts/scTE_virtualenv/bin/python
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
DEBUG : Creating converter from 7 to 5
DEBUG : Creating converter from 5 to 7
INFO : Parameter list:
Sample = /wynton/scratch/mkinisu/Weili_HIV_STAR/Weili_HIV/WKWG04_C1_S9_L004
Reference annotation index = /wynton/home/greenelab/mkinisu/ref/scTE/hg38.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 8

INFO : Loading the genome annotation index... 2022-03-10 15:51:28
INFO : Loaded '/wynton/home/greenelab/mkinisu/ref/scTE/hg38.exclusive.idx' binary file with 4750078 items
INFO : Finished loading the genome annotation index... 2022-03-10 15:52:31

INFO : Processing BAM/SAM files ...2022-03-10 15:52:31
INFO : Input SAM/BAM file appears to be valid
