gaolabtools / scnanogps Goto Github PK

Single cell Nanopore sequencing data for Genotype and Phenotype

License: Other

Python 92.56% Shell 2.08% R 5.36%

cell-barcode demultiplexing gene-expression isoforms long-read-sequencing nanopore-analysis-pipeline rna-seq-pipeline single-cell single-nucleotide-variation umi-curation

scnanogps's People

Contributors

Stargazers

Watchers

Forkers

docmab23 nathanaelandrews

scnanogps's Issues

Low quality reads filtering

Hi, thanks for developing this amazing tool! I have one question regarding the scanner part.

One of the incentives in using ONT is to profile highly repetitive regions. However, I was not able to find any repeat expansion in the final output alignment files. So I went back to the raw fastq files and extracted reads carrying at least X number of repeats and store them in a separate fastq file. When I try to run scanner on these reads, the log would look something like this.

Parameters for pattern search:
	Length of barcode:             16
	Length of UMI:                 12
	5'-adaptor sequence:           AAGCAGTGGTATCAACGCAGAGTACAT
	3'-adaptor sequence:           CTACACGACGCTCTTCCGATCT
	PolyT sequence:                TTTTTTTTTTTT
	Scanning region length:        100

Penalty for dynamic programming:
	Matching:                      2
	Mismatching:                   -3
	Gap opening:                   -5
	Gap extension:                 -2
	Editing distance:              2

Parameters for computing:
	Number of computer cores:      16
	Number of reads per batch job: 1000
	Minimal length of read:        200
	Matching threshold:            0.7
	Scoring threshold:             0.4

Debug mode switch:             False

Total 717 reads are processed.
Time elapse: 0 : 0 : 0.57
Detecting rate: 0.00%

Result counting:
	Number of 3'-adaptor located on the read head region:           	0
	Number of 3'-adaptor + polyT on the read head region:           	0
	Number of 3'-adaptor located on the read tail region:           	0
	Number of 3'-adaptor + polyT on the read tail region:           	0

Alignment counting:
	Number of 3'-adaptor having no mismatch:                            	0

	Number of 3'-adaptor having mismatch at the last one position:      	0
	Number of 3'-adaptor having mismatch at all the last two position:  	0
	Number of 3'-adaptor having mismatch at all the last three position:	0

	Number of 3'-adaptor having in/del at the last one position:        	0
	Number of 3'-adaptor having in/del at the last two position:        	0
	Number of 3'-adaptor having in/del at the last three position:      	0

	Number of rescued truncated 3'-adaptor on the read head region: 	0
	Number of rescued truncated 3'-adaptor on the read tail region: 	0

Finish time stamp: Mon, 13 Nov 2023 18:39:39

Could you elaborate a bit more on how this step works and why none of the reads are retained afterward? These reads are of lower quality score due to the repetitiveness. My guess is because within the 100 bp scanning window of both 5' and 3' we are not getting the barcodes, adaptors etc. due to the low quality so they are not retained. But could you clarify how this works? Thanks!

matrix_SNV.raw.vcf.gz unzipped in Step 5

Hi,
first of all, thank you for developing scNanoGPS. I am trying to do SNV calling from the raw fastq files of our sequencing. I was running into an issue in the 5th step (5.3. single cell SNV profile). This is the traceback of the error:

Merge Longshot result...
Merging part 10 of 10 ...
Generating final matrix...
Filtering by prevalence: 0.01 ...
Traceback (most recent call last):
File "/hpc/pmc_holstege/oscar/scNanoGPS/snvcalling/../reporter_SNV.py", line 540, in
merge_longshot(CB_list, options)
File "/hpc/pmc_holstege/oscar/scNanoGPS/snvcalling/../reporter_SNV.py", line 136, in merge_longshot
filter_by_prevalence(options)
File "/hpc/pmc_holstege/oscar/scNanoGPS/snvcalling/../reporter_SNV.py", line 62, in filter_by_prevalence
fh = bgzf.open(os.path.join(options.o_dir, options.o_pref) + '.raw.vcf.gz', 'rt')
File "/hpc/pmc_holstege/oscar/anaconda3/envs/scNanoGPS/lib/python3.9/site-packages/Bio/bgzf.py", line 273, in open
return BgzfReader(filename, mode)
File "/hpc/pmc_holstege/oscar/anaconda3/envs/scNanoGPS/lib/python3.9/site-packages/Bio/bgzf.py", line 617, in init
self._load_block(handle.tell())
File "/hpc/pmc_holstege/oscar/anaconda3/envs/scNanoGPS/lib/python3.9/site-packages/Bio/bgzf.py", line 644, in _load_block
block_size, self._buffer = _load_bgzf_block(handle, self._text)
File "/hpc/pmc_holstege/oscar/anaconda3/envs/scNanoGPS/lib/python3.9/site-packages/Bio/bgzf.py", line 444, in _load_bgzf_block
raise ValueError(
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'##fi'; handle.tell() now says 4

Looking into the scNanoGPS_res dir, I see that both matrix_SNV.raw.vcf.gz and matrix_SNV.filtered.vcf.gz have been created, but when I do 'file matrix_SNV.raw.vcf.gz' in the terminal this is the output: 'matrix_SNV.raw.vcf.gz: Variant Call Format (VCF) version 4.2, ASCII text, with very long lines'; and matrix_SNV.filtered.vcf.gz is empty. My intuition is that this file is not being zipped and is causing the problem, but I am not really sure how it works.

Do you have any suggestions on what might be causing the problem? Thank you a lot!

summary.txt of example input data

Hi Cheng-kai,

Could you confirm that the summary.txt below for example data is correct?

While using the scNanoGPS with my test data, I am also using it with the example data (demo data/toy data) in GitHub page. I come to the final 5 Reporter process without error.
I would like to know that the result is correct or not, so that I am confident that the installation of scNanoGPS and the handling commands are correct.

ead yield:                  7731
Valid read number:           7731
Detecting rate:              100.0%

Median read length:          821.0
Mean read length:            984.92
Maximal read length:         5546
Median cell barcode quality: 23.31
Mean cell barcode quality:   22.74

Cell number:                 56
Raw reads per cell:          138.05
UMI counts:                  7047
Median UMI counts per cell:  124.0
Mean UMI counts per cell:    125.84
Median gene number:          53.0
Mean gene number:            49.36

Exonic:                      13.18%
Intronic:                    70.94%
Intergenic:                  15.88%

Best regards,
Minoru

request of an option to output unassigned read fastq file at the scanner and assigner process

Hi Cheng-Kai,

This is the request.

if you could add an option to output unassigned read fastq file at the scanner process and the assigner process, the option will be helpful to investigate/check the reason of unassignment in case the detecting rates are lower then expected.

At this moment, I am able to get the same information by differencial comparison between input fastq data and processed.fastq / the barcode_list. But it is helpful if I can just use the option without doing anyfurther process after the scanner and the assigner.

Best regards,
Minoru

Questions on running the example data

Thank you for creating such a nice tool! This definitely aid the single-cell long-reads research. I am running the steps in tutorial/ run_scNanoGPS.sh with the example data.

The resulted "matrix.tsv" is a gtf-like file without any count information.

Geneid Chr Start End Strand Length gene_name
ENSG00000279973 chr22 11066418 11068174 + 1757 CU104787.1
ENSG00000280341 chr22 15282557 15288670 - 6114 AP000542.3
ENSG00000279442 chr22 15298378 15304556 - 6179 AP000542.2

The resulted "matrix_isoform.tsv" is completely empty.

The resulting log:
reporter_isoform.py -t 20 --liqa_ref example/GRCh38_chr22.liqa.refgene
Output directory: scNanoGPS_res
Temporary directory: tmp
Reference of LIQA: example/GRCh38_chr22.liqa.refgene
Batch LIQA spends 0 : 0 : 0.03

Please let me know whether this is normal as the example data is subsampled or I am having some problems in my installation steps.

suggestion - ressources needed

Hi,

Thank you for your work and this tool that is very useful.

Would it be possible to have more info about current ressources needed for each steps and an approximate time of run ? Because i'm working in a HPC and i must share my ressources with other so i can't use 400Go of RAM for one entire week.

Thank you in advance

Error during scanner

Hi! Thanks for developing this amazing tool.

I am running into this issue when trying to run scanner on my fastq files.

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/software/python-anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/software/python-anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/mra7674/scNanoGPS/scanner_core/searching_core.py", line 128, in ten_nano_workflow
    h3_ps_res = precise_search(na_seq_header, options.adaptor_three_p, 0, options.scan_region, options.scoring_threshold, options.dp_penalty)
  File "/home/mra7674/scNanoGPS/scanner_core/searching_core.py", line 58, in precise_search
    seqA  = str(alignment_res[0].seqA)
AttributeError: 'tuple' object has no attribute 'seqA'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mra7674/scNanoGPS/scanner.py", line 70, in <module>
    tmp_data = pool.map(partial(searching_core.ten_nano_workflow, options = options), batch_data)
  File "/software/python-anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/software/python-anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
AttributeError: 'tuple' object has no attribute 'seqA'

Here is a screenshot of what my fastq file looks like. Could you help us?

Thank you!

step4 curator: ValueError: file has no sequences defined (mode='rb')

Hi author,

Thanks for this very useful tool!

I had this problem several times and also tried to solve it, but failed. Here is the issue:

Traceback (most recent call last):
File "./scNanoGPS/curator.py", line 175, in
seq_dict = curator_io.build_read_seq_dict(os.path.join(options.tmp_dir, fq_pref) + ".minimap2.bam")
File "./scNanoGPS/curator_core/curator_io.py", line 193, in build_read_seq_dict
bam_f = pysam.AlignmentFile(bam_name, "rb")
File "pysam/calignmentfile.pyx", line 340, in pysam.calignmentfile.AlignmentFile.cinit
File "pysam/calignmentfile.pyx", line 589, in pysam.calignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_seq=True

The file might not be in SAM/BAM format after fast.gz were generated for minimap2 mapping.
I added 'check_seq=True'. in 'bam_f = pysam.AlignmentFile(bam_name, "rb")' -->> 'bam_f = pysam.AlignmentFile(bam_name, "rb", check_seq=True)', but it is an unexpected keyword argument.

Therefore, could you please give me some suggestions to address minimap2 alignment step? I would appreciate it very much.

Thanks so much.

Best,
Lily

CB number of gene filters

Hi I've noticed in your paper that you've wrote "CBs with less than 300 genes are filtered out by default". However, I could not locate the corresponding codes in the scanner and Assigner steps. I only found it in the reporter_expression.py. Can I assume all CB were retained regardless of the number of UMPs or genes per cells? So I can do the typical QC in Seurat later? Thanks a lot for making this flexible tool!! Best,
Hsiao-Lin

Quick question about application of master Fastq file

Hi Cheng-Kai,

First of all, thank you so much for your shiny tool for sc long read data.
Now I've wanted to investigate their isoform diversity.
So, as I think (after reading your paper carefully), the master fastq file or merged consensus bam (using samtools merge) could be used in Flair to make novel gtf file.
Could you please give me your opinion about the strategy?

Thank you!

Cannot set up scNanoGPS environment

Hello,

When following the environment creation instructions, the command pip3 install -r requirements.txt fails with something related to pysam. The error text says the system cannot find the file specified, and that the problem likely does not originate from pip. This is on a brand new windows 10 machine using the anaconda powershell to run the commands. I'm using python 3.12.3. The full output is below:

(scNanoGPS) PS D:\scNanoGPS> pip3 install -r requirements.txt
Collecting biopython (from -r requirements.txt (line 1))
  Using cached biopython-1.83-cp312-cp312-win_amd64.whl.metadata (13 kB)
Collecting distance (from -r requirements.txt (line 2))
  Using cached Distance-0.1.3-py3-none-any.whl
Collecting liqa (from -r requirements.txt (line 3))
  Using cached liqa-1.3.4-py3-none-any.whl.metadata (2.2 kB)
Collecting matplotlib (from -r requirements.txt (line 4))
  Using cached matplotlib-3.8.4-cp312-cp312-win_amd64.whl.metadata (5.9 kB)
Collecting pandas (from -r requirements.txt (line 5))
  Using cached pandas-2.2.2-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting pysam (from -r requirements.txt (line 6))
  Using cached pysam-0.22.1.tar.gz (4.6 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [40 lines of output]
      # pysam: cython is available - using cythonize if necessary
      # pysam: htslib mode is shared
      # pysam: HTSLIB_CONFIGURE_OPTIONS=None
      '.' is not recognized as an internal or external command,
      operable program or batch file.
      '.' is not recognized as an internal or external command,
      operable program or batch file.
      # pysam: htslib configure options: None
      Traceback (most recent call last):
        File "C:\Users\ward12369user2\AppData\Local\miniconda3\envs\scNanoGPS\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
          main()
        File "C:\Users\ward12369user2\AppData\Local\miniconda3\envs\scNanoGPS\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\ward12369user2\AppData\Local\miniconda3\envs\scNanoGPS\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\ward12369user2\AppData\Local\Temp\pip-build-env-9yrhlcwm\overlay\Lib\site-packages\setuptools\build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\ward12369user2\AppData\Local\Temp\pip-build-env-9yrhlcwm\overlay\Lib\site-packages\setuptools\build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "C:\Users\ward12369user2\AppData\Local\Temp\pip-build-env-9yrhlcwm\overlay\Lib\site-packages\setuptools\build_meta.py", line 487, in run_setup
          super().run_setup(setup_script=setup_script)
        File "C:\Users\ward12369user2\AppData\Local\Temp\pip-build-env-9yrhlcwm\overlay\Lib\site-packages\setuptools\build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 437, in <module>
        File "<string>", line 81, in run_make_print_config
        File "C:\Users\ward12369user2\AppData\Local\miniconda3\envs\scNanoGPS\Lib\subprocess.py", line 466, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\ward12369user2\AppData\Local\miniconda3\envs\scNanoGPS\Lib\subprocess.py", line 548, in run
          with Popen(*popenargs, **kwargs) as process:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\ward12369user2\AppData\Local\miniconda3\envs\scNanoGPS\Lib\subprocess.py", line 1026, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "C:\Users\ward12369user2\AppData\Local\miniconda3\envs\scNanoGPS\Lib\subprocess.py", line 1538, in _execute_child
          hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      FileNotFoundError: [WinError 2] The system cannot find the file specified
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip

Curator Counting Table Not Generated

I have really appreciated having this tool to use on my long read data, thanks for developing it!

I had a question regarding the output from the Curator step. I have been able to run the curator with consensus for my samples, however the output all remains in the temporary folder, with six files for each cell barcode. No .tsv files were generated in the temporary folder or anywhere else. No error happened, and the job completed well within the time frame I gave it, so it does not appear to have ended prematurely.

I'd like to use the Reporter scripts, but the input needed appears to be a matrix which is not created for any of my samples. Do you know why this might be, or any alternative ways to ensure that the final .tsv file is generated? Otherwise, is there a way for me to use the Reporter without this matrix input?

Thanks,
-Benney

Problem with --min_gene_no option of reporter_expression.py

Hi Cheng-kai,

Thanks for considering my past questions, and appreciate in advance also for this time.
While I was testing a '--min_gene_no' option of the reporter_expression.py, I encountered some problems, and would like to ask you whether you can help me.

I ran the below command to test the --min_gene_no option:

python3 ~/shared/scNanoGPS/reporter_expression.py -t 24 --gtf ~/shared/scNanoGPS/genes.gtf --featurecounts $(which featureCounts) \
-d ${working}/${sample}/ngene_test/ --min_gene_no 1 -o 1_matrix.tsv --log 1_reporter_expression.log.txt --sel_bc_o 1_filtered_barcode_list.txt

python3 ~/shared/scNanoGPS/reporter_expression.py -t 24 --gtf ~/shared/scNanoGPS/genes.gtf --featurecounts $(which featureCounts) \
-d ${working}/${sample}/ngene_test/ --min_gene_no 10 -o 10_matrix.tsv --log 10_reporter_expression.log.txt --sel_bc_o 10_filtered_barcode_list.txt

python3 ~/shared/scNanoGPS/reporter_expression.py -t 24 --gtf ~/shared/scNanoGPS/genes.gtf --featurecounts $(which featureCounts) \
-d ${working}/${sample}/ngene_test/ --min_gene_no 50 -o 50_matrix.tsv --log 50_reporter_expression.log.txt --sel_bc_o 50_filtered_barcode_list.txt

python3 ~/shared/scNanoGPS/reporter_expression.py -t 24 --gtf ~/shared/scNanoGPS/genes.gtf --featurecounts $(which featureCounts) \
-d ${working}/${sample}/ngene_test/ --min_gene_no 250 -o 250_matrix.tsv --log 250_reporter_expression.log.txt --sel_bc_o 250_filtered_barcode_list.txt

python3 ~/shared/scNanoGPS/reporter_expression.py -t 24 --gtf ~/shared/scNanoGPS/genes.gtf --featurecounts $(which featureCounts) \
-d ${working}/${sample}/ngene_test/ -o def_matrix.tsv --log def_reporter_expression.log.txt --sel_bc_o def_filtered_barcode_list.txt (def means default)

and, I got filtered cell numbers like below, which seems very weird:

Initial cell number = 4816
--min_gene_no 1 --> filtered cell number: 4816
--min_gene_no 10 --> filtered cell number: 4816
--min_gene_no 50 --> filtered cell number: 1242
--min_gene_no 250 --> filtered cell number: 2112
--min_gene_no false (maybe 300?) --> filtered cell number: 4058

In my assumption, if min_gene cut is low, the number of filtered cells should high, but for 50 and default, they seems to be out of the trand.

Also, when I analyzed them with Seurat, I found some weird plots like below:

It appears the matrix.tsv file have problem for --min_gene_no 50 & 250 (gene_cut_50&gene_cut_250) when see the result.

To clarify, I ran these two times in the same run and did another run to separately run them with different nodes, but the results were always same.

Could you help me to find the reasons of these problems of this code?

Best,

Dongin

Can we setup the length for Scanner.py

Hello, I read your bioRxiv paper of "Delineating genotypes and phenotypes of individual cells from longread
single cell transcriptomes" and would like to try your scNanoGPS software for our ONT single cell data.
For the step2 Scanner, you setup to scan the first and last 100 nucleotides of reads to recognize TruSeq Read 1 and PolyA by default. I wonder if we can setup first and last 150 or 200 nucleotides for scanning. Because when I use the default parameter, majority of my reads are filtered out although the sequence quality looks good.
Look forward to your reply! Thank you!

error message at the end of "5.4 Generate final summary table", low java memory.

Hello Cheng-Kai,

I come to the "5.4 Generate final summary table" process! (I skipped the 5.3 SNV profile at this moment.)
But unfortunatelly, I got another error message, and has not come to the "summary.txt"...

Below is the stdout after run the reporter_summary.py
The error message come up at the final of this process.
Please see the bottom of the stdout below.

It looks that memory which set to Java memory in qualimap process is low.
Is there any solution to this errow message? for example, assign more Java memory in qualimap?

Parsing scanner log ...
Done.

Computing read length ...
Done.

Calculating quality score ...
Done.


Merging all bam file for qualimap...

100 of 100 files...
[samfaipath] build FASTA index...
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
[samfaipath] fail to build FASTA index.
[E::sam_parse1] missing SAM header
[W::sam_read1] Parse error at line 1
[main_samview] truncated file.
Done.

/home/minoruyano/opt/local/scNanoGPS/reporter_summary.py:216: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  expr_df = pd.read_csv(options.exp_tb, header = 0, sep = '\t', skiprows =1, compression = options.compression)
Calling Qualimap...
Java memory size is set to 1200M
Launching application...

OpenJDK 64-Bit Server VM warning: Ignoring option MaxPermSize; support was removed in 8.0
QualiMap v.2.3
Built on 2023-05-19 16:57

Selected tool: rnaseq
Wed Apr 17 16:34:34 JST 2024            WARNING Output folder already exists, the results will be saved there

Initializing regions from /home/minoruyano/reference_genome_and_annotations/Homo_sapiens.GRCh38.111.gtf...

Initialized 100000 regions...
Initialized 200000 regions...
Initialized 300000 regions...
Initialized 400000 regions...
Initialized 500000 regions...
Initialized 600000 regions...
Initialized 700000 regions...
Initialized 800000 regions...
Initialized 900000 regions...
Initialized 1000000 regions...

WARNING: out of memory!
Qualimap allows to set RAM size using special argument: --java-mem-size
Check more details using --help command or read the manual.
Traceback (most recent call last):
  File "/home/minoruyano/opt/local/scNanoGPS/reporter_summary.py", line 242, in <module>
    fh = open(os.path.join(options.tmp_dir, "master_rnaseq_qc", "rnaseq_qc_results.txt"), "rt")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'tmp_barcode09_GRCh38dnatoplevel/master_rnaseq_qc/rnaseq_qc_results.txt'

Best regards,
Minoru

"edditing_distance" option in scanner.py

Hi Cheng-Kai,

I am sorry to bother you again, but I have other two issues of question and request.
This is the question.

What does the "editing distance" in scanner.py mean?
we are able to set the option.

  --editing_distance=ALLOW_EDITING_DISTANCE
                        Editing distance for cell barcode detection. Default: 2

ALthough I am able to understand distances in assigner and curator while calcurating them in assiging CB and collasping UMI, It is not clear for me to understand the distance while the scanner process.

I compared the scanner process with default "2", and "0" and "5". The result in log file about the numbers of reads are same. No difference between the result of the three settings.

Best regards,
Minoru

Split the fastq file into 24 chromosomes

Hi,
Thank you for developing scNanoGPS!
I have a question about the pipeline.
Due to the large amount of memory it consumes, can I split the fastq file into 24 chromosomes to run this pipeline?

Best wishes,
Kirito

curator.py error

commond:
python ~/soft/scNanoGPS/curator.py -t 50 --ref_genome ~/ref/ssc/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa
error:
Mapping reads for each individual cell barcode ...
rm: cannot remove 'tmp/GAAATGACACTCTGCT.unsorted.bam': No such file or directory
[E::hts_open_format] Failed to open file "tmp/GAAATGACACTCTGCT.minimap2.bam" : No such file or directory
Traceback (most recent call last):
File "/home/data/fuli14/soft/scNanoGPS/curator.py", line 175, in
seq_dict = curator_io.build_read_seq_dict(os.path.join(options.tmp_dir, fq_pref) + ".minimap2.bam")
File "/home/data/fuli14/soft/scNanoGPS/curator_core/curator_io.py", line 193, in build_read_seq_dict
bam_f = pysam.AlignmentFile(bam_name, "rb")
File "pysam/libcalignmentfile.pyx", line 748, in pysam.libcalignmentfile.AlignmentFile.cinit
File "pysam/libcalignmentfile.pyx", line 947, in pysam.libcalignmentfile.AlignmentFile._open
FileNotFoundError: [Errno 2] could not open alignment file tmp/GAAATGACACTCTGCT.minimap2.bam: No such file or directory

difference between the number of sum of UMI counts in CB_counting.tsv and the numbers of result counting in scanner.log.txt

Hi Cheng-kai,

I have a question about difference between the numbers of "Result counting" in scanner.log.txt and the number of sum of all UMI counts in CB_counting.tsv.

In one of my project;

The "Result counting" in scanner.log txt is as below
Number of 3'-adaptor located on the read head region: 1,850,644
Number of 3'-adaptor + polyT on the read head region: 1,815,982
Number of 3'-adaptor located on the read tail region: 1,757,142
Number of 3'-adaptor + polyT on the read tail region: 1,732,253
I regard the sum of the number of 3'-adaptor + polyT on either region is reasonable as reads which will be used in the next step, Assigner. This number is 3,548,235 (= 1,815,982 + 1,732,253)

The number of sum of all UMI counts in CB_counting.tsv is 2,824,755

Why the two numbers, 2,824,755 and 3,548,235, are different?, and what kind of stracture difference are in the reads?

long curator times

Hi Cheng-Kai,

I am running into very long times (5+ days) when running the curator step, even with increasing the thread count to 80. Is there any interest in adding a parameter that can be set to limit the number of reads that each consensus sequence is made from?

Thanks for making this pipeline, it works great otherwise!
-Evan

One UMI shared by two reads simultaneously

Hi Cheng-Kai,

I found that barcode.curated.minimap2.bam with one UMI shared by two reads simultaneously.

Normally, “The reads that share same UMIs are collapsed to generate consensus sequences of individual molecules using software, SPOA.” If I didn't understand the mistake, there should be no shared UMI reads after the curator.py step. Or if this situation is normal, can both these two reads be used for counting? But the sequences of these two reads are almost the same, and I think it's just due to PCR amplification.
Could you give some advice?

Best wishes,
Kirito

header of CB_merged_list.tsv.gz is weird when using force cell option

Hi,

When I ran "assigner.py" forced_no option, the header of CB_merged_list.tsv.gz is weird like a picture of below:

the 1:1 was added next to the header. But when I removed the forced_no option, it was fine.
For now I manually move 1:1 to the next line before going to next steps, but would it be solved?

Thanks.

Running time of scNanoGPS v1.1 assigner.py

Hi Cheng-Kai,
Thank you for updating scNanoGPS!
I used scNanoGPS-1.1 and found that it is much faster in assigner.py compared to scNanoGPS-1.0. It may have taken a week before, but now it only takes nearly 2 hours. Is this normal?
![image](https://github.com/gaolabtools/scNanoGPS/assets/87375686/e2bcd29b-8f78-4285-beca-05d835dfc166)
Here is my code:
`python $scNanoGPS_dir/assigner.py -i $Output_dir/barcode_list.tsv.gz -d $Output_dir -t 35`
I don't think the new parameters will have any issues with my original code.

Thank you for your reply!
Best wishes,
Kirito

Originally posted by @kir1to455 in #8 (comment)

IsoSeq and MASseq

Hello,
Can scNanoGPS be used with data generated by PacBio platforms such as scIsoSeq or MASseq?

Thank you
Yoav

Minimap2 parameters and secondary alignments

I used scNanoGPS to analyze Nanopore data with 10X Visium cDNA. The results looked good However I have a few questions about the Minimap2 aligner and what parameter to use
The scNanoGPS pipeline by default uses --ax splice in minimap2. However when I use that, some transcripts identified by 10X Visium as primary alignments seem to be secondary alignments/multimapping with minimap2 (although I would expect more confident mapping with long reads)
For some transcripts, the strandedness does not seem to be consistent between 10X Visium and Minimap2
- I tried using --ax map-ont as well. But here some important transcripts I found with Visium and --ax splice are not found.
- We also did qPCR for some transcripts and the expression seems more consistent with that of 10X Visium and alignments from --ax splice. The primers designed were very specific to the transcripts with no off-targets elsewhere but don't understand why minimap does not interpret them as primary alignments.

      I also read that splice is more suitable for RNA-seq data while map-ont for genomic data. However, if I use --ax splice I am not sure how to justify/interpret the multimapping and strand inconsistency. Could you please advise on these?

Thanks,
Prakrithi

Is the toolkit only compatible with 3 prime adapters?

I was testing scNanoGPS on some 5 prime data that I have and noticed that it is only picking up <10% of reads at the scanner step, and only reporting on the 3 prime adapter identified. It's not readily clear, but from this observation and the manuscript it seems that it's currently only working for 3 prime data?

Curator script taking huge time

Hi, I am using scNanoGPS. I have finished the initial steps and am now running the curator.py script. I has been running and it seems it will take lot of time to finish.

Fastq file size - 109GB

Commands
python3 scanner.py -i fastq_pass/ -d scNanoGPS_res -t 16

python3 ../../assigner.py -i barcode_list.tsv.gz -d scNanoGPS_res -t 16 --forced_no 10000

python3 ../../curator.py -d scNanoGPS_res -t 16 --ref_genome Homo_sapiens.GRCh38.dna.toplevel.fa --keep_meta 1

Is there any way to reduce the processing time? Is my run going well?

bioconda recipe?

conda is leveraged for installing (some) dependencies (at least, the recommended install method), but conda is not fully leveraged via a bioconda recipe for installing scNanoGPS.

Are there plans on creating a bioconda recipe for scNanoGPS in order to simplify the install to just:

mamba install bioconda::scnanogps

A template created via ChatGPT:

package:
  name: scNanoGPS
  version: "1.0.0"

source:
  git_url: https://github.com/gaolabtools/scNanoGPS.git
  git_rev: main

build:
  number: 0
  script: "{{ PYTHON }} -m pip install . --no-deps --ignore-installed -vv"

requirements:
  build:
    - {{ compiler('c') }}
    - {{ compiler('cxx') }}
    - python
    - pip

  host:
    - python >=3.7
    - biopython >=1.79
    - distance >=0.1.3
    - matplotlib >=3.5.2
    - pandas >=1.4.2
    - pysam >=0.19.0
    - seaborn >=0.11.2
    - liqa

  run:
    - python >=3.7
    - biopython >=1.79
    - distance >=0.1.3
    - matplotlib >=3.5.2
    - pandas >=1.4.2
    - pysam >=0.19.0
    - seaborn >=0.11.2
    - liqa

test:
  imports:
    - scNanoGPS
  commands:
    - scNanoGPS --help

about:
  home: https://github.com/gaolabtools/scNanoGPS
  license: MIT
  summary: "A pipeline for single-cell nanopore long-read sequencing data analysis."
  description: "scNanoGPS is a pipeline designed for the analysis of single-cell nanopore long-read sequencing data, focusing on mapping, collapsing reads, and summarizing gene expression, isoforms, and SNVs."

extra:
  recipe-maintainers:
    - your-github-username

Running isoform quantification when skipping curation step

Hi,

Thanks for developing this great tool! I have a quick question regarding the output of reporter_isoform.py when curation is skipped.

First, even though I skipped the curation step, there still are *.curated.minimap2.bam in the temporary directory. Is this the expected behavior?

Second, after the reporter isoform step successfully finishes, the output table is a bit difficult to interpret. Could you explain the row names and what each individual value means in the matrix? I think the ENST* part is transcript ID but could you clarify?

Here is a screenshot of what it looks like for the first five columns and the first twenty rows.

Thanks so much!
Charles

Error while testing currator.py

Hi @gaobio,
Thank you for developing scNanoGPS pipeline!
I am wondering if you can advise why I get this error when trying to run the test example:

Mapping reads for each individual cell barcode ... [E::hts_open_format] Failed to open file tmp/GCTTACCTCAGGTCCA.unsorted.bam samtools sort: can't open "tmp/GCTTACCTCAGGTCCA.unsorted.bam": No such file or directory [E::hts_open_format] Failed to open file tmp/GCTTACCTCAGGTCCA.minimap2.bam samtools index: failed to open "tmp/GCTTACCTCAGGTCCA.minimap2.bam": No such file or directory [E::hts_open_format] Failed to open file tmp/GCTTACCTCAGGTCCA.minimap2.bam Traceback (most recent call last): File "curator.py", line 173, in <module> os.path.join(options.tmp_dir, fq_pref[0]) + ".high_softclipping.bam") File "/export/home1/ScNaUmi-seq_B2022/scNanoGPS/curator_core/curator_io.py", line 44, in filter_softclipping bam_i = pysam.AlignmentFile(input_bam, "rb") File "pysam/libcalignmentfile.pyx", line 741, in pysam.libcalignmentfile.AlignmentFile.__cinit__ File "pysam/libcalignmentfile.pyx", line 940, in pysam.libcalignmentfile.AlignmentFile._open FileNotFoundError: [Errno 2] could not open alignment file 'tmp/GCTTACCTCAGGTCCA.minimap2.bam': No such file or directory

This is the Minimap2 output log for the first cell:

[WARNING]�[1;31m Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.�[0m
[M::main::0.249*1.02] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.322*1.02] mid_occ = 238
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.374*1.01] distinct minimizers: 5012420 (87.89% are singletons); average occurrences: 1.472; average spacing: 6.888; total length: 50818468
[M::worker_pipeline::1.114*3.34] mapped 229 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -ax splice -t 8 example/GRCh38_chr22.mmi tmp/GCTTACCTCAGGTCCA.fastq.gz
[M::main] Real time: 1.130 sec; CPU: 3.740 sec; Peak RSS: 1.324 GB

I have correctly installed all required dependencies. BTW, This example works on my local machine (macOS) but not in my Linux server.

I would be very grateful if you could provide a Docker file or a complete conda env file!

Thank you
Ali

reporter_SNV problem

Hi,

Thanks for making the scNanoGPS. I can successfully run reporter_expression and reporter_isoform. but when run the reporter_SNV, i got below error. please advice how to fix it. may i know which tabix version you are using. i originally use bcftools 1.20, it does not work. then i downgrade to bcftools 1.15. then i got error as below.

"""
python3 reporter_SNV.py -t 2 --ref_genome example/GRCh38_chr22.fa