luyitian / flames

Full-length transcriptome splicing and mutation analysis

License: GNU General Public License v3.0

Python 65.83% C++ 16.80% C 17.36%

flames's Introduction

FLAMES

Full-length transcriptome splicing and mutation analysis

Installation

The easiest way to install the dependencies for this package is using a conda environment. The scripts can then be cloned from git.

conda create -n FLAMES \
    "python>=3.7" samtools pysam minimap2 numpy editdistance \
    -c bioconda -c conda-forge
git clone https://github.com/LuyiTian/FLAMES.git

Usage for bulk data

Before using this software, remember to activate the FLAMES conda environment. The main scripts are the pipelines for single cells and bulk samples.

conda activate FLAMES
FLAMES/python/sc_long_pipeline.py --help
FLAMES/python/bulk_long_pipeline.py --help

An example has been included with a small subset of SIRV data in the examples folder.

PATH=$PWD/FLAMES/python:$PATH

cd examples/SIRV
bulk_long_pipeline.py \
    --gff3 data/SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf \
    --genomefa data/SIRV_isoforms_multi-fasta_170612a.fasta \
    --outdir FLAMES_output \
    --config_file data/SIRV_config.json \
    --fq_dir data/fastq

# output data is in FLAMES_output
ls FLAMES_output

Usage for single cell data

For single-cell data, the first step is to find the cell barcode in each long read. You can use the precompiled Linux binary in src/bin/match_cell_barcode. If you encounter an error, especially one caused by a different running environment, you can compile it from source with the following command. Please use a C++ compiler that supports C++11 features.

g++ -std=c++11 -lz -O2 -o match_cell_barcode ssw/ssw_cpp.cpp ssw/ssw.c match_cell_barcode.cpp kseq.h edit_dist.cpp

You can then run ./match_cell_barcode without arguments to print the help message. match_cell_barcode requires a folder that contains all the FASTQ files (they can be .gz files), a file name/path for the barcode-matching statistics, a CSV file of cell barcodes to use as the reference, and a file name/path for the output fastq.gz file. The cell barcodes for 10x data are in filtered_feature_bc_matrix/barcodes.tsv.gz; please unzip the file and use barcodes.tsv as input. It also supports the scPipe cell barcode annotation file generated by the sc_detect_bc function.

In match_cell_barcode, the flanking sequence (CTACACGACGCTCTTCCGATCT) is aligned to the first 30000 reads to identify the region where the cell barcode is likely to be found. Next, sequences within this region are matched to the barcodes in barcodes.tsv, allowing a Hamming distance of up to MAX_DIST (the 5th argument of match_cell_barcode). Reads that are successfully matched to a barcode are reported in the barcode hm match count. Reads that could not be matched in this step are aligned to the flanking sequence individually to locate the barcode, and barcode matching is then done with a Levenshtein distance of up to MAX_DIST (allowing indels). Reads matched in this step are reported in the fuzzy match count.
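The two matching stages described above can be sketched in Python. This is only an illustration of the idea, not the C++ implementation in match_cell_barcode: the function names (hamming, levenshtein, closest, two_stage_match) are hypothetical, the rule that ties are rejected is an assumption, and the per-read re-alignment to the flanking sequence is left out.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance allowing substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def closest(seq, barcodes, max_dist, distance):
    """Return the single closest barcode within max_dist, else None."""
    scored = sorted((distance(seq, bc), bc) for bc in barcodes)
    if not scored or scored[0][0] > max_dist:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # assumption: ambiguous best hits are rejected
    return scored[0][1]


def two_stage_match(window, whitelist, max_dist):
    """Try a Hamming match first, then fall back to Levenshtein."""
    same_len = [bc for bc in whitelist if len(bc) == len(window)]
    hit = closest(window, same_len, max_dist, hamming)
    if hit is not None:
        return hit, "hm"
    hit = closest(window, whitelist, max_dist, levenshtein)
    if hit is not None:
        return hit, "fuzzy"
    return None, "unmatched"


whitelist = ["AAAACCCCGGGGTTTT", "TTTTGGGGCCCCAAAA"]
print(two_stage_match("AAAACCCCGGGGTTTA", whitelist, 2))  # one substitution
print(two_stage_match("AAACCCCGGGGTTTTA", whitelist, 2))  # one deleted base
```

The first read differs from a barcode by a single substitution, so the Hamming stage is enough; the second has lost one base, which shifts every later position and is only recovered by the indel-aware stage.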

Once you have the FASTQ file from match_cell_barcode, you can run sc_long_pipeline.py. Its usage is as follows.

usage: FLTSA [-h] -a GFF3 [-i INFQ] [-b INBAM] --outdir OUTDIR --genomefa
             GENOMEFA --minimap2_dir MINIMAP2_DIR [--config_file CONFIG_FILE]
             [--downsample_ratio DOWNSAMPLE_RATIO]

# semi-supervised isoform detection and annotation from long read data.
# output:
# outdir:
#   transcript_count.csv.gz   // transcript count matrix
#   isoform_annotated.filtered.gff3 // isoforms in gff3 format
#   transcript_assembly.fa // transcript sequence from the isoforms
#   align2genome.bam       // sorted bam file with reads aligned to genome
#   realign2transcript.bam // sorted realigned bam file using the
#                            transcript_assembly.fa as reference
#   tss_tes.bedgraph       // TSS TES enrichment for all reads (for QC)
################################################################

optional arguments:
  -h, --help            show this help message and exit
  -a GFF3, --gff3 GFF3  The gene annotation in gff3 format.
  -i INFQ, --infq INFQ  input fastq file.
  -b INBAM, --inbam INBAM
                        aligned bam file (should be sorted and indexed). it
                        will overwrite the `--infq` parameter and skip the
                        first alignment step
  --outdir OUTDIR, -o OUTDIR
                        directory to deposite all results in rootdir, use
                        absolute path
  --genomefa GENOMEFA, -f GENOMEFA
                        genome fasta file
  --minimap2_dir MINIMAP2_DIR, -m MINIMAP2_DIR
                        directory contains minimap2, k8 and paftools.js
                        program. k8 and paftools.js are used to convert gff3
                        to bed12.
  --config_file CONFIG_FILE, -c CONFIG_FILE
                        json configuration files (default
                        config_sclr_nanopore_default.json)
  --downsample_ratio DOWNSAMPLE_RATIO, -d DOWNSAMPLE_RATIO
                        downsampling ratio if performing downsampling analysis

Configuration file

FLAMES provides a default set of parameters, which can be changed through the configuration JSON file. The pipeline_parameters section specifies which steps of the pipeline are executed; by default all steps are run. The isoform_parameters section controls isoform detection. Key parameters include Min_sup_cnt (transcripts with fewer aligned reads than Min_sup_cnt are discarded) and MAX_TS_DIST (transcripts with the same intron chain and a TSS/TES distance less than MAX_TS_DIST are merged). strand_specific specifies whether the reads are strand specific: 1 if the reads are on the same strand as the mRNA, -1 if they are the reverse complement, and 0 if the reads are not strand specific, in which case the method determines the strand from the reference annotation.
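As a minimal sketch of how such a change could be prepared, the Python snippet below copies a configuration dictionary and overrides selected values in isoform_parameters. The helper name with_overrides is hypothetical, and the numbers are placeholders that only show the mechanics; they are not recommended settings.

```python
import copy
import json


def with_overrides(config, overrides):
    """Return a deep copy of config with per-section values replaced.

    Unknown keys raise KeyError so that a misspelled parameter name
    is noticed instead of being silently added.
    """
    new_config = copy.deepcopy(config)
    for section, values in overrides.items():
        for key, value in values.items():
            if key not in new_config[section]:
                raise KeyError(f"unknown parameter {section}.{key}")
            new_config[section][key] = value
    return new_config


# Placeholder values; a real run would start from a full FLAMES config file.
base = {
    "isoform_parameters": {
        "MAX_TS_DIST": 120,
        "Min_sup_cnt": 10,
        "strand_specific": 0,
    }
}

relaxed = with_overrides(
    base, {"isoform_parameters": {"Min_sup_cnt": 3, "strand_specific": 1}}
)
print(json.dumps(relaxed, indent=4))
```

The printed JSON could be saved to a file and passed to the pipeline with --config_file. Raising on unknown keys is a design choice of this sketch, made because parameter names such as Min_sup_cnt and MAX_TS_DIST mix upper and lower case and are easy to mistype.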


flames's Issues

Demultiplexing issues

Hi Luyi,

The pipeline is great! Thanks for the effort and for sharing it.

I have tried FLAMES on your published data and our own in-house data, and have two questions:

For match_cell_barcode, the "output cell barcode statistics file" always misses the first barcode in the "whitelist" file. Is this a bug?
For single-cell long-read data, when the poly-A tail is in the read, match_cell_barcode should search for the barcode and UMI in the suffix instead of the prefix of the read, right? I did find some cases where match_cell_barcode still searched and trimmed the prefix.

Looking forward to your feedback.

Thanks,
Yan

transcript id name

Hi, thanks for developing FLAMES, very nice tool.

One question about the transcript_count.csv.gz output, I got the result like this:

Where the transcript id name is quite weird, do you know how to solve it? Thanks!

match_cell_barcode qnames too long

Minimap2/Samtools is throwing an error on reads that have the cell barcode/UMI appended to their names (generated by match_cell_barcode).

[E::sam_parse1] query name too long
[W::sam_read1_sam] Parse error at line 8760987
samtools sort: truncated file. Aborting

Here is an example qname:

@CTACGGGAGAGCTTTC_CGATAAGACCCA#ACATCGAGTCAAACGG_GCACATCTTGGC#GTAGAGGAGCGGGTTA_AGGCACCTATGT#AGTACTGAGAGTCAGC_CTCAGCCAGTAA#TGTCCCAGTTACCGTA_ATCGTACCAGTC#AATCGTGTCGACATCA_ACTCAAGGCCAT#CGAGAAGGTTCGGCGT_TACGCCAGTCTG#GCTGCAGCACATGGTT_TGATTATGCCTC#CCGTAGGCAGACTGCC_CTCTCGCATACA#TAAGTCGCAGGAGGTT_TAACTATTTACG#TCGTAGATCACTACGA_AGACGCAAATTT#GTCGAATAGGTTACAA_ACAAATTGTTTC#ACAAGCTCAGGCGTTC_CGTTGCCTATAT#GTGCACGAGGATAATC_CAGGAGTCAGAA#AGGATAAAGGTATCTC_CCAATCGCTTTA#GTCATGAGTCCTCCTA_AGCTCAAACACT#GACTTCCCAAAGTATG_GCCCACTTGCTG#TGTACAGTCAACCGAT_TGAAGCATCCAC#TGAGGTTTCAAGGACG_GGACCAAGTCGG#TTACGCCCAGCCATTA_AATCACCGCTCG#ATATCCTCACAATGAA_AATTATCTCTTT#CCACACTCAATAGGGC_CACCTATTTTTT#TCTCTGGCAAACACGG_GCCCCTGCATAG#ATATCCTGTATTCCGA_AATTATGAACTT#TCCCATGGTTGCGGAA_AAATTACAATCC#AGTAGTCTCGTCTCAC_CCATGATTCACG#CTAACCCGTGGCCTCA_ATTTACAGATGA#32fd44aa-9033-40d6-a233-bf43ece68751

Looks like qname must be equal to or shorter than 254 characters: samtools/samtools#1081
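One way to find the affected reads before alignment is to scan the FASTQ read names for anything longer than that limit. The sketch below is an assumption-laden illustration: the helper name long_qnames is hypothetical, the input is assumed to use the plain 4-line FASTQ layout, and the read name is taken to be the header text up to the first whitespace.

```python
MAX_QNAME_LEN = 254  # limit discussed in samtools/samtools#1081


def long_qnames(fastq_lines, limit=MAX_QNAME_LEN):
    """Yield (record_number, qname_length) for read names longer than limit.

    fastq_lines is any iterable of text lines in 4-line FASTQ layout.
    """
    for line_no, line in enumerate(fastq_lines):
        if line_no % 4 != 0:
            continue  # only the first line of each record holds the name
        fields = line[1:].split()  # drop the leading "@", cut at whitespace
        qname = fields[0] if fields else ""
        if len(qname) > limit:
            yield line_no // 4 + 1, len(qname)


# Two synthetic records: one over-long name and one ordinary name.
records = [
    "@" + "A" * 300 + "#read1\n", "ACGT\n", "+\n", "IIII\n",
    "@ACGTACGT_TTTT#read2\n", "ACGT\n", "+\n", "IIII\n",
]
print(list(long_qnames(records)))
```

For a real file, the same function can be fed an open handle, for example one returned by gzip.open(path, "rt") for a fastq.gz file.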

Config file parameters

Hi,
I am currently using FLAMES and a few other assemblers (flair and bookend) to compare them against each other and find out which would be best for my data and workflow (Drosophila nanopore sequences). Currently I am facing the issue that my FLAMES-based transcriptomes are surprisingly small (after correction and filtering, roughly 2500 isoforms against flair's 16000), even with the same references and sequencing files. I think this may be due to the config file, which I honestly just copied from GitHub. What would you recommend as parameters, and what should be changed to perhaps solve this? Would it be appropriate to be less strict, and how would I enforce this in the config file?

Best,
Hasan.

Cluster annotation file

Hi there,

Thank you for creating this amazing tool!

I am trying to utilize the DTU analysis script from the FLTseq_data directory, and I am just wondering how I can get the cluster_annotation.csv file?

(Line 80-82)

cluster_barcode_anno <- read.csv(file.path(data_dir,"cluster_annotation.csv"), stringsAsFactors=FALSE)
  rownames(cluster_barcode_anno) = cluster_barcode_anno$barcode_seq
  comm_cells = intersect(colnames(tr_sce),rownames(cluster_barcode_anno))

Thank you

Adapt match_cell_barcode to custom Barcode and UMI Length

Hi,

Thanks for developing FLAMES.

I have a specific requirement that involves adapting match_cell_barcode function to accommodate different barcode and UMI lengths. Currently, the software assumes a standard barcode length of 16 and a UMI length of 10, based on 10X kit.
I would like to request if it would be possible to modify these parameters according to my needs, since I’m using a custom single-cell ONT library with same flanking sequence (CTACACGACGCTCTTCCGATCT) but barcode length of 11 and UMI of 14 bp respectively.

Regarding the UMI length, it can be specified by command line.
I would like to ask you a feedback:

I modified the source code of match_cell_barcode by substituting all ’16’ occurrences with ’11’.
For the UMI I specified by command line my length (14).
Edit distance allowed : 2 (considering that I have a minimum hamming distance of 3 between my custom barcodes).

Are these modifications correct and sufficient in order to have a proper barcode and UMI assignment, or do I have to change something else in the source code of match_cell_barcode?

Sorry in advance but I’m not an expert in c++.
Best

How to use multiple cores

Hi, is there any way to specify the number of cores for the single-cell run so that it executes faster, for example on a dataset with more than 20 million reads?

Demultiplexing

Can this pipeline also demultiplex reads from cell barcodes?

error in transcript quantification step

Hi, I am getting this error in the final counts matrix generation step:

does anyone know how to circumvent this issue?

b'[bam_sort_core] merging from 9 files and 12 in-memory blocks...\n'
b''
### generate transcript count matrix 2023-12-08 17:21:24
Traceback (most recent call last):
  File "/users/sparthib/flames/python/bulk_long_pipeline.py", line 270, in <module>
    bulk_long_pipeline(args)
  File "/users/sparthib/flames/python/bulk_long_pipeline.py", line 235, in bulk_long_pipeline
    bc_tr_count_dict, bc_tr_badcov_count_dict, tr_kept = parse_realigned_bam(
  File "/users/sparthib/flames/python/count_tr.py", line 114, in parse_realigned_bam
    bc_dict = make_bc_dict(kwargs["bc_file"])
  File "/users/sparthib/flames/python/count_tr.py", line 57, in make_bc_dict
    with open(bc_anno) as f:
FileNotFoundError: [Errno 2] No such file or directory: ''

thanks!

Sowmya

Compilation issue for match_cell_barcode

I get the following error when trying to compile match_cell_barcode.

$ g++ -std=c++11 -lz -O2 -o match_cell_barcode ssw/ssw_cpp.cpp ssw/ssw.c match_cell_barcode.cpp kseq.h edit_dist.cpp

Error

edit_dist.cpp: In function ‘unsigned int edit_distance_bpv(T&, const int64_t*, const size_t&, const unsigned int&, const unsigned int&)’:
edit_dist.cpp:50:5: error: ‘uint64_t’ was not declared in this scope
   50 |     uint64_t top = (1LL << (tlen - 1));
      |     ^~~~~~~~
edit_dist.cpp:2:1: note: ‘uint64_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
    1 | #include "edit_dist.h"
  +++ |+#include <cstdint>
    2 |
edit_dist.cpp:51:13: error: expected ‘;’ before ‘lmb’
   51 |     uint64_t lmb = (1LL << 63);
      |             ^~~~
      |             ;
edit_dist.cpp:62:21: error: expected ‘;’ before ‘X’
   62 |             uint64_t X = PM[r];
      |                     ^~
      |                     ;
edit_dist.cpp:63:38: error: ‘lmb’ was not declared in this scope
   63 |             if(r > 0 && (HN[r - 1] & lmb)) X |= 1LL;
      |                                      ^~~
edit_dist.cpp:63:44: error: ‘X’ was not declared in this scope
   63 |             if(r > 0 && (HN[r - 1] & lmb)) X |= 1LL;
      |                                            ^
edit_dist.cpp:64:24: error: ‘X’ was not declared in this scope
   64 |             D0[r] = (((X & VP[r]) + VP[r]) ^ VP[r]) | X | VN[r];
      |                        ^
edit_dist.cpp:68:38: error: ‘lmb’ was not declared in this scope
   68 |             if(r == 0 || HP[r - 1] & lmb) X |= 1LL;
      |                                      ^~~
edit_dist.cpp:70:38: error: ‘lmb’ was not declared in this scope
   70 |             if(r > 0 && (HN[r - 1] & lmb)) VP[r] |= 1LL;
      |                                      ^~~
edit_dist.cpp:73:23: error: ‘top’ was not declared in this scope
   73 |         if(HP[tmax] & top) ++D;
      |                       ^~~
edit_dist.cpp: At global scope:
edit_dist.cpp:82:5: error: ‘uint64_t’ does not name a type
   82 |     uint64_t arr_[N];
      |     ^~~~~~~~
edit_dist.cpp:82:5: note: ‘uint64_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
edit_dist.cpp:83:5: error: ‘uint64_t’ does not name a type
   83 |     uint64_t & operator[](size_t const &i) {
      |     ^~~~~~~~
edit_dist.cpp:83:5: note: ‘uint64_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
edit_dist.cpp: In instantiation of ‘unsigned int edit_distance_map_(const int64_t*, size_t, const int64_t*, size_t) [with long unsigned int N = 1; int64_t = long int; size_t = long unsigned int]’:
edit_dist.cpp:113:48:   required from here
edit_dist.cpp:95:59: error: no match for ‘operator[]’ (operand types are ‘std::map<long int, varr<1>, std::less<long int>, std::allocator<std::pair<const long int, varr<1> > > >::mapped_type’ {aka ‘varr<1>’} and ‘size_t’ {aka ‘long unsigned int’})
   95 |         for(size_t j = 0; j < 64; ++j) cmap[a[i * 64 + j]][i] |= (1LL << j);
      |                                        ~~~~~~~~~~~~~~~~~~~^
edit_dist.cpp:97:60: error: no match for ‘operator[]’ (operand types are ‘std::map<long int, varr<1>, std::less<long int>, std::allocator<std::pair<const long int, varr<1> > > >::mapped_type’ {aka ‘varr<1>’} and ‘unsigned int’)
   97 |     for(size_t i = 0; i < tlen; ++i) cmap[a[tmax * 64 + i]][tmax] |= (1LL << i);
      |                                      ~~~~~~~~~~~~~~~~~~~~~~^
edit_dist.cpp: In instantiation of ‘unsigned int edit_distance_map_(const int64_t*, size_t, const int64_t*, size_t) [with long unsigned int N = 2; int64_t = long int; size_t = long unsigned int]’:
edit_dist.cpp:114:53:   required from here
edit_dist.cpp:95:59: error: no match for ‘operator[]’ (operand types are ‘std::map<long int, varr<2>, std::less<long int>, std::allocator<std::pair<const long int, varr<2> > > >::mapped_type’ {aka ‘varr<2>’} and ‘size_t’ {aka ‘long unsigned int’})
   95 |         for(size_t j = 0; j < 64; ++j) cmap[a[i * 64 + j]][i] |= (1LL << j);
      |                                        ~~~~~~~~~~~~~~~~~~~^
edit_dist.cpp:97:60: error: no match for ‘operator[]’ (operand types are ‘std::map<long int, varr<2>, std::less<long int>, std::allocator<std::pair<const long int, varr<2> > > >::mapped_type’ {aka ‘varr<2>’} and ‘unsigned int’)
   97 |     for(size_t i = 0; i < tlen; ++i) cmap[a[tmax * 64 + i]][tmax] |= (1LL << i);
      |                                      ~~~~~~~~~~~~~~~~~~~~~~^
edit_dist.cpp: In function ‘unsigned int scutil::edit_distance1(const int64_t*, unsigned int, const int64_t*, unsigned int)’:
edit_dist.cpp:115:1: warning: control reaches end of non-void function [-Wreturn-type]
  115 | }
      | ^

Here's some info about my environment, if it's relevant.

Environment information

$ g++ --version
g++ (GCC) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ ldd --version
ldd (GNU libc) 2.28
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

$ uname -a
Linux hostname 4.18.0-477.13.1.el8_8.x86_64 #1 SMP Tue May 30 22:15:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Applying FLAMES to PacBio

Hello,

I was trying to use FLAMES in an isoform characterization benchmarking study with a single sample but, since I am new to the long-read world, it is not yet clear to me which key parameters I need to consider in the configuration file. After running FLAMES I found my isoform_filtered gff3 file almost empty. This is my output data:

        SIZE           DATE              FILE

2444550950 Jul 13 19:28 align2genome.bam
3252184 Jul 13 19:28 align2genome.bam.bai
16 Jul 13 19:43 isoform_annotated.filtered.gff3
15122175 Jul 13 19:32 isoform_annotated.gff3
61 Jul 13 19:43 isoform_FSM_annotation.csv
3534807677 Jul 13 18:11 merged.fastq.gz
59 Jul 13 18:11 pseudo_barcode_annotation.csv
1505577353 Jul 13 19:41 realign2transcript.bam
3221800 Jul 13 19:41 realign2transcript.bam.bai
98666092 Jul 13 19:33 transcript_assembly.fa
2062401 Jul 13 19:33 transcript_assembly.fa.fai
118564 Jul 13 19:42 transcript_count.bad_coverage.csv.gz
186937 Jul 13 19:42 transcript_count.csv.gz
3617886 Jul 13 19:32 tss_tes.bedgraph

My input parameters and data were:
--gff3 gencode.v40.annotation.gtf (human annotations)
--genomefa GRCh38.primary_assembly.genome.fa. (human reference genome)
--outdir FLAMES_output/
--fq_dir fastq/ (path to my directory containing my unique fastq file)

I am not using any configuration file so FLAMES is applying other parameters by default and I guess this is the main problem for me since it is designed for ONT. So my question would be, which are the best parameters for running an analysis with PacBio files? Which are your recommendations?

Below I paste a config file I used for ONT data, so you can indicate whether these are all the parameters I need to correct for PacBio or whether there are extra parameters to consider.

{
    "pipeline_parameters":{
        "do_genome_alignment":true,
        "do_isoform_identification":true,
        "do_read_realignment":true,
        "do_transcript_quantification":true
    },
    "global_parameters":{
        "generate_raw_isoform":false,
        "has_UMI":false
    },
    "isoform_parameters":{
        "MAX_DIST":10,
        "MAX_TS_DIST":120,
        "MAX_SPLICE_MATCH_DIST":10,
        "min_fl_exon_len":40,
        "Max_site_per_splice":3,
        "Min_sup_cnt":10,
        "Min_cnt_pct":0.001,
        "Min_sup_pct":0.2,
        "strand_specific":0,
        "remove_incomp_reads":5
    },
    "alignment_parameters":{
        "use_junctions":true,
        "no_flank":false
    },
    "realign_parameters":{
        "use_annotation":true
    },
    "transcript_counting":{
        "min_tr_coverage":0.3,
        "min_read_coverage":0.3
    }
}

Thank you very much for your help in advance and my apologies for such basic question!
Best,
AP

config file

Hi,
Nice work! Congrats!

Two questions:

1- Can I use the config file from the example ("SIRV_config.json") to run my human datasets?

2- Also, I could not activate the environment after installing your software, the error below, and I guess I still can run it without activating the env. Is that correct? I am not getting any error if I run your software without activating the env.

I tried to export the env path and also ran "conda.sh" before running the command with no luck.
Here is what I get if I try to activate the env:

conda activate FLAMES

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.

empty transcript_assembly.fa file using example SIRV data

Hi,
I tried running the example in the example folder of the python installation of FLAMES which gives an error:
subprocess.CalledProcessError: Command '['samtools faidx FLAMES_output/transcript_assembly.fa']' returned non-zero exit status 1.
I noticed that the transcript_assembly.fa file is empty. In the get_transcript_seq function in gff3_to_fa.py, it exits the first for loop right away as the following statement is false: if ch not in chr_to_gene:. However, it also does not enter the next for loop (for tr_seq in global_seq_dict:) because the dictionary is empty. I'd really appreciate your help.

GTF format error?

Do I need a specific GFF/GTF format?

I am getting this error:

Traceback (most recent call last):
  File "/FLAMES/python/bulk_long_pipeline.py", line 243, in <module>
    bulk_long_pipeline(args)
  File "/FLAMES/python/bulk_long_pipeline.py", line 171, in bulk_long_pipeline
    gff3_to_bed12(args.minimap2_dir, args.gff3, tmp_bed)
  File "/FLAMES/python/minimap2_align.py", line 17, in gff3_to_bed12
    print subprocess.check_output([cmd], shell=True, stderr=subprocess.STDOUT)
  File "/miniconda/lib/python2.7/subprocess.py", line 223, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['paftools.js gff2bed /gnet/is6/p04/data/dnaseq/analysis/led13/genomes/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf > /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6310_q10_l300_flames/tmp.splice_anno.bed12']' returned non-zero exit status 1

here is the head of my GTF file:

#gtf-version 2.2
#!genome-build GRCh38
#!genome-build-accession NCBI_Assembly:GCA_000001405.15
#!annotation-date 01/25/2019
#!annotation-source NCBI Homo sapiens Updated Annotation Release 109.20190125
chr1    BestRefSeq      gene    11874   14409   .       +       .       gene_id "DDX11L1"; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";
chr1    BestRefSeq      exon    11874   12227   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1"; exon_number "1";
chr1    BestRefSeq      exon    12613   12721   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1"; exon_number "2";
chr1    BestRefSeq      exon    13221   14409   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1"; exon_number "3";
chr1    BestRefSeq      gene    14362   29370   .       -       .       gene_id "WASH7P"; db_xref "GeneID:653635"; db_xref "HGNC:HGNC:38034"; description "WAS protein family homolog 7, pseudogene"; gbkey "Gene"; gene "WASH7P"; gene_biotype "transcribed_pseudogene

FSM and FSM-match to ref

Ciao Luyi

Thanks again for nice work,
could you please let me know what the difference is between FSM (which, following the SQANTI definition, is an isoform that matches a reference transcript at every splice junction) and the FSM_annotation file in the FLAMES output? In this annotation file there is a column called FSM-match to ref; if the file is already the list of all FSM isoforms, what does this column tell us?

Thanks
Iman

No output from match_cell_barcode

Hi,

Thank you for sharing FLAMES!

I think I have successfully run the match_cell_barcode as I got the information pasted below.

However, I didn't get any output, neither matched fastq nor barcode statistic. Can you please comment on that?

Many thanks!
Yanming

Missing mitochondrial transcripts in isoform_annotated.gff3

Hi,

first, thanks a lot for developing FLAMES!

I have one question about the configuration parameters and a problem regarding some missing genes/transcripts in the final FLAMES output and would really appreciate some help.

i) First, I was wondering if there is any further explanation for the different isoform parameters that can be adapted in the config file? I have an idea about some of the parameters (MAX_DIST, MAX_TS_DIST, Min_sup_cnt, strand_specific) but I would really appreciate a bit more detail about how the others impact the isoform identification step.

ii) Moreover, I noticed that some of the chromosomes/regions I was providing in the gene annotation reference were not part of the final FLAMES output. I'm using a slightly adapted gtf and fasta file that doesn't only contain human genes but also some pathogens. However, even though reads map against those genes, not a single transcript isoform for those genes is written into the isoform_annotated.gff3 and transcript_assembly.fa. Also, no mitochondrial transcripts are detected.
I checked the number of reads mapping to those regions in the align2genome.bam with samtools idxstats align2genome.bam and at least for the mitochondrial genes, a lot of reads are mapping.

However, only those seqnames are included in the isoform_annotated.gff3:
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9', 'GL000191.1', 'GL000192.1', 'GL000194.1', 'GL000195.1', 'GL000218.1', 'GL000219.1', 'GL000223.1', 'X', 'Y']

Are they filtered out due to the parameters specified in the configuration or is something else happening here? It would be great to have information about those genes and transcripts as well.

Thanks a lot!

Best,
Kristin

Using transcript_count.csv.gz matrix with popular analysis tools

I'm struggling to convert transcript_count.csv.gz matrix to a Seurat or AnnData object? Any help and advice would be appreciated.

Transcript information

Hi,

Thank you for developing this nice tool.
I'm applying BLAZE and FLAMES to my single-cell ONT data.
I've gotten useful output, but I need genomic coordinates for each transcript to compare the transcript structures.
However, some of the transcripts in "transcript_count.csv.gz" do not exist in "isoform_annotated.gff3" or "isoform_annotated.filtered.gff3". How can I find the information for these transcripts?

Thank you!

Facing error with minimap2 when running single cell long pipeline.py

I am running the sc_long_pipeline.py as a rule in my snakemake workflow. The following is a command that I am using

    applications/flames/python/sc_long_pipeline.py \
    -a {params.annotation_gff3} \
    -i {input.flames_cb_matched} \
    -o {output.flames_isoform_dir} \
    -f {params.reference} \
    -c {params.config_file} \
    # -m {params.minimap_dir}

Log File Output
Activating conda environment: .snakemake/conda/3731e9db73179dc68eabd2b463b92ab3_
Use config file: reference/config_default.json
Parameters in configuration file:
comment : this is the default config for nanopore single cell long read data using 10X RNA-seq kit. use splice annotation in alignment.
pipeline_parameters
do_genome_alignment : True
do_isoform_identification : True
do_read_realignment : True
do_transcript_quantification : True
global_parameters
generate_raw_isoform : False
has_UMI : True
isoform_parameters
MAX_DIST : 10
MAX_TS_DIST : 120
MAX_SPLICE_MATCH_DIST : 10
min_fl_exon_len : 40
Max_site_per_splice : 3
Min_sup_cnt : 5
Min_cnt_pct : 0.001
Min_sup_pct : 0.2
strand_specific : 0
remove_incomp_reads : 4
random_seed : 666666
alignment_parameters
use_junctions : True
no_flank : False
seed : 2022
realign_parameters
use_annotation : True
transcript_counting
min_tr_coverage : 0.4
min_read_coverage : 0.4
output directory not exist, create one:
results/flames/t1_shCTRL/isoform
Input parameters:
gene annotation: reference/gencode.vM10.annotation.gff3
genome fasta: reference/GRCm38.p4.genome.fa
input fastq: results/flames/t1_shCTRL/t1_shCTRL_cb_matched.fastq.gz
output directory: results/flames/t1_shCTRL/isoform
directory contains minimap2:

align reads to genome using minimap2 2024-07-16 17:03:24

b''
Traceback (most recent call last):
  File "applications/flames/python/sc_long_pipeline.py", line 240, in <module>
    sc_long_pipeline(args)
  File "applications/flames/python/sc_long_pipeline.py", line 168, in sc_long_pipeline
    seed=config_dict["alignment_parameters"]["seed"])
  File "/condo/brannanlab/tmhaxs421/STAMP/Luiz-LongRead/applications/flames/python/minimap2_align.py", line 41, in minimap2_align
    shell=True, stderr=subprocess.STDOUT))
  File "/condo/brannanlab/tmhaxs421/STAMP/Luiz-LongRead/.snakemake/conda/3731e9db73179dc68eabd2b463b92ab3_/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/condo/brannanlab/tmhaxs421/STAMP/Luiz-LongRead/.snakemake/conda/3731e9db73179dc68eabd2b463b92ab3_/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['minimap2 -ax splice -t 12 --junc-bed results/flames/t1_shCTRL/isoform/tmp.splice_anno.bed12 --junc-bonus 1 -k14 --secondary=no --seed 2022 reference/GRCm38.p4.genome.fa results/flames/t1_shCTRL/t1_shCTRL_cb_matched.fastq.gz | samtools view -bS -@ 4 -m 2G -o results/flames/t1_shCTRL/isoform/tmp.align.bam - ']' returned non-zero exit status 127.

Can someone tell me why I am facing this error? I have tried a lot of alternatives. None seem to work.
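Exit status 127 is the shell's conventional "command not found" code, so one plausible cause is that minimap2 or samtools is not on PATH inside the environment that Snakemake activates. The Python sketch below checks for this using the standard library; the helper name missing_tools is hypothetical, and the check would need to run inside the same conda environment as the pipeline.

```python
import shutil


def missing_tools(tools=("minimap2", "samtools"), path=None):
    """Return the names in tools that cannot be found on PATH.

    path is passed to shutil.which; None means the current PATH.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


not_found = missing_tools()
if not_found:
    print("not on PATH:", ", ".join(not_found))
else:
    print("minimap2 and samtools were both found")
```

If a tool is reported as missing, installing it into that environment or passing the directory that contains it with -m / --minimap2_dir are the options suggested by the help text above.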

How to know how many reads are assigned to the barcode ?

By setting the edit distance and with the barcode list, some of the reads should be removed from consideration. How can I know the number of reads that are assigned to the barcodes in flames output?

Run FLAMES directly from an aligned bam file.

Could we run FLAMES directly from a bam file which is generated by other demultiplex tool (i.e. Nanopore/sockeye)? Actually, I have tried once, but failed. It seems that fastq file is required in realign step. Could you please give me some advice if we have no short-read sequencing data but want to use FLAMES for isoform analysis? Thanks so much!

sc_long_pipeline.py--> ValueError: invalid contig `chr1`

Hi, both my genome.fa and gff3 files use contig chr1. Is there support for this format or parameters I can set to solve this error?

Traceback (most recent call last):
  File "PATH/TO/FLAMES/python/sc_long_pipeline.py", line 240, in <module>
    sc_long_pipeline(args)
  File "PATH/TO/FLAMES/python/sc_long_pipeline.py", line 193, in sc_long_pipeline
    raw_gff3=raw_splice_isoform if config_dict["global_parameters"]["generate_raw_isoform"] else None)
  File "PATH/TO/FLAMES/python/sc_longread.py", line 1123, in group_bam2isoform
    it_region = bamfile.fetch(ch, bl.s, bl.e)
  File "pysam/libcalignmentfile.pyx", line 1081, in pysam.libcalignmentfile.AlignmentFile.fetch
  File "pysam/libchtslib.pyx", line 686, in pysam.libchtslib.HTSFile.parse_region
ValueError: invalid contig `chr1`

Error during re-alignment

Hello, FLAMES aligns my reads to the reference genome but during realignment I get this error:

### skip aligning reads to genome 2023-12-07 15:57:35
### read gene annotation 2023-12-07 15:57:35
remove similar transcripts in gene annotation: Counter({'duplicated_transcripts': 765})
### find isoforms 2023-12-07 15:59:27
Traceback (most recent call last):
  File "/users/sparthib/flames/python/bulk_long_pipeline.py", line 270, in <module>
    bulk_long_pipeline(args)
  File "/users/sparthib/flames/python/bulk_long_pipeline.py", line 202, in bulk_long_pipeline
    group_bam2isoform(genome_bam, isoform_gff3, tss_tes_stat, "", chr_to_blocks, gene_dict, transcript_to_junctions, transcript_dict, args.genomefa,
  File "/users/sparthib/flames/python/sc_longread.py", line 1115, in group_bam2isoform
    for c in get_fa(fa_f):
  File "/users/sparthib/flames/python/sc_longread.py", line 45, in get_fa
    for line in open(fn):
  File "/users/sparthib/.conda/envs/FLAMES/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I cloned the flames package from github and this is my environment info:


#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                hd590300_5    conda-forge
c-ares                    1.23.0               hd590300_0    conda-forge
ca-certificates           2023.11.17           hbcca054_0    conda-forge
editdistance              0.6.2           py310hc6cd4ac_2    conda-forge
htslib                    1.18                 h81da01d_0    bioconda
k8                        0.2.5                hdcf5f25_4    bioconda
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.21.2               h659d440_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libblas                   3.9.0           20_linux64_openblas    conda-forge
libcblas                  3.9.0           20_linux64_openblas    conda-forge
libcurl                   8.4.0                hca28451_0    conda-forge
libdeflate                1.18                 h0b41bf4_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.2.0               h807b86a_3    conda-forge
libgfortran-ng            13.2.0               h69a702a_3    conda-forge
libgfortran5              13.2.0               ha4646dd_3    conda-forge
libgomp                   13.2.0               h807b86a_3    conda-forge
liblapack                 3.9.0           20_linux64_openblas    conda-forge
libnghttp2                1.58.0               h47da74e_0    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libopenblas               0.3.25          pthreads_h413a1c8_0    conda-forge
libsqlite                 3.44.2               h2797004_0    conda-forge
libssh2                   1.11.0               h0841786_0    conda-forge
libstdcxx-ng              13.2.0               h7e041cc_3    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
minimap2                  2.26                 he4a0461_2    bioconda
ncurses                   6.4                  h59595ed_2    conda-forge
numpy                     1.26.2          py310hb13e2d6_0    conda-forge
openssl                   3.2.0                hd590300_1    conda-forge
pip                       23.3.1             pyhd8ed1ab_0    conda-forge
pysam                     0.22.0          py310h41dec4a_0    bioconda
python                    3.10.13         hd12c33a_0_cpython    conda-forge
python_abi                3.10                    4_cp310    conda-forge
readline                  8.2                  h8228510_1    conda-forge
samtools                  1.18                 h50ea8bc_1    bioconda
setuptools                68.2.2             pyhd8ed1ab_0    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
tzdata                    2023c                h71feb2d_0    conda-forge
wheel                     0.42.0             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zlib                      1.2.13               hd590300_5    conda-forge
zstd                      1.5.5                hfc55251_0    conda-forge

Any pointers would be appreciated, thank you!
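
One observation on the traceback: byte 0x8b at position 1 matches the second byte of the gzip magic number (0x1f 0x8b), and get_fa opens the file with a plain open(). This suggests that the genome FASTA passed to the pipeline may be gzip-compressed. A minimal way to test a file and read it either way, with hypothetical helper names:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if is_gzipped(path):
        return gzip.open(path, "rt")
    return open(path, "r")
```

If `is_gzipped` returns True for the genome file, decompressing it before passing it to the pipeline is probably the simplest workaround, assuming the pipeline expects plain text as the traceback implies.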

versions for conda dependencies

The README states:

conda create -n FLAMES \
    "python>=3.7" samtools pysam minimap2 numpy editdistance \
    -c bioconda -c conda-forge
git clone https://github.com/LuyiTian/FLAMES.git

What are the supported versions for each of the listed conda dependencies?

Running match_cell_barcode gives no error and no result. The command was: match_cell_barcode /data_RAGE_seq/data1 cell_barcode_stat.txt split_barcode.fastq flame_3M-february-2018.txt 2. The output split_barcode.fastq is empty and no other file is generated.

Where is your whitelist?

I cannot find filtered_feature_bc_matrix/barcodes.tsv.gz in your script files.

single cell full length RNA-seq mutation detection

hi~

I remember that you used FLAMES to detect mutations and plot them on a UMAP, and I was very impressed by this part. However, I notice that mutation detection is not included in the sc_long_pipeline.py pipeline or the config file, although there is a Python script named "bam_mutation.py". I do not know how to use this script. Could you provide a tutorial on how you did this?

thanks

garfield
2021 12 29

Gene name instead of gene_ID

Hello,

Thanks for developing the tool.
I was wondering if there is a way to get the gene name in the output matrix instead of the transcript_ID or the gene_ID?
It would be more convenient for downstream analysis to have the correspondence gene_ID == gene_name.

Thanks for your help.
Rania
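
Until the pipeline offers this, the correspondence can be built from the annotation itself. The sketch below is not part of FLAMES. It assumes a GTF-style attribute column (`key "value";` pairs) that carries both `gene_id` and `gene_name`; GFF3-style `key=value` attributes are not handled.

```python
import re

# Matches GTF attribute pairs such as: gene_id "ENSG00000000001";
_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def gene_id_to_name(gtf_lines):
    """Map gene_id to gene_name using the ninth column of GTF lines."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        attributes = dict(_ATTRIBUTE.findall(columns[8]))
        gene_id = attributes.get("gene_id")
        gene_name = attributes.get("gene_name")
        if gene_id and gene_name:
            # Keep the first name seen for each gene_id.
            mapping.setdefault(gene_id, gene_name)
    return mapping
```

Calling `gene_id_to_name(open("annotation.gtf"))` returns a dictionary that can be used to relabel the gene_ID column of a count table in downstream analysis.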

What's the difference

hi ~
What's the difference between barcode hm match and barcode match?
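
For context, the README describes two matching stages: barcodes are first matched by Hamming distance (reported as the "barcode hm match" count), and the remaining reads are matched by Levenshtein distance, which also allows indels (reported as the fuzzy match count). The sketch below only illustrates how the two distances differ. It is plain Python and is not the C++ code used by match_cell_barcode.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


if __name__ == "__main__":
    barcode = "ACGTACGT"
    shifted = "CGTACGTA"  # first base lost, one extra base at the end
    print(hamming(barcode, shifted))      # 8: every position mismatches
    print(levenshtein(barcode, shifted))  # 2: one deletion plus one insertion
```

A single indel shifts every later base, so such a read fails a Hamming comparison even though it is only two edits away from the barcode.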

FLAMES vs FLAIR

I'm evaluating FLAMES and FLAIR for my project. Can you comment on the conceptual or algorithmic differences between the two packages? For example, what aspect of FLAMES leads to its increased accuracy in benchmarking?

minimap error using sc_long_pipeline.py

Hi @LuyiTian ,
Could you please comment on my error below?

Running code:

for i in test; do /FLAMES/python/sc_long_pipeline.py --gff3 hg38v99.Cellranger.genes.gtf --infq $i.demultiplexed.fq.gz --outdir FLAMES_Output/$i --genomefa hg38v99.Cellranger.genome.fa --config_file /FLAMES/config_sclr_nanopore_default.json --minimap2_dir /Software/anaconda_py2/bin/  >$i.log 2>&1 & done

Error:

Use config file: config_sclr_nanopore_default.json
Parameters in configuration file:
comment : this is the default config for nanopore single cell long read data using 10X RNA-seq kit. use splice annotation in alignment.
global_parameters
	has_UMI : True
	generate_raw_isoform : False
isoform_parameters
	Min_sup_pct : 0.2
	MAX_SPLICE_MATCH_DIST : 10
	random_seed : 666666
	Min_cnt_pct : 0.001
	MAX_DIST : 10
	Min_sup_cnt : 5
	MAX_TS_DIST : 120
	Max_site_per_splice : 3
	strand_specific : -1
	remove_incomp_reads : 4
	min_fl_exon_len : 40
pipeline_parameters
	do_transcript_quantification : True
	do_read_realignment : True
	do_genome_alignment : True
	do_isoform_identification : True
transcript_counting
	min_tr_coverage : 0.4
	min_read_coverage : 0.4
realign_parameters
	use_annotation : True
alignment_parameters
	no_flank : False
	use_junctions : True
output directory not exist, create one:
FLAMES_Output/test
Input parameters:
	gene annotation: hg38v99.Cellranger.genes.gtf
	genome fasta: hg38v99.Cellranger.genome.fa
	input fastq: test.demultiplexed.fq.gz
	output directory: FLAMES_Output/test
	directory contains minimap2: /Software/anaconda_py2/bin/
### align reads to genome using minimap2 2021-01-30 12:48:05

Traceback (most recent call last):
  File "/FLAMES/python/sc_long_pipeline.py", line 213, in <module>
    sc_long_pipeline(args)
  File "/FLAMES/python/sc_long_pipeline.py", line 159, in sc_long_pipeline
    minimap2_align(args.minimap2_dir, args.genomefa, args.infq, tmp_bam, no_flank=config_dict["alignment_parameters"]["no_flank"], bed12_junc=tmp_bed if config_dict["alignment_parameters"]["use_junctions"] else None)
  File "/FLAMES/python/minimap2_align.py", line 37, in minimap2_align
    print subprocess.check_output([align_cmd], shell=True, stderr=subprocess.STDOUT)
  File "/Software/anaconda_py2/lib/python2.7/subprocess.py", line 223, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['/Software/anaconda_py2/bin/minimap2 -ax splice -t 12 --junc-bed FLAMES_Output/test/tmp.splice_anno.bed12 --junc-bonus 1  -k14 --secondary=no hg38v99.Cellranger.genome.fa test.demultiplexed.fq.gz | samtools view -bS -@ 4 -m 2G -o FLAMES_Output/test/tmp.align.bam -  ']' returned non-zero exit status 1
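
The exception message only shows the exit status. The text that minimap2 and samtools printed is captured by check_output and stored on the exception's `output` attribute, so rerunning the command by hand and printing that text usually reveals the underlying problem. A minimal Python 3 sketch, where `run_and_report` is a hypothetical helper and the command string is whatever appears in the error message:

```python
import subprocess


def run_and_report(command):
    """Run a shell command and return (exit_status, combined stdout/stderr text)."""
    try:
        output = subprocess.check_output(
            command, shell=True, stderr=subprocess.STDOUT)
        return 0, output.decode(errors="replace")
    except subprocess.CalledProcessError as error:
        # error.output holds everything the command printed before it failed.
        return error.returncode, error.output.decode(errors="replace")


if __name__ == "__main__":
    # Placeholder command; paste the failing minimap2 | samtools line here.
    status, text = run_and_report("echo replace me with the failing command")
    print("exit status:", status)
    print(text)
```

Because the status of a shell pipeline is normally that of its last command, a status of 1 here most likely comes from the samtools step, and the printed text should say why.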

bam_mutations.py

Hi,

Thank you for this tool.

I would like to know if we can run only the mutation analysis, without the full-length transcriptome splicing analysis. I already have a mapped BAM file and a barcode file.

Thanks

Compute resource allocation?

Is there a way to define number of cores, RAM usage, etc. for the pipelines?

fsm_splice_comp.csv

Hi,
I am trying to use the tr_classify analysis script from the FLTseq_data directory, and I am wondering how fsm_splice_comp.csv is created. I have run sc_long_pipeline, but this file is not in the output.

(Line 43)

fsm_splice_comp <- read.csv(file.path(data_dir,"fsm_splice_comp.csv"), header=FALSE, stringsAsFactors=FALSE)
Thank you

Typo in parse_realigned_bam?

On line 88, read_dict[r][0] will be assigned with (tr, rec.get_tag("AS"), tr_cov, float(rec.query_alignment_length)/rec.infer_read_length(), rec.mapping_quality), contradicting the comment on line 106 # transcript_id, pct_ref, pct_reads.

hit[1] > 0.8 was used on line 119, which would be evaluating alignment score > 0.8.

0.8 seems to be a very low threshold for alignment scores, did you mean to evaluate pct_ref > 0.8 (i.e. hit[2] > 0.8)?
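
To make the indexing concrete, here is a small sketch of the tuple layout described above. The field names are labels chosen for this illustration, not identifiers from the FLAMES source, and the example values are invented.

```python
from collections import namedtuple

# Layout of the tuple assigned on line 88, with illustrative field names.
Hit = namedtuple(
    "Hit", ["transcript_id", "alignment_score", "tr_cov", "read_cov", "mapq"])


def passes_as_written(hit):
    """The check as described on line 119: index 1 is the alignment score."""
    return hit[1] > 0.8


def passes_as_suggested(hit):
    """The check proposed in this issue: index 2 is the transcript coverage."""
    return hit[2] > 0.8


if __name__ == "__main__":
    # Invented values: a high alignment score but low transcript coverage.
    hit = Hit("tx1", 1500, 0.35, 0.9, 60)
    print(passes_as_written(hit))    # True: 1500 > 0.8
    print(passes_as_suggested(hit))  # False: 0.35 > 0.8 does not hold
```

With an integer alignment score, the check as written passes for almost any aligned read, which is why the threshold looks like it was meant for a coverage fraction.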

UMI deduplication in pipeline output?

Hi, I'm just wondering whether the counts table generated by the pipeline already contains UMI-deduplicated counts. If not, how would I go about generating these from the FLAMES output?

In addition, I found that among the transcript IDs for my mouse samples (from pipeline_output/transcript_count.csv), quite a few start with ENSMUSG instead of ENSMUST. Am I correct in thinking that these are gene IDs rather than transcript IDs, and why would that be the case?

Source compilation error

Hi @LuyiTian , I am currently using FLAMES for single cell isoform identification and detection. Now I'm at the barcode assignment steps, where I ran the compilation code g++ -std=c++11 -lz -O2 -o match_cell_barcode ssw/ssw_cpp.cpp ssw/ssw.c match_cell_barcode.cpp kseq.h edit_dist.cpp as shown in the README, but I get the following error:

What is the problem?

Isoform parameters

Hi there,

Thank you so much for this amazing tool!

I am just wondering if it is possible to get a more in-depth explanation of each parameter for the config file e.g. for isoform parameters?

Thank you

Flanking sequence match

When FLAMES matches the flanking sequence CTACACGACGCTCTTCCGATCT, does it allow matching within an edit distance, or does it have to be an exact match?

match_cell_barcode - output cell barcode statistics file

Hi, I am using match_cell_barcode for ONT single cell data. I obtained a "whitelist.csv" and "putative_bc.csv" file from the output of BLAZE, at the same time I also have short-read sequencing data on the same library.

However, I am confused about which file should be used for the 2nd argument of match_cell_barcode, that is, the "output cell barcode statistics file", or as explained in the README, "a file name/path for the statistics of barcode matching".

Can you please help me understand this file? What should it look like (which headers?) and how can I get it?

Thanks in advance!