hemberg-lab / microexonator


Snakemake pipeline for microexon discovery and quantification

Python 95.18% R 4.82%

microexonator's Issues

Error in job Output

Hi! While running MicroExonator, I got the following errors at a later stage of the pipeline:

============================================
$ ~/anaconda3/envs/snakemake_env/bin/snakemake -s MicroExonator.smk --use-conda -k -j 10 --rerun-incomplete
Provided cores: 10
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 Output
1 high_confident_filters
1 quant
3

rule Output:
input: Round2/TOTAL.ME_centric.txt, Round2/TOTAL.sample_cov_filter.txt, Round2/TOTAL.ME_centric.ME_matches.txt
output: Report/out_filtered_ME.txt, Report/out_low_scored_ME.txt, Report/out_shorter_than_3_ME.txt
log: logs/Output.log
jobid: 4

Error in job Output while creating output files Report/out_filtered_ME.txt, Report/out_low_scored_ME.txt, Report/out_shorter_than_3_ME.txt.
RuleException:
CalledProcessError in line 159 of /lustre7/home/tsuna8984/MicroExonator/rules/Round2_post_processing.smk:
Command 'Rscript /lustre7/home/tsuna8984/MicroExonator/src/.snakemake.7ax85rg2.final_filters3.R' returned non-zero exit status 1.
File "/lustre7/home/tsuna8984/MicroExonator/rules/Round2_post_processing.smk", line 159, in __rule_Output
File "/home/tsuna8984/anaconda3/envs/snakemake_env/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message

I confirmed that the input files actually exist in the Round2 directory, and I did not modify final_filters3.R. Do you have any idea how to solve this?
Thank you.

Bug again... with Get_data.smk

Dear developer

There are again a lot of issues that I can't solve...

Here is the error message I've got:
snakemake -s MicroExonator.smk --cluster-config cluster.json --cluster "sbatch --cpus-per-task {cluster.cpus} -p {cluster.p} --output {cluster.output} --error {cluster.error} --mem {cluster.mem}" --use-conda -k -j 50 -np
Building DAG of jobs...
MissingInputException in line 16 of /home/hyojung/srcGit/MicroExonator2/MicroExonator/rules/Get_data.smk:
Missing input files for rule download_fastq:
download/round2/Murine_cortex:CamK_D_1.download.sh

I tried a dry run of MicroExonator before processing the long list of local files, but something is wrong.
I also tried different versions of config.yaml, for example with Optimize_hard_drive set to T and to F.

Thank you.

Filtering: Possible bug and a question

Hi,

(1) I think the paper Methods and the actual code disagree in calculating the Ms score. In the paper the text reads

... we calculate a score, M_s, for each putative microexon as M_s = 1 − (1 − P_s · P_U2)/n, where P_U2 is the probability that the observed U2 score came from the Gaussian with the higher mean and n is the number of matches for a given intron.

The corresponding calculation in the code seems to be (from https://github.com/hemberg-lab/MicroExonator/blob/master/src/final_filters3.R, lines 85-95)

fit_U2_score <- normalmixEM(ME_matches_filter$U2_score, maxit = 10000, epsilon = 1e-05)
#ggplot_mix_comps(fit_U2_score, "Mixture model Micro-exon >=3 after coverge filter")
post.df <- as.data.frame(cbind(x = fit_U2_score$x, fit_U2_score$posterior))

#ME_final <- ME_centric_raw[ME %in% uniq_seq_filter & len_micro_exon_seq_found>=3, ]
if(fit_U2_score$mu[1]<=fit_U2_score$mu[2]){
  ME_final$ME_P_value <-  1 - (1 - approx(post.df$x, post.df$comp.1, ME_final$U2_scores)$y * ME_final$P_MEs) / ME_final$total_number_of_micro_exons_matches
} else {
  ME_final$ME_P_value <-  1 - (1 - approx(post.df$x, post.df$comp.2, ME_final$U2_scores)$y * ME_final$P_MEs)/ ME_final$total_number_of_micro_exons_matches
}

So if ME_final$ME_P_value is to be identified with M_s, the code calculates 1 − (1 − P_s · (1 − P_U2))/n and not the 1 − (1 − P_s · P_U2)/n defined in the paper, because it takes the posterior probability of the LOWER-mean component, not the HIGHER-mean one.

Or should M_s actually be 1 − ME_final$ME_P_value? In that case it will not match the paper either, I think ...

(2) More generally, I am not sure I understand the logic behind the expression for the M_s score in the paper. High scores are likely to indicate true microexons, and by definition M_s = 1 − (1 − P_s · P_U2)/n = 1 − 1/n + P_s · P_U2/n. Wouldn't you want M_s to decrease with increasing P_s? The formula does the opposite. Also, if I understood correctly, n is the actual number of exact microexon-plus-splice-site matches in the given intron sequence. I would expect M_s to be higher for lower n (n > 0), i.e., to decrease as n increases, but the formula does the opposite again. Shouldn't it be something like (1 − P_s) · P_U2/n? Or, if you take it as 1 − ME_final$ME_P_value from the code, it will be [1 − (1 − P_U2) · P_s]/n, which would also make sense to me. Am I missing the point completely?
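
A quick numerical check can make the comparison in (1) and (2) concrete. The sketch below is only an illustration, assuming the two expressions exactly as they are written in this issue; the function names `ms_paper` and `ms_code` are hypothetical and are not part of MicroExonator.

```python
def ms_paper(p_s, p_u2, n):
    """Score as written in the paper: 1 - (1 - P_s * P_U2) / n."""
    return 1 - (1 - p_s * p_u2) / n


def ms_code(p_s, p_u2, n):
    """Score as this issue reads final_filters3.R: the posterior of the
    lower-mean component, i.e. (1 - P_U2), takes the place of P_U2."""
    return 1 - (1 - p_s * (1 - p_u2)) / n


if __name__ == "__main__":
    # Arbitrary illustrative values, not taken from any real dataset.
    p_s, p_u2 = 0.5, 0.8
    for n in (1, 2, 4):
        print(n, ms_paper(p_s, p_u2, n), ms_code(p_s, p_u2, n))
```

With these definitions the paper expression grows with both P_s and n, which is the behaviour questioned in (2), and the two expressions coincide only when P_U2 = 0.5.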

Thank you

Error in rule hisat2_to_Genome

Hi there,

I tried to use MicroExonator to analyze my dataset. The dry run with -np finished successfully. However, the errors shown below occurred when I ran the actual analysis with "snakemake -s MicroExonator.smk --use-conda -k -c1". Could you help me figure out what the problem is? Thanks in advance.
Error log:
[Tue Oct 12 18:54:15 2021]
rule hisat2_to_Genome:
input: Round1/SRR10316370_1_val_1.sam.row_ME.fastq, data/Genome.1.ht2
output: Round1/SRR10316370_1_val_1.sam.row_ME.Genome.Aligned.out.sam
jobid: 23
wildcards: sample=SRR10316370_1_val_1
resources: tmpdir=/tmp

Activating conda environment: /scratch/users/[email protected]/MicroExonator/MicroExonator/.snakemake/conda/b2cf3cb0bb8978aae83afeb8d09ff3fa
[Tue Oct 12 18:54:20 2021]
Error in rule hisat2_to_Genome:
jobid: 23
output: Round1/SRR10316370_1_val_1.sam.row_ME.Genome.Aligned.out.sam
conda-env: /scratch/users/[email protected]/MicroExonator/MicroExonator/.snakemake/conda/b2cf3cb0bb8978aae83afeb8d09ff3fa
shell:
hisat2 -x data/Genome -U Round1/SRR10316370_1_val_1.sam.row_ME.fastq > Round1/SRR10316370_1_val_1.sam.row_ME.Genome.Aligned.out.sam
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job hisat2_to_Genome since they might be corrupted:
Round1/SRR10316370_1_val_1.sam.row_ME.Genome.Aligned.out.sam
Job failed, going on with independent jobs.
Select jobs to execute...

Best regards,

Mingming
Errors.txt

Cannot process two consecutive (annotated) microexons

It looks like when two annotated microexons are next to each other, the splice junctions are not created and the microexons are ignored (and not reported anywhere in the final output, even if they have support).

Here is an example snapshot of my analysis. I took the junction names from the fasta headers in the Round1 and Round2 directories. Here we have two 9-nt exons one after the other, and they are lost.

"missing" means missing splice junction to the next exon (>=30bp), not missing exon

The 3 columns are annotated exons from gtf (with their length), splice junctions constructed in Round1, and splice junctions constructed in Round2

ME_filter1.py is not working

Hello,
I am trying to run MicroExonator on paired-end RNA-seq data, and I can't resolve an issue with ME_filter1.py.
I did everything as described in the manual. My config.yaml file:
Genome_fasta : /data/resources/mouse/genome/GRCm38.p6.genome.fa
Gene_anontation_bed12 : /data/resources/mouse/genome/mm10_UCSC_knownGene.bed
GT_AG_U2_5 : /data/MicroExonator/PWM/Mouse/mm10_GT_AG_U2_5.good.matrix
GT_AG_U2_3 : /data/MicroExonator/PWM/Mouse/mm10_GT_AG_U2_3.good.matrix
conservation_bigwig : /data/resources/mouse/PhastCons/mm10.60way.phastCons.bw
working_directory : /data/MicroExonator
ME_len : 30
Optimize_hard_drive : T
min_number_files_detected : 3
paired_samples : /data/MicroExonator/paired_samples.txt
Then I started with
snakemake -s MicroExonator.skm --use-conda -k -j 32
Which led to more or less the same error for every single input file:
Error in rule Round1_filter:
jobid: 159
RuleException:
CalledProcessError in line 56 of /data/MicroExonator/rules/Round1_post_processing.skm:
Command 'source activate /data/MicroExonator/.snakemake/conda/f2d123d5; set -euo pipefail; python3 src/ME_filter1.py /data/resources/mouse/genome/GRCm38.p6.genome.fa Round1/Scnn_3_R2.sam.row_ME Round1/Scnn_3_R2.sam.row_ME.Genome.Aligned.out.sam data/GT_AG_U2_5.pwm data/GT_AG_U2_3.pwm /data/resources/mouse/PhastCons/mm10.60way.phastCons.bw 30 > Round1/Scnn_3_R2.sam.row_ME.filter1 ' returned non-zero exit status 1.
File "/data/MicroExonator/rules/Round1_post_processing.skm", line 56, in __rule_Round1_filter
File "/home/stephan/anaconda3/envs/snakemake_env/lib/python3.6/concurrent/futures/thread.py", line 56, in run
output: Round1/Scnn_1_R2.sam.row_ME.filter1
Removing output files of failed job Round1_filter since they might be corrupted:
Round1/Scnn_3_R2.sam.row_ME.filter1
conda-env: /data/MicroExonator/.snakemake/conda/f2d123d5
Job failed, going on with independent jobs.

I figured out that I had not installed all the dependencies, since they weren't listed in the installation instructions. I installed them manually by following the import statements in ME_filter1.py. Since Biopython no longer supports Python 2, I switched everything to Python 3 and removed the non-working print statements from the script.
Now I can get a few lines of output by calling ME_filter1.py manually, but at some point the script crashes:
python3 src/ME_filter1.py /data/resources/mouse/genome/GRCm38.p6.genome.fa Round1/SST_1_R1.sam.row_ME Round1/SST_1_R1.sam.row_ME.Genome.Aligned.out.sam data/GT_AG_U2_5.pwm data/GT_AG_U2_3.pwm /data/resources/mouse/PhastCons/mm10.60way.phastCons.bw 30
Working output:
D00535:22:CBLNNANXX:2:2202:18520:19601 CGCCAGCCAGAGCAGGCCCGCCGGCCCCTCAGTGTTGCCACAGACAACATGATGCTGGAGTTTTACAAGAAGGATGGCCTTAGGAAAATCCAAAGCATGGG GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGBBB@? chr11:65012098-65021940|uc007jku.1|100_100|76M18I7M 101 chr11|-|65012906|6S18M9016N77M 95 True 18 CCTTAGGAAAATCCAAAG 1 78.9604235156872 1.0 chr11_-_65012906_65012924 78.9604235156872 1.0 [...]
and then the error:
Traceback (most recent call last):
  File "src/ME_filter1.py", line 372, in <module>
    main(sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5], sys.argv[6], int(sys.argv[7]))
  File "src/ME_filter1.py", line 364, in main
    print(read, seq, qual, tag_alingment, t_score, genome_alingment, g_score, same_ME, len(DR_corrected_micro_exon_seq_found), DR_corrected_micro_exon_seq_found, len(micro_exons), max(U2_scores), max(TOTAL_mean_conservation), micro_exons_coords, ",".join(map(str, U2_scores)), ",".join(map(str, TOTAL_mean_conservation)))
TypeError: '>' not supported between instances of 'NoneType' and 'float'
How can I fix this? Can you please provide a full list of needed dependencies?
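
The traceback points at the `max(U2_scores)` and `max(TOTAL_mean_conservation)` calls in the print statement. One possible cause (an assumption, not checked against the script itself) is that one of these lists contains `None`, which Python 2 tolerated in comparisons and Python 3 rejects. A minimal sketch of a None-tolerant maximum, using the hypothetical helper name `safe_max`:

```python
def safe_max(values, default=None):
    """Return the largest non-None value, or `default` if there is none.

    Hypothetical helper for illustration; it is not part of ME_filter1.py.
    """
    present = [v for v in values if v is not None]
    return max(present) if present else default


if __name__ == "__main__":
    scores = [78.96, None, 12.5]  # made-up values with a missing entry
    try:
        max(scores)  # Python 3 cannot order None against a float
    except TypeError as err:
        print("plain max() failed:", err)
    print("safe_max():", safe_max(scores))
```

Whether skipping missing values is the right behaviour for the filter is a separate question; the sketch only shows how the comparison error itself can be avoided.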

example of local_samples.tsv

I can't find an example of local_samples.tsv. Could you provide one?

Running microexonator with list of computing nodes

Hi
Thank you for developing this great pipeline. I have a simple question about running MicroExonator on our computing infrastructure.
When snakemake runs MicroExonator jobs, is it possible to pass an explicit HOSTFILE list (e.g. node01, node02, node03, node04) instead of submitting each job with qsub?

That is, after allocating a large block of computational resources with qsub, we want to run MicroExonator by passing it the HOSTFILE list.

Regards,
Hanna

Very long time to create snakemake environment?

Hi there,

It has taken almost one week to create the environment to run snakemake, and it has still not finished. Is that normal, or is something wrong?
Thanks.

Best regards,

Ming

missing input file in Get_data.smk

Hi, I'm trying to use this code for my research.
It looks very promising.

Now, I'm trying to use MicroExonator for the downloaded local files.
But it spits out an error and I can't figure it out.

Here is my config.yaml
$ cat config.yaml
Genome_fasta : /blues/ngs//UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
Gene_anontation_bed12 : /home/MicroExonator2/MicroExonator/gencode_vm25.bed12
GT_AG_U2_5 : /home/MicroExonator2/MicroExonator/PWM/Mouse/mm10_GT_AG_U2_5.good.matrix
GT_AG_U2_3 : /home//MicroExonator2/MicroExonator/PWM/Mouse/mm10_GT_AG_U2_3.good.matrix
conservation_bigwig : /home//MicroExonator/mm10.60way.phastCons.bw
working_directory : /home//MicroExonator2/MicroExonator
ME_len : 30
Optimize_hard_drive : T
min_number_files_detected : 2
paired_samples: /home/~~/MicroExonator2/MicroExonator/local_samplePairs.tsv

My input file for local_samples.tsv is here:
path sample
/blues/ngs/data/~~~/SRR9623780_1.fastq.gz Murine_cortex:SRP212708A1
/blues/ngs/data/~~~/SRR9623780_2.fastq.gz Murine_cortex:SRP212708A2
/blues/ngs/data/~~~/SRR9623802_1.fastq.gz Murine_cortex:SRP212708B1
/blues/ngs/data/~~~/SRR9623802_2.fastq.gz Murine_cortex:SRP212708B2
/blues/ngs/data/~~~/SRR9623803_1.fastq.gz Murine_cortex:SRP212708C1
/blues/ngs/data/~~~/SRR9623803_2.fastq.gz Murine_cortex:SRP212708C2

My file for the paired_samples is here:
Murine_cortex:SRP212708A1 Murine_cortex:SRP212708A2
Murine_cortex:SRP212708B1 Murine_cortex:SRP212708B2
Murine_cortex:SRP212708C1 Murine_cortex:SRP212708C2

When I run the code like this:
$ snakemake -s MicroExonator.smk --cluster-config cluster.json --cluster "sbatch --cpus-per-task {cluster.cpus} -p {cluster.p} --output {cluster.output} --error {cluster.error} --mem {cluster.mem}" --use-conda -k -j 1

I only get this error message:
Building DAG of jobs...
MissingInputException in line 16 of /home/hyojung/srcGit/MicroExonator2/MicroExonator/rules/Get_data.smk:
Missing input files for rule download_fastq:
download/round2/Murine_cortex:SRP212708A2.download.sh

In the download folder there is no sub-folder named 'round2', but "Murine_cortex:SRP212708A2.download.sh" does exist there.
I can't understand why this happens.

Thank you.

Regards,
Hanna

scalability of the MicroExonator with single-cell RNA seq data

Dear developer

With your support I solved the bug from my previous issue. Thank you again.
Based on the MicroExonator results, I'm now trying to develop my hypothesis using single-cell RNA-seq (scRNA-seq) data from a public database.

Here, I have two questions.

(1) Does MicroExonator only support Smart-seq2 or MARS-seq data?
10X data doesn't have *.fastq files for each cell, so I guess data from the 10X platform can't be used to generate whippet_delta.yaml, which is needed for the scRNA-seq analysis.

(2) Can you share the running time for the analysis of the 1657 cells from your published paper?
We have several clusters with different numbers of cores/nodes.
In our bulk RNA-seq analysis, we analyzed over 100 samples and it took a bit of time using 64 cores.
Now we will analyze over 2000 cells and fastq files, so knowing the running time of your previous analysis at that scale would help us decide which cluster to use.

I would be glad to hear from you.

Thank you.
Regards,
Hanna

Single Cell

Hi @geparada

Thanks for this wonderful tool. I am trying to use MicroExonator on my long-read single-cell RNA-seq data, and I have a few very basic questions. Do I need to generate a pseudo-bulk file for each cell type? Do I need to run the Discovery and Quantification steps first, or can I jump directly to the Single Cell Analysis step? Finally, do we need to provide a fastq file for each cell type, or can we also provide bam/sam files?

Output rule Round2 Error: object 'min_number_files_detected' not found (easy fix)

Hi,

Just reporting a small error that occurs with the Output rule (round2) when generating the html report. Received the following error message:

Quitting from lines 196-214 (final_filters2.Rmd)
Error in FUN(X[[i]], ...) : object 'min_number_files_detected' not found

Fortunately this was a simple fix. In src/final_filters2.Rmd, 'min_number_files_detected' is not defined from params in chunk 3:


ME_table=params$ME_table
ME_coverage=params$ME_coverage
ME_matches_file=params$ME_matches_file

out_filtered_ME=params$out_filtered_ME
out_low_scored_ME=params$out_low_scored_ME
out_shorter_than_3_ME=params$out_shorter_than_3_ME


out_filtered_ME_cov = params$out_filtered_ME_cov

Defining the variable in chunk 3 eliminates the error message and the rest of the script runs without other errors.
min_number_files_detected=params$min_number_files_detected

Thanks,
Sam

error 'min_number_files_detected'

Hi,

I'm excited to be using this new tool!

I've made a config.yaml file:

Genome_fasta : ~/MicroExonator/c_elegans.PRJNA13758.WS276.genomic.fa
Gene_anontation_bed12 : ~/MicroExonator/Caenorhabditis_elegans.WBcel235.bed12
GT_AG_U2_5 : NA
GT_AG_U2_3 : NA
conservation_bigwig : ~/MicroExonator/ce11.phyloP26way.bw
working_directory : ~/MicroExonator/
ME_DB : ~/MicroExonator/Cel
ME_len : 27
Optimize_hard_drive : T

And a design.tsv file

path	sample
/scratch/xx.fastq	xx
/scratch/yy.fastq	yy

And then did a dry-run:
snakemake -s MicroExonator.skm --cluster-config config.yaml --use-conda -k -np

But I'm getting the error:

KeyError in line 106 of /gpfs/fs1/home/MicroExonator/rules/Round2_post_processing.skm:
'min_number_files_detected'
File "/gpfs/fs1/MicroExonator/MicroExonator.skm", line 69, in
File "/gpfs/fs1/MicroExonator/rules/Round2_post_processing.skm", line 106, in

Could you let me know how to fix this?
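
The config.yaml shown above has no `min_number_files_detected` entry, which would be consistent with the KeyError. As a rough pre-flight check, the sketch below parses the simple `key : value` layout used in this tracker and reports missing keys. The function names and the `REQUIRED_KEYS` list are assumptions for illustration (the keys are taken from other config.yaml files posted here, not from an official specification).

```python
# Assumed list of keys, collected from config.yaml files posted in this tracker.
REQUIRED_KEYS = (
    "Genome_fasta",
    "Gene_anontation_bed12",
    "working_directory",
    "ME_len",
    "min_number_files_detected",
)


def parse_flat_config(text):
    """Parse 'key : value' lines; blank lines and '#' comments are skipped."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")  # split on the first colon only
        config[key.strip()] = value.strip()
    return config


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent from the parsed config."""
    return [key for key in required if key not in config]


if __name__ == "__main__":
    example = """
Genome_fasta : ~/MicroExonator/c_elegans.PRJNA13758.WS276.genomic.fa
Gene_anontation_bed12 : ~/MicroExonator/Caenorhabditis_elegans.WBcel235.bed12
working_directory : ~/MicroExonator/
ME_len : 27
Optimize_hard_drive : T
"""
    print(missing_keys(parse_flat_config(example)))
```

If the check reports `min_number_files_detected`, adding a line such as `min_number_files_detected : 3` (the value used in another config in this tracker) may be enough to get past the KeyError.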

Thanks

Missing config.yaml Reference Files in Documentation, Analyses

It seems the analysis strongly depends on the annotation files provided in config.yaml. In the readme, the origin of some of the annotation files is vague. I would appreciate clarification on the following:

From the README:
"ME_DB is a path to a known Microexon database such as VAST DB (is this optional or no?)"

As far as I can tell, VASTDB does not directly provide a bed12 of all microexons. Is it possible to link or provide the list of microexons? How does this file change the analysis? Is it actually optional?

Similarly for "Gene_anontation_bed12" the UCSC table browser link defaults to Gencode basic v19. How would using the GENCODE basic vs comprehensive annotations affect the analysis?

local_samples.tsv - specify multiple FASTQs (paired-end reads) in 'path' column?

Hi,

Thanks for providing the tool, looks very promising. I have a question about configuring the local_samples.tsv file when inputting local FASTQ files.

In your example local_samples.tsv file, there is just one FASTQ file specified for each sample. The samples I want to run have paired end reads in separate files. Can MicroExonator handle paired end reads in separate files, or do I need to provide interleaved FASTQ files?

I've had a peek at init.skm (lines 59-85), and by the looks of it only one FASTQ file can be provided per row, but I would really appreciate any clarification you can give.
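
For reference, another report in this tracker lists each mate on its own row of local_samples.tsv and pairs the two sample names in a separate paired_samples file. The sketch below generates both tables in that layout. Whether this is the intended way to declare paired-end data is not confirmed here; the function name, the `_R1`/`_R2` suffixes and the tab delimiter are assumptions.

```python
def build_sample_tables(pairs):
    """Build rows for local_samples.tsv and a paired_samples file.

    `pairs` is a list of (sample_name, mate1_path, mate2_path) tuples.
    Layout mirrors another report in this tracker; it is not an official spec.
    """
    local_rows = ["path\tsample"]
    paired_rows = []
    for name, mate1, mate2 in pairs:
        first, second = f"{name}_R1", f"{name}_R2"  # hypothetical suffixes
        local_rows.append(f"{mate1}\t{first}")
        local_rows.append(f"{mate2}\t{second}")
        paired_rows.append(f"{first}\t{second}")
    return local_rows, paired_rows


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    local, paired = build_sample_tables(
        [("sampleA", "/scratch/sampleA_1.fastq.gz", "/scratch/sampleA_2.fastq.gz")]
    )
    print("\n".join(local))
    print("\n".join(paired))
```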

Thanks for your time and all the best,
Sam

error "MissingInputException" and "Missing input files for rule GetPWM"

Hi,

I have a design.tsv file:

path sample
/scratch/xx.fastq xx
/scratch/yy.fastq yy

And then did a dry-run:
snakemake -s MicroExonator.skm --cluster-config config.yaml --use-conda -k -np

But I'm getting the error:

Building DAG of jobs...
MissingInputException in line 42 of /gpfs/fs1/home/j/jcalarco/bsugumar/MicroExonator/rules/Get_data.skm:
Missing input files for rule GetPWM:
~/MicroExonator/Caenorhabditis_elegans.WBcel235.bed12
~/MicroExonator/c_elegans.PRJNA13758.WS276.genomic.fa
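
The missing inputs are reported with a literal `~` prefix. Tilde expansion is normally done by the shell, so one possible cause (an assumption, not confirmed for this pipeline) is that the `~` paths in config.yaml are taken literally and never resolved to the home directory. The sketch below shows how such values could be expanded to absolute paths before they are written into the config; the function name `expand_config_paths` is hypothetical.

```python
import os


def expand_config_paths(config):
    """Return a copy of `config` with '~'-prefixed string values expanded
    to absolute paths; every other value is left untouched."""
    expanded = {}
    for key, value in config.items():
        if isinstance(value, str) and value.startswith("~"):
            expanded[key] = os.path.abspath(os.path.expanduser(value))
        else:
            expanded[key] = value
    return expanded


if __name__ == "__main__":
    config = {
        "Genome_fasta": "~/MicroExonator/c_elegans.PRJNA13758.WS276.genomic.fa",
        "Gene_anontation_bed12": "~/MicroExonator/Caenorhabditis_elegans.WBcel235.bed12",
        "ME_len": 27,
    }
    for key, value in expand_config_paths(config).items():
        print(key, ":", value)
```

Writing the printed absolute paths into config.yaml by hand has the same effect.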

Thanks for your help!

Bina
