hku-bal / clairs-to

ClairS-TO - a deep-learning method for tumor-only somatic variant calling

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 0.47% Python 81.84% C++ 7.51% C 9.17% Shell 0.69% Makefile 0.33%

bioinformatics deep-learning genomics illumina long-read-sequencing long-reads nanopore ont pacbio snvs

clairs-to's Issues

Indexing error at phasing step

Hi,

Firstly, many thanks for this incredibly useful tool.

I am running ClairS-TO with the Singularity container on an ONT WGS run. The library was basecalled and aligned using dorado 0.6.2.

Command used to run

singularity exec -B ${input_dir},${ref_dir},${output_dir} --bind=/local:/local:rw /b06x-isilon/b06x-m/mnp_nanopore/software/clairs-to_latest.sif \
          /opt/bin/run_clairs_to \
          --tumor_bam_fn ${input_dir}/${id}.hg38.bam \
          --sample_name ${id} \
          --ref_fn ${ref_dir}/hg38.fa \
          --threads ${params.threads} \
          --platform ${params.platform} \
          --output_dir ${output_dir} \
          --conda_prefix /opt/micromamba/envs/clairs-to

All the previous steps run fine and then it runs into this error at the phasing step:

[INFO] Phase the Tumor BAM
[INFO] RUN THE FOLLOWING COMMAND:
( parallel --joblog /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/logs/phasing_log/parallel_2_phase_tumor.log -j 8 /opt/micromamba/envs/clairs-to/bin/longphase phase  -s /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/phasing_output/vcf/{1}.vcf -b /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/bam/392718.calls.mods.sorted.hg38.bam -r /b06x-isilon/b06x-m/mnp_nanopore/software/hg38/hg38.fa -t 8 -o /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/phasing_output/phased_vcf_output/tumor_phased_{1} --ont :::: /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/CONTIGS && parallel -j 8 bgzip -f /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/phasing_output/phased_vcf_output/tumor_phased_{1}.vcf :::: /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/CONTIGS ) 2>&1 | tee /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/logs/phasing_log/2_phase_tumor.log && parallel -j 8 tabix -f -p vcf /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/phasing_output/phased_vcf_output/tumor_phased_{1}.vcf.gz :::: /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/CONTIGS

parsing VCF ... [W::vcf_parse] Contig '##contig=<ID=chr1,length=248387328>' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::bcf_hrec_check] Invalid contig name: "##contig=<ID=chr1,length=248387328>"
pos 0 missing GT value
parsing VCF ... [W::vcf_parse] Contig '##contig=<ID=chr1,length=248387328>' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::bcf_hrec_check] Invalid contig name: "##contig=<ID=chr1,length=248387328>"
pos 0 missing GT value
parsing VCF ... [W::vcf_parse] Contig '##contig=<ID=chr1,length=248387328>' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::bcf_hrec_check] Invalid contig name: "##contig=<ID=chr1,length=248387328>"
pos 0 missing GT value

The full log file is attached: run_clairs_to.log

Could you please help me with this?

Many thanks,
Areeba.
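
The warnings above show a `##contig=...` header line being treated as a contig name, which is what a VCF parser might report if a header line ends up among the data records. A minimal diagnostic sketch, assuming the per-contig VCFs under tmp/phasing_output/vcf/ are plain text; the function name `find_header_problems` is hypothetical and not part of ClairS-TO or longphase:

```python
def find_header_problems(lines):
    """Scan VCF text lines for header lines that appear after the #CHROM line.

    Returns (has_column_header, misplaced), where misplaced is a list of
    (line_number, text) for every '##' line found after the column header.
    """
    seen_column_header = False
    misplaced = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("#CHROM"):
            seen_column_header = True
        elif line.startswith("##") and seen_column_header:
            # A meta line after #CHROM would be read as a data record.
            misplaced.append((number, line))
    return seen_column_header, misplaced
```

Running it over each `{contig}.vcf` (for example with `open(path)` as the `lines` argument) would show whether the header is out of order or the `#CHROM` line is missing. It does not say why the file was written that way.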

Different output

Hi,

First of all, thank you for ClairS-TO.
I have been trying to use it with the BED file option.
I get different output for the same sample when I re-run it.
In some runs, it skips certain positions of a chromosome. The positions are present in the per-chromosome VCF files under tmp, but some positions, or an entire chromosome, are missing from the merged output VCF. The final VCF also does not show phasing.

Thank you in advance.

Shreya
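
One way to pin down which sites go missing between the per-chromosome VCFs and the merged VCF is to compare their (CHROM, POS) pairs directly. The sketch below assumes plain or gzip/bgzip-compressed VCF text files; the helper names are hypothetical. A difference is not necessarily a bug, since the merge step may drop records deliberately.

```python
import gzip


def read_sites(path):
    """Collect (CHROM, POS) pairs from a plain or gzip-compressed VCF."""
    opener = gzip.open if str(path).endswith(".gz") else open
    sites = set()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom, pos = line.split("\t", 2)[:2]
            sites.add((chrom, int(pos)))
    return sites


def missing_from_merged(per_contig_paths, merged_path):
    """Return sorted sites present in any per-contig VCF but absent from the merged VCF."""
    merged = read_sites(merged_path)
    expected = set()
    for path in per_contig_paths:
        expected |= read_sites(path)
    return sorted(expected - merged)
```

Comparing the result across two runs of the same sample would show whether the same sites are lost each time or whether the loss itself varies.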

RefCall variants

Hi there! We are testing ClairS-TO with the --print_ref_calls option. My question is: what does the RefCall flag mean? Could you elaborate on it?

I am aware that in other software it can mean that the variant was proposed as a mutation but later rejected by the variant caller; however, I would appreciate it if you could provide further information on what it means in ClairS-TO.

Thank you very much and thank you as well for your tool!
Regards!

bcftools doesn't like tag name '1kGPoN'

I am getting the warning "[W::bcf_hrec_check] Invalid tag name: "1kGPoN"" when processing the VCF generated by the ClairS-TO pipeline.

My understanding is that bcftools doesn't like any tag that starts with a number.

Can you change the tag name to PoN instead in shared/vcf.py line 24 and src/nonsomatic_tagging.py line 294?

Thanks a lot in advance.
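
Until the tag is renamed upstream, a possible workaround is to rewrite the tag in the output VCF. The sketch below rests on two assumptions. First, the regular expression is my reading of the key-naming rule in the VCF 4.3 specification. Second, the issue does not say whether `1kGPoN` is declared as FILTER or INFO, so both columns are handled. The function names are hypothetical.

```python
import re

# Key pattern as I understand it from the VCF 4.3 specification (assumption).
VCF_KEY = re.compile(r"^([A-Za-z_][0-9A-Za-z_.]*|1000G)$")


def is_valid_key(name):
    """True if `name` matches the assumed VCF key pattern."""
    return VCF_KEY.match(name) is not None


def rename_tag(lines, old, new):
    """Yield VCF lines with tag `old` renamed to `new` in header, FILTER and INFO."""
    for line in lines:
        if line.startswith("##"):
            yield line.replace("ID=" + old + ",", "ID=" + new + ",")
        elif line.startswith("#"):
            yield line
        else:
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 8:
                # FILTER column: semicolon-separated filter names.
                cols[6] = ";".join(new if f == old else f for f in cols[6].split(";"))
                # INFO column: semicolon-separated KEY or KEY=VALUE items.
                info = []
                for item in cols[7].split(";"):
                    key, sep, value = item.partition("=")
                    info.append((new if key == old else key) + sep + value)
                cols[7] = ";".join(info)
            yield "\t".join(cols) + "\n"
```

For example, `is_valid_key("1kGPoN")` is false under this pattern because the name starts with a digit, while `is_valid_key("PoN")` is true.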

Documentation: training data to create a model

Hi @aquaskyline,

Thanks for developing such a great tool! I am currently testing your variant caller with ONT data from different organisms. It recognizes SNPs nicely, but I am having trouble with short deletions.
I was wondering whether it makes sense to train it on my own data to create a model. I saw that the Clair3 documentation explains how to train a model, but I did not find equivalent documentation for ClairS-TO.
If this is possible, could you please add documentation on how users can create their own models?

Thank you very much!

IndexError during STEP1

I got many IndexError exceptions while running ClairS-TO with the latest Docker image under Singularity:

Traceback (most recent call last):
  File "/opt/bin/clairs_to.py", line 107, in <module>
    main()
  File "/opt/bin/clairs_to.py", line 101, in main
    submodule.main()
  File "/opt/bin/src/extract_candidates_calling.py", line 597, in main
    extract_pair_candidates(args)
  File "/opt/bin/src/extract_candidates_calling.py", line 341, in extract_pair_candidates
    select_indel_candidates=select_indel_candidates
  File "/opt/bin/src/extract_candidates_calling.py", line 91, in decode_pileup_bases
    base_list[-1][1] = base + pileup_bases[base_idx: base_idx + advance]  # add indel seq
IndexError: list index out of range

I wonder whether this is expected, or whether I am actually losing these candidate indels because of the error.
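
The failing statement indexes `base_list[-1]`, which raises IndexError when `base_list` is empty, for example if an indel token is met before any base has been appended. Whether that is what happens here cannot be told from the traceback alone. The sketch below is a simplified, hypothetical decoder for a samtools mpileup base column, not the ClairS-TO code. It shows a guard that counts such orphan indels instead of crashing.

```python
def decode_pileup_bases(pileup_bases):
    """Decode an mpileup base string into [base, indel] pairs.

    Returns (base_list, skipped), where skipped counts indel tokens
    that had no preceding base to attach to.
    """
    base_list = []  # each entry is [base, indel_sequence]
    skipped = 0
    idx = 0
    while idx < len(pileup_bases):
        ch = pileup_bases[idx]
        if ch == "^":        # read start marker plus one mapping-quality char
            idx += 2
        elif ch == "$":      # read end marker
            idx += 1
        elif ch in "+-":     # indel: sign, length digits, then the sequence
            idx += 1
            start = idx
            while idx < len(pileup_bases) and pileup_bases[idx].isdigit():
                idx += 1
            if start == idx:
                continue     # no length digits followed the sign; ignore it
            length = int(pileup_bases[start:idx])
            seq = pileup_bases[idx:idx + length]
            idx += length
            if base_list:
                base_list[-1][1] = ch + seq
            else:
                skipped += 1  # guard: nothing to attach the indel to
        else:
            base_list.append([ch, ""])
            idx += 1
    return base_list, skipped
```

With a guard like this the indel is still lost for that position, but the loss is counted rather than aborting the whole chunk.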

F1 scores and high coverage datasets

Hi there! We're currently testing ClairS-TO on ONT DNA reads, using a variant truth set to evaluate SNP calls. We have noticed, however, that the F1 score decreases considerably whenever the coverage is above 1000x (e.g. an F1 of 0.90 at 1000x coverage versus 0.5 at higher coverage).

Therefore, I wanted to ask: could there be a reason for this behaviour?

Thank you very much in advance for any reply, and thank you for ClairS-TO.
Cheers!
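
Since F1 combines precision and recall, it can help to report the two separately at each coverage level to see whether the drop comes from extra false positives or from missed variants. A small sketch of the standard definitions, with a hypothetical function name:

```python
def f1_score(tp, fp, fn):
    """Return (precision, recall, f1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)
```

An F1 near 0.5 can arise from very different situations, such as high precision with low recall or the reverse, so the breakdown is usually more informative than the single number.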

Value Error in Step 2-3 CALL_VARIANTS

Hi, I am using ClairS-TO to call SNVs in Illumina (ilmn) tumor data. In step 2-3 (Pileup Model Calling Variants), roughly 6 hours into the job, I run into this problem:

[INFO] Pileup Model Calling Variants
[INFO] RUN THE FOLLOWING COMMAND:
( parallel --joblog /hpc/pmc_holstege/rstudio/oscar_nanopore/snv_clairsto/SNV_called/logs/parallel_2-3_call_variants.log -j 24 python3 /opt/bin/clairs_to.py call_variants --predict_fn /hpc/pmc_holstege/rstudio/oscar_nanopore/snv_clairsto/SNV_called/tmp/predict/{1/} --call_fn /hpc/pmc_holstege/rstudio/oscar_nanopore/snv_clairsto/SNV_called/tmp/vcf_output/p_{1/}.vcf --platform ilmn --likelihood_matrix_data /opt/micromamba/envs/clairs-to/bin/clairs-to_models/ilmn/likelihood_matrix.txt :::: /hpc/pmc_holstege/rstudio/oscar_nanopore/snv_clairsto/SNV_called/tmp/candidates/CANDIDATES_FILES ) 2>&1 | tee /hpc/pmc_holstege/rstudio/oscar_nanopore/snv_clairsto/SNV_called/logs/2-3_CALL_VARIANTS.log

[INFO] Calling tumor-only somatic variants ...
[INFO] Total time elapsed: 0.04 s
[INFO] Calling tumor-only somatic variants ...
Traceback (most recent call last):
  File "/opt/bin/clairs_to.py", line 107, in <module>
    main()
  File "/opt/bin/clairs_to.py", line 101, in main
    submodule.main()
  File "/opt/bin/clairs/call_variants.py", line 646, in main
    call_variants_from_probability(args)
  File "/opt/bin/clairs/call_variants.py", line 567, in call_variants_from_probability
    output_vcf_from_probability(
  File "/opt/bin/clairs/call_variants.py", line 234, in output_vcf_from_probability
    best_match_alt_list, tumor_supported_reads_count_list = rank_variant_alt(
ValueError: too many values to unpack (expected 2)

The traceback is repeated hundreds of times. Step 3 seems to work fine despite this issue, but the final VCF file doesn't contain any SNVs.

Do you have any suggestions on what might be the cause? Thanks.

Get a large number of variants from ClairS-TO

Hi, I am using ClairS-TO to call somatic small variants from PacBio Revio tumor data. ClairS-TO detected ~28000 variants, which is many more than I expected. I am not sure whether many of them are false positives, so I am posting this issue to ask whether this is normal when using ClairS-TO.

Thanks!
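
A quick first look at a large call set is to tally the FILTER column, which shows how many of the records passed and how many carry some other filter label. This sketch relies only on the standard VCF column layout and assumes nothing about which filter names ClairS-TO writes. The function name is hypothetical.

```python
import gzip
from collections import Counter


def tally_filters(path):
    """Count records per FILTER value in a plain or gzip-compressed VCF."""
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header lines
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 7:
                counts[cols[6]] += 1  # FILTER is the seventh column
    return counts
```

Calling `tally_filters("output.vcf.gz").most_common()` lists the labels by frequency. It does not say which calls are false positives, only how the total breaks down.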

Error at Haplotag BAM step

Thank you for this important piece of software. I am using a fresh pull of the Singularity image for ClairS-TO, but at the haplotagging stage I get the output below.

I've made sure I am not loading a local instance of conda before running the pipeline on a Slurm node, but the image seems to be picking up pysam from my local installation rather than the copy inside the container, which it can see, since the other steps run fine.

Command used to run ($A represents sample ID passed from slurm):

singularity exec ~/beggsa-clinicalnanopore/software/clairs-to_latest.sif /opt/bin/run_clairs_to -T "$A".sorted.bam -R ~/beggsa-clinicalnanopore/genomes/grch38/genome.fa -o clairs-to/ -t 32 -p ont_r10_dorado_hac_4khz --conda_prefix /opt/micromamba/envs/clairs-to
OS: RedHat Enterprise 8.3

Output:

parsing contig/chromosome: chrEBV ... skip
writeResult ... 2s

[INFO] Haplotag the Tumor BAM
[INFO] RUN THE FOLLOWING COMMAND:
( parallel --joblog /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/logs/phasing_log/parallel_3_haplotag_tumor.log -j 32 /opt/micromamba/envs/clairs-to/bin/whatshap haplotag --output /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/tmp/phasing_output/phased_bam_output/tumor_{1}.bam --reference /rds/homes/b/beggsa/beggsa-clinicalnanopore/genomes/grch38/genome.fa --regions {1}  --ignore-read-groups /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/tmp/phasing_output/phased_vcf_output/tumor_phased_{1}.vcf.gz /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/S475276.sorted.bam :::: /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/tmp/CONTIGS ) 2>&1 | tee /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/logs/phasing_log/3_tumor_haplotag.log && parallel -j 32 samtools index  -@32 /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/tmp/phasing_output/phased_bam_output/tumor_{1}.bam :::: /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/tmp/CONTIGS

Traceback (most recent call last):
  File "/opt/micromamba/envs/clairs-to/bin/whatshap", line 7, in <module>
    from whatshap.__main__ import main
  File "/opt/micromamba/envs/clairs-to/lib/python3.9/site-packages/whatshap/__main__.py", line 7, in <module>
    import whatshap.cli as cli_package
  File "/opt/micromamba/envs/clairs-to/lib/python3.9/site-packages/whatshap/cli/__init__.py", line 5, in <module>
    from whatshap.bam import (
  File "/opt/micromamba/envs/clairs-to/lib/python3.9/site-packages/whatshap/bam.py", line 6, in <module>
    import pysam
  File "/rds/homes/b/beggsa/.local/lib/python3.9/site-packages/pysam/__init__.py", line 4, in <module>
    from pysam.libchtslib import *
ImportError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.27' not found (required by /rds/homes/b/beggsa/.local/lib/python3.9/site-packages/pysam/../pysam.libs/libssh-f4fb36c6.so.4.8.7)
Traceback (most recent call last):
  File "/opt/micromamba/envs/clairs-to/bin/whatshap", line 7, in <module>
    from whatshap.__main__ import main
  File "/opt/micromamba/envs/clairs-to/lib/python3.9/site-packages/whatshap/__main__.py", line 7, in <module>
    import whatshap.cli as cli_package
  File "/opt/micromamba/envs/clairs-to/lib/python3.9/site-packages/whatshap/cli/__init__.py", line 5, in <module>
    from whatshap.bam import (
  File "/opt/micromamba/envs/clairs-to/lib/python3.9/site-packages/whatshap/bam.py", line 6, in <module>
    import pysam
  File "/rds/homes/b/beggsa/.local/lib/python3.9/site-packages/pysam/__init__.py", line 4, in <module>
    from pysam.libchtslib import *
ImportError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.27' not found (required by /rds/homes/b/beggsa/.local/lib/python3.9/site-packages/pysam/../pysam.libs/libssh-f4fb36c6.so.4.8.7)
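
In the traceback, pysam is loaded from /rds/homes/b/beggsa/.local/lib/python3.9/site-packages, which is Python's per-user site directory rather than the container's environment. The diagnostic sketch below reports where a module would be imported from without actually importing it. It is meant to be run with the container's own interpreter. The function name `describe_import` is hypothetical.

```python
import importlib.util
import os
import site


def describe_import(module_name):
    """Report where `module_name` would be loaded from and the user-site settings."""
    spec = importlib.util.find_spec(module_name)  # locates the module, does not run it
    origin = spec.origin if spec is not None else None
    user_site = site.getusersitepackages()
    return {
        "module": module_name,
        "origin": origin,
        "user_site": user_site,
        "user_site_enabled": bool(site.ENABLE_USER_SITE),
        "PYTHONNOUSERSITE": os.environ.get("PYTHONNOUSERSITE"),
        "from_user_site": bool(origin) and origin.startswith(user_site),
    }


if __name__ == "__main__":
    for key, value in describe_import("pysam").items():
        print(f"{key}: {value}")
```

If `from_user_site` is true, one thing to try is setting the environment variable PYTHONNOUSERSITE=1 before launching the pipeline, which tells Python not to add the per-user site directory to its search path. Whether that alone resolves the haplotag step here is an assumption that would need to be confirmed.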

IndexError in STEP5

I am getting the following IndexError in STEP 5 while running ClairS-TO with the latest Docker image. I am using PacBio HiFi data (haplotagged.bam file) as the input.

Traceback (most recent call last):
  File "/opt/bin/clairs_to.py", line 107, in <module>
    main()
  File "/opt/bin/clairs_to.py", line 101, in main
    submodule.main()
  File "/opt/bin/src/postprocess_vcf.py", line 231, in main
    merge_vcf(args)
  File "/opt/bin/src/postprocess_vcf.py", line 99, in merge_vcf
    qual = float(columns[5])
IndexError: list index out of range

Even with the error, the final output file output.vcf.gz is generated, but I am concerned about its contents. Could you please help me with this?
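
The failing statement reads `columns[5]`, so the error implies that some line reaching the merge step has fewer than six fields. Locating those lines in the intermediate VCFs may show which records are affected. The sketch below assumes tab-separated VCF text and uses hypothetical helper names. It is not the ClairS-TO implementation.

```python
def parse_qual(line):
    """Return QUAL as a float, or None for headers, short lines, or a missing ('.') QUAL."""
    if line.startswith("#"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 6 or columns[5] == ".":
        return None
    return float(columns[5])


def find_short_records(lines):
    """Return (line_number, text) for non-header lines with fewer than six columns."""
    short = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.startswith("#") and len(text.split("\t")) < 6:
            short.append((number, text))
    return short
```

Blank lines are reported too, since an empty line would also produce a list with no sixth element.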
