zengxiaofei / haphic

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data

License: BSD 3-Clause "New" or "Revised" License

Languages: Shell 0.25%, Python 99.75%
Topics: assembly, chromosome, genome, hi-c, scaffolding, haplotype, phasing, 3d-dna, lachesis, salsa

haphic's People

Contributors: zengxiaofei

haphic's Issues

Performance not so good for a haplotype-collapsed assembly

Hi,

I am trying to use HapHiC to anchor a haplotype-collapsed genome (hifiasm p_ctg), and I found that HapHiC does not perform very well, especially for contig ordering.
The Hi-C data were mapped following the HiC-Pro 3 pipeline.

Here is the HapHiC result heatmap:
[image: HapHiC heatmap]

And here is the EndHiC + manual curation result heatmap:
[image: EndHiC + manual curation heatmap]

If you need the Hi-C (or BAM) and contig data for testing, I will upload them to Baidu Netdisk and send you an e-mail (for testing only).

Best,
Kun

403 FORBIDDEN occurred when installing dependencies using conda

Hello, when I ran
conda env create -f HapHiC/conda_env/environment_py312.yml to install HapHiC on a cluster, I got the following error:
Retrieving notices: ...working... done
Channels:

UnavailableInvalidChannel: HTTP 403 FORBIDDEN for channel intel https://conda.anaconda.org/intel

The channel is not accessible or is invalid.

You will need to adjust your conda configuration to proceed.
Use `conda config --show channels` to view your configuration's current state,
and use `conda config --show-sources` to view config file locations.
To achieve a successful installation, I tried using micromamba and also tried adding "https://software.repos.intel.com/python/conda", but both attempts produced similar 403 errors. Searching for HapHiC directly with micromamba also found nothing.

How should I resolve this problem?
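One possible workaround, sketched below under the assumption that the 403 comes only from the intel channel entry in the environment file, is to drop that channel and let conda resolve the remaining packages from the other channels:

# Sketch of a workaround (assumes the environment file lists "- intel" under
# "channels:"; adjust the pattern if the file is formatted differently).
sed -i.bak '/^ *- intel$/d' HapHiC/conda_env/environment_py312.yml
conda env create -f HapHiC/conda_env/environment_py312.yml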

haphic plot

Hi Xiaofei,

When I used haphic plot to generate the heatmap after manual curation, I found that the output was still the original one. When I imported the reviewed assembly file into Juicebox, the heatmap remained unmodified as well. I checked the .assembly and .review.assembly files and both are correct. Would you happen to know what's happening?
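For comparison, here is a minimal sketch of regenerating the heatmap from the post-curation AGP (file names are assumptions borrowed from other issues in this thread; haphic plot draws whatever AGP it is given):

# Hypothetical: plot from the post-curation AGP produced by "juicer post",
# instead of the pre-curation out_JBAT.assembly.agp.
/path/to/HapHiC/haphic plot out_JBAT.FINAL.agp HiC.filtered.bam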

Thanks in advance for your help!

Best regards,
Yao

alternatives to Intel channel?

Hi,
I was wondering whether there is a way to install HapHiC without using the Intel channel. I am working on a cluster and do not have permission to activate that channel. It should be possible to install mkl and the others using other channels, but I prefer to ask because of version compatibility.
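A hedged sketch of one such alternative, assuming the three Intel-channel packages (intel-openmp, mkl, tbb, as listed in another issue below) are also available on conda-forge:

# Sketch: pull the Intel-channel packages from conda-forge instead.
# Availability and version compatibility on conda-forge are assumptions to verify.
conda install -n haphic -c conda-forge intel-openmp mkl tbb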

Kind regards,
Diego

Issue generating the final FASTA file after manual curation in Juicebox

Dear Xiaofei,

Thanks for providing this useful tool. I assembled the genome (2n=16, 8 chromosomes) by HiFi reads only using hifiasm.

stats for asm.bp.p_ctg.fasta
sum = 255015880, n = 387, ave = 658955.76, largest = 30935753
N50 = 11339244, n = 7
N60 = 10070553, n = 9
N70 = 9531125, n = 12
N80 = 7505767, n = 15
N90 = 4309235, n = 19
N100 = 20579, n = 387
N_count = 0
Gaps = 0

Then I ran HapHiC with the one-line command: /path/to/HapHiC/haphic pipeline asm.bp.p_ctg.fasta HiC.filtered.bam 8
and then I used bash juicebox.sh to generate the .hic and .assembly files for manual curation.
After manual curation in Juicebox, I ran /path/to/HapHiC/utils/juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp asm.bp.p_ctg.fasta to generate the final FASTA file.

I found an issue with the FINAL FASTA file.
stats for out_JBAT.FINAL.fa
sum = 372880511, n = 9, ave = 41431167.89, largest = 136096530
N50 = 33899492, n = 3
N60 = 30935753, n = 4
N70 = 30117002, n = 5
N80 = 29705702, n = 7
N90 = 24223963, n = 8
N100 = 23588956, n = 9
N_count = 38700
Gaps = 387

I do not know why the total bases increased so much. The size of scaffolds 1-8 (~30 Mb) looks good according to the final heatmap (attachment), but there is a huge scaffold 9 of 136 Mb. I do not know how this scaffold 9 was created. Do you know if the scaffold 1-8 sequences are correct?
contact_map.pdf
separate_plots.pdf

I did not have any problem when running the pipeline; there was no error report. I also made the plot using out_JBAT.FINAL.agp, and it also looks good to me.
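As a quick sanity check, the input and output assemblies can be compared directly (a sketch; it assumes seqkit is installed, though any assembly-stats tool would do):

# The ~118 Mb difference in total bases should show up in these summary stats.
seqkit stats asm.bp.p_ctg.fasta out_JBAT.FINAL.fa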

Thanks,
Haoran


Failed to use haphic.

Hi. I used the following command to scaffold my genome: haphic pipeline asm_ctgs_m.fa HiC.filtered.bam 123 --correct_nrounds 2 --remove_allelic_links 3 --threads 40 --processes 40. The program encountered a problem after the clustering step, and it seems that the grouping failed. However, I am also unable to use ALLHiC to group the sequences. This genome is considered to be an AAB-type triploid, and I have already scaffolded it with 3d-dna and YaHS, managing to separate out 123 chromosomes. I have tried three assembly modes: hifiasm hap1+hap2, HiCanu, and Peregrine, but none of these assemblies could successfully run HapHiC. Could you provide some advice? Thank you.

[image: peregrine + YaHS heatmap]

[image: hifiasm hap merge + YaHS heatmap]

HapHiC log:

2024-05-19 15:48:00 <HapHiC_pipeline.py> [main] Pipeline started, HapHiC version: 1.0.3 (update: 2024.05.08)
2024-05-19 15:48:00 <HapHiC_pipeline.py> [main] Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
2024-05-19 15:48:00 <HapHiC_pipeline.py> [haphic_cluster] Step1: Execute preprocessing and Markov clustering for contigs...
2024-05-19 15:48:01 <HapHiC_cluster.py> [run] Program started, HapHiC version: 1.0.3 (update: 2024.05.08)
2024-05-19 15:48:01 <HapHiC_cluster.py> [run] Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
2024-05-19 15:48:01 <HapHiC_cluster.py> [detect_format] The file for Hi-C read alignments is detected as being in BAM format
2024-05-19 15:48:01 <HapHiC_cluster.py> [run] Ultra-long data are not supported now when assembly correction is enabled
2024-05-19 15:48:01 <HapHiC_cluster.py> [parse_fasta] Parsing input FASTA file...
2024-05-19 15:48:13 <HapHiC_cluster.py> [parse_bam_for_correction] Parsing input BAM file for contig correction...
2024-05-19 16:07:10 <HapHiC_cluster.py> [correct_assembly] Performing assembly correction...
2024-05-19 16:41:32 <HapHiC_cluster.py> [correct_assembly] Correction round 1, breakpoints are detected in 11 contig(s)
2024-05-19 16:41:32 <HapHiC_cluster.py> [break_and_update_ctgs] Breaking contigs and updating data...
2024-05-19 16:41:32 <HapHiC_cluster.py> [correct_assembly] Correction round 2, breakpoints are detected in 7 contig(s)
2024-05-19 16:41:32 <HapHiC_cluster.py> [break_and_update_ctgs] Breaking contigs and updating data...
2024-05-19 16:41:32 <HapHiC_cluster.py> [correct_assembly] Generating corrected assembly file...
2024-05-19 16:41:32 <HapHiC_cluster.py> [correct_assembly] 11 contigs were broken into 29 contigs. Writing corrected assembly to corrected_asm.fa...
2024-05-19 16:41:43 <HapHiC_cluster.py> [stat_fragments] Making some statistics of fragments (contigs / bins)
2024-05-19 16:41:43 <HapHiC_cluster.py> [stat_fragments] bin_size is calculated to be 742887 bp
2024-05-19 16:41:52 <HapHiC_cluster.py> [parse_alignments] Parsing input alignments...
2024-05-19 16:53:43 <HapHiC_cluster.py> [output_pickle] Writing HT_link_dict to HT_links.pkl...
2024-05-19 16:53:43 <HapHiC_cluster.py> [output_clm] Writing clm_dict to paired_links.clm...
2024-05-19 16:53:43 <HapHiC_cluster.py> [filter_fragments] Filtering fragments...
2024-05-19 16:53:43 <HapHiC_cluster.py> [filter_fragments] [Nx filtering] 2953 fragments kept
2024-05-19 16:53:43 <HapHiC_cluster.py> [filter_fragments] [RE sites filtering] 0 fragments removed, 2953 fragments kept
2024-05-19 16:53:43 <HapHiC_cluster.py> [filter_fragments] [link density filtering] Parameter --density_lower 0.2X is set to "multiple" mode and equivalent to 0.9979681679647816 in "fraction" mode
2024-05-19 16:53:43 <HapHiC_cluster.py> [filter_fragments] [link density filtering] Parameter --density_upper 1.9X is set to "multiple" mode and equivalent to 1.0 in "fraction" mode
2024-05-19 16:53:43 <HapHiC_cluster.py> [filter_fragments] [link density filtering] 2947 fragments removed, 6 fragments kept
2024-05-19 16:53:43 <HapHiC_cluster.py> [filter_fragments] [rank sum filtering] Q1=26.0, median=26.0, Q3=26.0, IQR=Q3-Q1=0.0
2024-05-19 16:53:43 <HapHiC_cluster.py> [filter_fragments] [rank sum filtering] Parameter --rank_sum_upper 1.5X is set to "multiple" mode and equivalent to 1.0 in "fraction" mode
2024-05-19 16:53:43 <HapHiC_cluster.py> [filter_fragments] [rank sum filtering] 0 fragments removed, 6 fragments kept
2024-05-19 16:53:43 <HapHiC_cluster.py> [remove_allelic_HiC_links] Removing Hi-C links between alleic contig pairs...
2024-05-19 16:53:43 <HapHiC_cluster.py> [remove_allelic_HiC_links] Removing isolated fragments after filtering out allelic Hi-C links...
2024-05-19 16:53:43 <HapHiC_cluster.py> [remove_allelic_HiC_links] 0 fragments removed, 6 fragments kept
2024-05-19 16:53:43 <HapHiC_cluster.py> [output_pickle] Writing full_link_dict to full_links.pkl...
2024-05-19 16:53:43 <HapHiC_cluster.py> [run] Hi-C linking matrix was constructed in 3942.0703554153442s
2024-05-19 16:53:43 <HapHiC_cluster.py> [run_mcl_clustering] Performing Markov clustering...
2024-05-19 16:53:43 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 1.1, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:43 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 1.2, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:43 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 1.3, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:44 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 1.4, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:45 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 1.5, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:45 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 1.6, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:46 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 1.7, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:46 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 1.8, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:46 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 1.9, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:47 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 2.0, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:47 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 2.1, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:48 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 2.2, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:49 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 2.3, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:49 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 2.4, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:49 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 2.5, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:50 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 2.6, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:50 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 2.7, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:51 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 2.8, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:51 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 2.9, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:52 <HapHiC_cluster.py> [mcl] The matrix has converged after 3 rounds of iterations (expansion: 2, inflation: 3.0, maximum iterations: 200, pruning threshold: 0.0001)
2024-05-19 16:53:52 <HapHiC_cluster.py> [run_mcl_clustering] The maximum number of clusters (3) is even less than the expected number of chromosomes (123). You could try higher inflation.
2024-05-19 16:53:52 <HapHiC_cluster.py> [run] 20 round(s) of Markov clustering finished in 9.17163634300232s, average 0.458581817150116s per round
2024-05-19 16:53:52 <HapHiC_cluster.py> [output_statistics] Making some statistics for the next HapHiC reassignment step...
2024-05-19 16:54:11 <HapHiC_cluster.py> [run] Program finished in 3970.5241661071777s
Traceback (most recent call last):
  File "/home/sofware/HapHiC-main/scripts/HapHiC_pipeline.py", line 517, in <module>
    main()
  File "/home/sofware/HapHiC-main/scripts/HapHiC_pipeline.py", line 498, in main
    haphic_cluster(args)
  File "/home/sofware/HapHiC-main/scripts/HapHiC_pipeline.py", line 390, in haphic_cluster
    raise RuntimeError(
RuntimeError: Pipeline Abortion: Inflation recommendation failed. It seems that some chromosomes were grouped together, or the maximum number of clusters is even less than the expected number of chromosomes. For more details, please check out the logs.
Traceback (most recent call last):
  File "/home/sofware/HapHiC-main/haphic", line 110, in <module>
    subprocess.run(commands, check=True)
  File "/home/micromamba/envs/haphic/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/home/sofware/HapHiC-main/scripts/HapHiC_pipeline.py', 'asm_ctgs_m.fa', 'HiC.filtered.bam', '123', '--correct_nrounds', '2', '--remove_allelic_links', '3', '--threads', '40', '--processes', '40']' returned non-zero exit status 1.
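Given that the log above shows the link-density filter removing 2947 of 2953 fragments, a re-run with looser bounds may be worth trying. A sketch (the 0.1X and 3X values are illustrative assumptions; the flag names and the "X" multiple syntax are taken from the log itself):

# Loosen the link-density filter that discarded almost all fragments above.
haphic pipeline asm_ctgs_m.fa HiC.filtered.bam 123 \
    --correct_nrounds 2 --remove_allelic_links 3 \
    --density_lower 0.1X --density_upper 3X \
    --threads 40 --processes 40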

N region in plot

Hi,

I found that haphic plot deals with the signal in N regions in a wrong way.

Here is the haphic plot (the N region shows strong heat signal):
[image: haphic plot heatmap]

It should be:
[image: expected heatmap]

Heat signal in N regions may sometimes mislead readers, I think.

Best,
Kun

Question regarding Trio Binning files

Hi,

I am working with Trio Binning data, and I am wondering which hifiasm output I should use in HapHiC.

I concatenated the hap1 (.dip.hap1.p_ctg.fa) and hap2 (.dip.hap2.p_ctg.fa) contigs into a single file to use as the HapHiC input.
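For reference, that concatenation as a command (a sketch; "prefix" stands in for the actual hifiasm output prefix):

# Merge the two trio-binned haplotype assemblies into one HapHiC input file.
cat prefix.dip.hap1.p_ctg.fa prefix.dip.hap2.p_ctg.fa > allhaps.fa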

Is that the best strategy?

Thank you in advance,

Carolina

How to add/remove elements to a haphic plot?

Hello, @zengxiaofei
Thank you for developing this software.
I am new to HapHiC and have some questions about the plot function. I ran the following command: haphic plot out_JBAT.assembly.agp HiC.filtered.bam --bin_size 500 --min_len 5 --normalization log10 --border_style outline &
The generated plot is attached.
[image: generated contact map]
I am wondering how to generate a plot with group information on the y-axis, similar to the example you provided.
Thank you for your help.

The maximum number of clusters is even less than the expected number of chromosomes.

Hello, Dear Developer,

Thank you for your effort on developing HapHiC.
I think the problem started in #15.
1. The karyotype of the species:
2n=2x=18, plant.
2. My commands are as follows:

contig=XX.hap1_hap2.fa
hicread1=/public/data/quercifolia-XX/HiC/Unknown_BU274-001H0001_good_fastp_1.fq.gz
hicread2=/public/data/quercifolia-XX/HiC/Unknown_BU274-001H0001_good_fastp_2.fq.gz

ln -s $contig asm.fa
ln -s $hicread1 read1_fq.gz
ln -s $hicread2 read2_fq.gz



source /public/home/mambaforge/bin/activate haphic
# (1) Align Hi-C data to the assembly, remove PCR duplicates and filter out secondary and supplementary alignments
# bwa index asm.fa
 bwa mem -t 80 -5SP asm.fa read1_fq.gz read2_fq.gz | samblaster | samtools view - -@ 80 -S -h -b -F 3340 -o HiC.bam

# (2) Filter the alignments with MAPQ 1 (mapping quality ≥ 1) and NM 3 (edit distance < 3)
 filter_bam HiC.bam 1 --nm 3 --threads 80 | samtools view - -b -@ 80 -o HiC.filtered.bam



source /public/home/off/mambaforge/bin/activate haphic

haphic pipeline asm.fa HiC.filtered.bam 18  --RE "AAGCTT" --Nx 100 --min_group_len 1

This is a diploid genome with a low heterozygosity rate, around 0.5%. The XX.hap1_hap2.fa comes from:

cat quer.XX.asm.hic.hap1.p_ctg.fa quer.XX.asm.hic.hap2.p_ctg.fa > XX.hap1_hap2.fa

3. My 01.cluster step always errors out. When running, the following log is displayed:

cat HapHiC_cluster.log
2024-07-29 17:58:04 <HapHiC_cluster.py> [run] Program started, HapHiC version: 1.0.3 (update: 2024.03.26)
2024-07-29 17:58:04 <HapHiC_cluster.py> [run] Python version: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0]
2024-07-29 17:58:04 <HapHiC_cluster.py> [detect_format] The file for Hi-C read alignments is detected as being in BAM format
2024-07-29 17:58:04 <HapHiC_cluster.py> [parse_fasta] Parsing input FASTA file...
2024-07-29 17:58:13 <HapHiC_cluster.py> [stat_fragments] Making some statistics of fragments (contigs / bins)
2024-07-29 17:58:13 <HapHiC_cluster.py> [stat_fragments] bin_size is calculated to be 1416101 bp
2024-07-29 17:58:15 <HapHiC_cluster.py> [parse_alignments] Parsing input alignments...
2024-07-29 17:58:35 <HapHiC_cluster.py> [output_pickle] Writing HT_link_dict to HT_links.pkl...
2024-07-29 17:58:36 <HapHiC_cluster.py> [output_clm] Writing clm_dict to paired_links.clm...
2024-07-29 17:58:37 <HapHiC_cluster.py> [filter_fragments] Filtering fragments...
2024-07-29 17:58:37 <HapHiC_cluster.py> [filter_fragments] [Nx filtering] 1333 fragments kept
2024-07-29 17:58:37 <HapHiC_cluster.py> [filter_fragments] [RE sites filtering] 105 fragments removed, 1228 fragments kept
2024-07-29 17:58:37 <HapHiC_cluster.py> [filter_fragments] [link density filtering] Parameter --density_lower 0.2X is set to "multiple" mode and equivalent to 0.6726384364820847 in "fraction" mode
2024-07-29 17:58:37 <HapHiC_cluster.py> [filter_fragments] [link density filtering] Parameter --density_upper 1.9X is set to "multiple" mode and equivalent to 0.9666123778501629 in "fraction" mode
2024-07-29 17:58:37 <HapHiC_cluster.py> [filter_fragments] [link density filtering] 867 fragments removed, 361 fragments kept
2024-07-29 17:58:37 <HapHiC_cluster.py> [filter_fragments] [rank sum filtering] Q1=3402.0, median=3978.0, Q3=4555.0, IQR=Q3-Q1=1153.0
2024-07-29 17:58:37 <HapHiC_cluster.py> [filter_fragments] [rank sum filtering] Parameter --rank_sum_upper 1.5X is set to "multiple" mode and equivalent to 0.9944598337950139 in "fraction" mode
2024-07-29 17:58:37 <HapHiC_cluster.py> [filter_fragments] [rank sum filtering] 2 fragments removed, 359 fragments kept
2024-07-29 17:58:37 <HapHiC_cluster.py> [output_pickle] Writing full_link_dict to full_links.pkl...
2024-07-29 17:58:37 <HapHiC_cluster.py> [run] Hi-C linking matrix was constructed in 32.92142605781555s
2024-07-29 17:58:37 <HapHiC_cluster.py> [run_mcl_clustering] Performing Markov clustering...
2024-07-29 17:58:39 <HapHiC_cluster.py> [mcl] The matrix has converged after 60 rounds of iterations (expansion: 2, inflation: 1.1, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:39 <HapHiC_cluster.py> [mcl] The matrix has converged after 32 rounds of iterations (expansion: 2, inflation: 1.2, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:39 <HapHiC_cluster.py> [mcl] The matrix has converged after 23 rounds of iterations (expansion: 2, inflation: 1.3, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:40 <HapHiC_cluster.py> [mcl] The matrix has converged after 18 rounds of iterations (expansion: 2, inflation: 1.4, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:40 <HapHiC_cluster.py> [mcl] The matrix has converged after 15 rounds of iterations (expansion: 2, inflation: 1.5, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:40 <HapHiC_cluster.py> [mcl] The matrix has converged after 13 rounds of iterations (expansion: 2, inflation: 1.6, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:40 <HapHiC_cluster.py> [mcl] The matrix has converged after 12 rounds of iterations (expansion: 2, inflation: 1.7, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:40 <HapHiC_cluster.py> [mcl] The matrix has converged after 10 rounds of iterations (expansion: 2, inflation: 1.8, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:40 <HapHiC_cluster.py> [mcl] The matrix has converged after 10 rounds of iterations (expansion: 2, inflation: 1.9, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:41 <HapHiC_cluster.py> [mcl] The matrix has converged after 9 rounds of iterations (expansion: 2, inflation: 2.0, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:41 <HapHiC_cluster.py> [mcl] The matrix has converged after 8 rounds of iterations (expansion: 2, inflation: 2.1, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:41 <HapHiC_cluster.py> [mcl] The matrix has converged after 10 rounds of iterations (expansion: 2, inflation: 2.2, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:41 <HapHiC_cluster.py> [mcl] The matrix has converged after 7 rounds of iterations (expansion: 2, inflation: 2.3, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:41 <HapHiC_cluster.py> [mcl] The matrix has converged after 10 rounds of iterations (expansion: 2, inflation: 2.4, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:41 <HapHiC_cluster.py> [mcl] The matrix has converged after 7 rounds of iterations (expansion: 2, inflation: 2.5, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:41 <HapHiC_cluster.py> [mcl] The matrix has converged after 9 rounds of iterations (expansion: 2, inflation: 2.6, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:41 <HapHiC_cluster.py> [mcl] The matrix has converged after 8 rounds of iterations (expansion: 2, inflation: 2.7, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:41 <HapHiC_cluster.py> [mcl] The matrix has converged after 8 rounds of iterations (expansion: 2, inflation: 2.8, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:41 <HapHiC_cluster.py> [mcl] The matrix has converged after 8 rounds of iterations (expansion: 2, inflation: 2.9, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:41 <HapHiC_cluster.py> [mcl] The matrix has converged after 8 rounds of iterations (expansion: 2, inflation: 3.0, maximum iterations: 200, pruning threshold: 0.0001)
2024-07-29 17:58:41 <HapHiC_cluster.py> [run_mcl_clustering] The maximum number of clusters (1) is even less than the expected number of chromosomes (18). You could try higher inflation.
2024-07-29 17:58:41 <HapHiC_cluster.py> [run] 20 round(s) of Markov clustering finished in 4.332892656326294s, average 0.2166446328163147s per round
2024-07-29 17:58:41 <HapHiC_cluster.py> [output_statistics] Making some statistics for the next HapHiC reassignment step...
2024-07-29 17:59:03 <HapHiC_cluster.py> [run] Program finished in 58.50718808174133s

The HiC.filtered.bam:

 samtools view HiC.filtered.bam | cut -f 1-5 | head -20
LH00330:133:222K77LT4:3:1101:0:19472    81      h1tg000012l     2194749 8
LH00330:133:222K77LT4:3:1101:0:19472    161     h1tg000012l     2194704 8
LH00330:133:222K77LT4:3:1101:0:20479    81      h1tg000012l     1773068 60
LH00330:133:222K77LT4:3:1101:0:20479    161     h1tg000012l     1773006 60
LH00330:133:222K77LT4:3:1101:0:22016    97      h1tg000028l     9173384 8
LH00330:133:222K77LT4:3:1101:0:22016    145     h1tg000028l     9173551 9
LH00330:133:222K77LT4:3:1101:0:22341    81      h1tg000012l     5980285 31
LH00330:133:222K77LT4:3:1101:0:22341    161     h1tg000012l     5980155 39
LH00330:133:222K77LT4:3:1101:0:24216    97      h1tg000024l     4949822 60
LH00330:133:222K77LT4:3:1101:0:24216    145     h1tg000024l     4949948 60
LH00330:133:222K77LT4:3:1101:0:27048    81      h2tg000005l     6148297 8
LH00330:133:222K77LT4:3:1101:0:27048    161     h2tg000005l     6148247 8
LH00330:133:222K77LT4:3:1101:0:27837    97      h2tg000002l     31226689        8
LH00330:133:222K77LT4:3:1101:0:27837    145     h2tg000002l     31226820        8
LH00330:133:222K77LT4:3:1101:0:29062    81      h1tg000028l     4129513 21
LH00330:133:222K77LT4:3:1101:0:29062    161     h1tg000028l     4129420 28
LH00330:133:222K77LT4:3:1101:0:29686    81      h1tg000005l     8032493 1
LH00330:133:222K77LT4:3:1101:0:29686    161     h1tg000005l     8032493 2
LH00330:133:222K77LT4:3:1101:0:31360    81      h2tg000013l     11094454        21
LH00330:133:222K77LT4:3:1101:0:31360    161     h2tg000013l     11094235        60

4. The methods (commands) used for Hi-C read mapping and filtering:

# (2) Filter the alignments with MAPQ 1 (mapping quality ≥ 1) and NM 3 (edit distance < 3)
 filter_bam HiC.bam 1 --nm 3 --threads 80 | samtools view - -b -@ 80 -o HiC.filtered.bam

5. The method used for genome assembly (e.g., hifiasm + Hi-C) and the assembly utilized for scaffolding (e.g., p_ctg, hap*.p_ctg or p_utg):
I used hap*.p_ctg (cat hap1.bp.p_ctg.fa hap2.bp.p_ctg.fa > hap1_hap2.fa). I also used bp.p_ctg.fa alone as contig.fa, and the same error was reported: 9 clusters could not be obtained. The same error occurs when using hifiasm + Hi-C to obtain hap*.p_ctg or p_ctg as asm.fa.

6. Statistics for the assembly input into HapHiC:

#### base statistics ####
A: 248645956, T: 248880267, C: 133895942, G: 133272425, N: 0
GC%: 0.34937917764005627

#### contig statistics ####
Nx      Number  Length
N10     3       36168237
N20     5       35644004
N30     7       30857360
N40     11      18353511
N50     15      17326354
N60     19      16175814
N70     25      12363138
N80     31      10390621
N90     60      668910
longest contig: 38426148, shortest contig: 17553
total number: 877, total length: 764694590

#### scaffold statistics ####
Nx      Number  Length
N10     3       36168237
N20     5       35644004
N30     7       30857360
N40     11      18353511
N50     15      17326354
N60     19      16175814
N70     25      12363138
N80     31      10390621
N90     60      668910
longest scaffold: 38426148, shortest scaffold: 17553
total number: 877, total length: 764694590

#### gapless scaffold statistics ####
Nx      Number  Length
N10     3       36168237
N20     5       35644004
N30     7       30857360
N40     11      18353511
N50     15      17326354
N60     19      16175814
N70     25      12363138
N80     31      10390621
N90     60      668910
longest gapless scaffold: 38426148, shortest gapless scaffold: 17553
total number: 877, total length: 764694590


I tried to use hap1 or p_ctg.fa to run the haphic pipeline with default parameters, but in the first step, 01.cluster reported an error that the number of clustered groups was less than nchr. The hifiasm + Hi-C log file and fa_detail file are as follows:
HapHiC_cluster_hifiasm(hic)_pctg.log
asm.hic.p_ctg_fa_detail.log

Then, after I added the following parameters, the 02.reassign step aborted. The log files are as follows:
HapHiC_cluster.log
HapHiC_reassign.log

How should I adjust the parameters? Thank you again.

Is it normal for there to be interactions between haps?

Q:
HapHiC is very convenient for processing Hi-C data.
We combined the data of the two haps together and then processed it with HapHiC. The resulting Hi-C heatmap shows some strong interactions between the two haps (as shown in the figure). This situation occurs with both --phasing_weight 0 and 1. I would like to ask whether this is normal.
[image: Hi-C heatmap showing inter-haplotype interactions]

The karyotype of the species: 2n=100
The commands and parameters used in running HapHiC: haphic pipeline genome.fas HiC.filtered.bam ${nchrs} --quick_view --threads 20 --processes 60 --gfa "hap1.p_ctg.gfa,hap2.p_ctg.gfa" --phasing_weight 0 --correct_nrounds 2
The log files generated by HapHiC:
2024-06-27 08:50:28 <HapHiC_pipeline.py> [main] Pipeline started, HapHiC version: 1.0.3 (update: 2024.05.20)
2024-06-27 08:50:28 <HapHiC_pipeline.py> [main] Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
2024-06-27 08:50:28 <HapHiC_pipeline.py> [haphic_cluster] Step1: Execute preprocessing and Markov clustering for contigs...
2024-06-27 08:50:28 <HapHiC_cluster.py> [run] Program started, HapHiC version: 1.0.3 (update: 2024.05.20)
2024-06-27 08:50:28 <HapHiC_cluster.py> [run] Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
2024-06-27 08:50:28 <HapHiC_cluster.py> [detect_format] The file for Hi-C read alignments is detected as being in BAM format
2024-06-27 08:50:28 <HapHiC_cluster.py> [run] Ultra-long data are not supported now when assembly correction is enabled
2024-06-27 08:50:28 <HapHiC_cluster.py> [parse_fasta] Parsing input FASTA file...
2024-06-27 08:51:14 <HapHiC_cluster.py> [parse_gfa] Parsing input gfa file(s)...
2024-06-27 08:51:41 <HapHiC_cluster.py> [parse_bam_for_correction] Parsing input BAM file for contig correction...
2024-06-27 09:21:02 <HapHiC_cluster.py> [correct_assembly] Performing assembly correction...
2024-06-27 09:22:31 <HapHiC_cluster.py> [correct_assembly] Correction round 1, breakpoints are detected in 6 contig(s)
2024-06-27 09:22:31 <HapHiC_cluster.py> [break_and_update_ctgs] Breaking contigs and updating data...
2024-06-27 09:22:35 <HapHiC_cluster.py> [correct_assembly] Correction round 2, breakpoints are detected in 0 contig(s)
2024-06-27 09:22:35 <HapHiC_cluster.py> [correct_assembly] Generating corrected assembly file...
2024-06-27 09:22:35 <HapHiC_cluster.py> [correct_assembly] 6 contigs were broken into 12 contigs. Writing corrected assembly to corrected_asm.fa...
2024-06-27 09:22:59 <HapHiC_cluster.py> [stat_fragments] Making some statistics of fragments (contigs / bins)
2024-06-27 09:22:59 <HapHiC_cluster.py> [stat_fragments] bin_size is set to 0, no fragments will be split
2024-06-27 09:23:01 <HapHiC_cluster.py> [parse_alignments_for_ctgs] Parsing input alignments...
2024-06-27 09:41:54 <HapHiC_cluster.py> [output_pickle] Writing HT_link_dict to HT_links.pkl...
2024-06-27 09:41:55 <HapHiC_cluster.py> [run] Program finished in 3086.3505272865295s
2024-06-27 09:41:55 <HapHiC_pipeline.py> [haphic_reassign] Step2: Reassign and rescue contigs...
2024-06-27 09:41:55 <HapHiC_reassign.py> [run] Program started, HapHiC version: 1.0.3 (update: 2024.05.20)
2024-06-27 09:41:55 <HapHiC_reassign.py> [run] Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
2024-06-27 09:41:55 <HapHiC_cluster.py> [parse_fasta] Parsing input FASTA file...
2024-06-27 09:42:18 <HapHiC_cluster.py> [parse_gfa] Parsing input gfa file(s)...
2024-06-27 09:42:18 <HapHiC_reassign.py> [run] Program finished in 23.16846799850464s
2024-06-27 09:42:18 <HapHiC_pipeline.py> [haphic_sort] Step3: Order and orient contigs within each group...
2024-06-27 09:42:18 <HapHiC_sort.py> [run] Program started, HapHiC version: 1.0.3 (update: 2024.05.20)
2024-06-27 09:42:18 <HapHiC_sort.py> [run] Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
2024-06-27 09:42:18 <HapHiC_sort.py> [run] Checking the path of ALLHiC...
2024-06-27 09:42:18 <HapHiC_sort.py> [run] ALLHiC has been found in /mnt/01_srcs/xxx/HapHiC/scripts
2024-06-27 09:42:18 <HapHiC_sort.py> [parse_fasta] Parsing fasta file...
2024-06-27 09:42:26 <HapHiC_sort.py> [run] Loading input pickle file...
2024-06-27 09:42:26 <HapHiC_sort.py> [run] Parsing group files and clm files...
2024-06-27 09:42:28 <HapHiC_sort.py> [run] Program will be executed in multiprocessing mode (processes=2)
2024-06-27 09:42:28 <HapHiC_sort.py> [fast_sort] [group1_1461626638bp] Performing fast sorting...
2024-06-27 09:42:28 <HapHiC_sort.py> [fast_sort] [group1_1461626638bp] Checking the content of input group file...
2024-06-27 09:42:28 <HapHiC_sort.py> [fast_sort] [group1_1461626638bp] Starting fast sorting iterations...
2024-06-27 09:42:28 <HapHiC_sort.py> [fast_sort] [group2_1792589528bp] Performing fast sorting...
2024-06-27 09:42:28 <HapHiC_sort.py> [fast_sort] [group2_1792589528bp] Checking the content of input group file...
2024-06-27 09:42:28 <HapHiC_sort.py> [fast_sort] [group2_1792589528bp] Starting fast sorting iterations...
<class 'networkx.utils.decorators.argmap'> compilation 34:3: FutureWarning:

shortest_path will return an iterator that yields
(node, path) pairs instead of a dictionary when source
and target are unspecified beginning in version 3.5

To keep the current behavior, use:

    dict(nx.shortest_path(G))

<class 'networkx.utils.decorators.argmap'> compilation 34:3: FutureWarning:

shortest_path will return an iterator that yields
(node, path) pairs instead of a dictionary when source
and target are unspecified beginning in version 3.5

To keep the current behavior, use:

    dict(nx.shortest_path(G))

2024-06-27 10:00:37 <HapHiC_sort.py> [run] Program finished in 1098.1858050823212s
2024-06-27 10:00:37 <HapHiC_pipeline.py> [haphic_build] Step4: Build final scaffolds (pseudomolecules)...
2024-06-27 10:00:37 <HapHiC_build.py> [run] Program started, HapHiC version: 1.0.3 (update: 2024.05.20)
2024-06-27 10:00:37 <HapHiC_build.py> [run] Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
2024-06-27 10:00:37 <HapHiC_cluster.py> [parse_fasta] Parsing input FASTA file...
2024-06-27 10:01:01 <HapHiC_build.py> [parse_tours] Parsing tour files...
2024-06-27 10:01:01 <HapHiC_build.py> [build_final_scaffolds] Building final scaffolds...
2024-06-27 10:01:43 <HapHiC_build.py> [run] Program finished in 66.46069979667664s
2024-06-27 10:01:43 <HapHiC_pipeline.py> [main] HapHiC pipeline finished in 4274.942216873169s

The methods (commands) used for Hi-C read mapping and filtering:
bwa mem -5SP -t 60 genome.fas ${read1} ${read2} | samblaster | samtools view - -@ 20 -S -h -b -F 3340 -o HiC.bam
filter_bam HiC.bam 1 --nm 3 --threads 20 | samtools view - -b -@ 20 -o HiC.filtered.bam

The method used for genome assembly (e.g., hifiasm + Hi-C) and the assembly utilized for scaffolding (e.g., p_ctg, hap*.p_ctg or p_utg):
hifiasm -t 60 --h1 s_1.fq.gz --h2 s_2.fq.gz ./ccs.fastq

Statistics for the assembly input into HapHiC (e.g., N10-N90 and L10-L90):
#hap1:
StatType ContigLength ContigNumber
N50 14292390 32
N60 12748816 42
N70 9949659 55
N80 6898132 73
N90 3149213 103
Longest 46849406 1
Total 1461626638 985
Length>=1kb 1461626638 985
Length>=2kb 1461626638 985
Length>=5kb 1461626638 985
#hap2:
StatType ContigLength ContigNumber
N50 16020022 37
N60 12619360 50
N70 9100361 66
N80 6754290 89
N90 3293374 127
Longest 43036230 1
Total 1792589528 727
Length>=1kb 1792589528 727
Length>=2kb 1792589528 727
Length>=5kb 1792589528 727

scikit-learn error, ImportError: dlopen: cannot load any more object with static TLS

Hi,
When I run the haphic pipeline, the following error occurs:

(haphic) [cuixb@yls allhic]$ ~/tools/biosoft/HapHiC-main/haphic pipeline -h
Traceback (most recent call last):
  File "/home/cuixb/tools/biosoft/conda3/envs/haphic/lib/python3.10/site-packages/sklearn/__check_build/__init__.py", line 48, in <module>
    from ._check_build import check_build  # noqa
ImportError: dlopen: cannot load any more object with static TLS

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/raid8/cuixb/tools/biosoft/HapHiC-main/scripts/HapHiC_pipeline.py", line 17, in <module>
    import HapHiC_cluster
  File "/raid8/cuixb/tools/biosoft/HapHiC-main/scripts/HapHiC_cluster.py", line 31, in <module>
    from sklearn.preprocessing import normalize
  File "/home/cuixb/tools/biosoft/conda3/envs/haphic/lib/python3.10/site-packages/sklearn/__init__.py", line 81, in <module>
    from . import __check_build  # noqa: F401
  File "/home/cuixb/tools/biosoft/conda3/envs/haphic/lib/python3.10/site-packages/sklearn/__check_build/__init__.py", line 50, in <module>
    raise_build_error(e)
  File "/home/cuixb/tools/biosoft/conda3/envs/haphic/lib/python3.10/site-packages/sklearn/__check_build/__init__.py", line 31, in raise_build_error
    raise ImportError(
ImportError: dlopen: cannot load any more object with static TLS
___________________________________________________________________________
Contents of /home/cuixb/tools/biosoft/conda3/envs/haphic/lib/python3.10/site-packages/sklearn/__check_build:
_check_build.cpython-310-x86_64-linux-gnu.so    __init__.py    setup.py
__pycache__
___________________________________________________________________________
It seems that scikit-learn has not been built correctly.

If you have installed scikit-learn from source, please do not forget
to build the package before using it: run `python setup.py install` or
`make` in the source directory.

If you have used an installer, please check that it is suited for your
Python version, your operating system and your platform.
Traceback (most recent call last):
  File "/home/cuixb/tools/biosoft/HapHiC-main/haphic", line 96, in <module>
    subprocess.run(commands, check=True)
  File "/home/cuixb/tools/biosoft/conda3/envs/haphic/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/raid8/cuixb/tools/biosoft/HapHiC-main/scripts/HapHiC_pipeline.py', '-h']' returned non-zero exit status 1.

And I installed the HapHiC env as you recommended: conda env create -f HapHiC/conda_env/environment_py310.yml.

Important: More information allows me to grasp your issue faster

Often, I cannot fully grasp the issue solely from an error message.

To help me better understand the problem you are facing, please provide the following information as much as possible:

  1. The karyotype of the species (e.g., 2n=4x=32).
  2. The commands and parameters used in running HapHiC.
  3. The log files generated by HapHiC.
  4. The methods (commands) used for Hi-C read mapping and filtering.
  5. The method used for genome assembly (e.g., hifiasm + Hi-C) and the assembly utilized for scaffolding (e.g., p_ctg, hap*.p_ctg or p_utg)
  6. Statistics for the assembly input into HapHiC (e.g., N10-N90 and L10-L90); you can calculate them using this tool:
$ gunzip fa_detail.gz
$ chmod 755 fa_detail
$ fa_detail asm.fa

Remember, saving time is saving lives.

How to improve the haphic result for polyploid?

Hello Xiaofei,

Thank you for developing the Haphic software.

I am currently working on the assembly of a complex polyploid genome and I am trying to scaffold the “p_utg” sequences obtained from hifiasm to the haplotype-resolved level. The first issue I encountered is determining the inflation value. I have tried many different inflation values, but the clustering results never match the chromosome number of my species (which has more than 260 chromosomes).

When running HapHiC with default parameters (inflation=2.6, clusters=222), I observed that some scaffolds still show signs of heterozygosity and collapse when checked with Juicebox.
[image: hic1]
[image: hic2]

Using minimap2, I aligned these scaffolds to the genome of a closely related species and generated dot plots, which reveal obvious 4:1 and 8:1 alignment results, but also some scaffolding errors.

Could you please provide some suggestions for improving this genome assembly? Thank you.

Best regards,

Xiaoyu

Error: died with <Signals.SIGKILL: 9>

Hello Xiaofei,

I could not get through the HapHiC pipeline with this command: haphic pipeline scaffolds.ref.fa HiC.filtered.bam 20 --RE GATC,GANTC,CTNAG,TTAA --correct_nrounds 2 --remove_allelic_links 4 --threads 20 --processes 8. I encountered this error message:

Traceback (most recent call last):
  File "/home/li/bioinfo/HapHiC/haphic", line 96, in <module>
    subprocess.run(commands, check=True)
  File "/home/li/miniforge3/envs/haphic/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/home/li/bioinfo/HapHiC/scripts/HapHiC_pipeline.py', 'scaffolds.ref.fa', 'HiC.filtered.bam', '20', '--RE', 'GATC,GANTC,CTNAG,TTAA', '--correct_nrounds', '2', '--remove_allelic_links', '4', '--threads', '20', '--processes', '8']' died with <Signals.SIGKILL: 9>.

Could you please give me any suggestions?
Thank you for your time.
Zhen

The question about regenerating a HiC contact map for the final assembly

Hello developer,

After successfully running Step 1 and Step 2, I got the two files (out_JBAT.FINAL.agp and out_JBAT.FINAL.fa). I want to regenerate a .hic file corresponding to the final assembly. I'm not sure whether the input files I chose for Step 3 are correct (the original BED/BAM file, contigs.fa.fai, out_JBAT.FINAL.agp, and out_JBAT.FINAL.fa.chrom.sizes). I would be very grateful if you could provide some help! My commands are as follows.

Step 1. juicebox.sh

/software/HapHiC/scripts/../utils/juicer pre -a -q 1 -o out_JBAT samblasterHiCfilter2.bed scaffolds.raw.agp asm.fa.fai >out_JBAT.log 2>&1
(java -jar -Xmx32G /software/HapHiC/scripts/../utils/juicer_tools.1.9.9_jcuda.0.8.jar pre out_JBAT.txt out_JBAT.hic.part <(cat out_JBAT.log | grep PRE_C_SIZE | awk '{print $2" "$3}')) && (mv out_JBAT.hic.part out_JBAT.hic)

Step 2. After adjusting in Juicebox

/software/HapHiC/scripts/../utils/juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp asm.fa

Step 3. Regenerating a Hi-C contact map for the final assembly

(/software/HapHiC/scripts/../utils/juicer pre samblasterHiCfilter2.bed out_JBAT.FINAL.agp asm.fa.fai | sort -k2,2d -k6,6d -T ./ --parallel=8 -S32G | awk 'NF' > alignments_sorted.txt.part) && (mv alignments_sorted.txt.part alignments_sorted.txt)

(java -jar -Xmx32G /software/HapHiC/scripts/../utils/juicer_tools.1.9.9_jcuda.0.8.jar pre alignments_sorted.txt out.hic.part out_JBAT.FINAL.fa.chrom.sizes) && (mv out.hic.part out.hic)

I may have encountered some bugs due to the python version

Some bugs like subprocess.CalledProcessError ... returned non-zero exit status 1 may be caused by Python and package versions.

This error always occurred in the Python 3.10 environment, and I guess it was caused by the matplotlib version from pip/conda. After I redeployed the environment following the py312 method, there were no errors and everything ran successfully. I hope this can help other users.

Why is there only one cluster in the HapHiC_cluster.log

Hello Xiaofei,
Thank you for developing the HapHiC software. My HapHiC_cluster.log contains the message "<HapHiC_cluster.py> [run_mcl_clustering] The maximum number of clusters (1) is even less than the expected number of chromosomes (16). You could try higher inflation." In this case, how do I choose the "best" inflation recommendation?

Thank you

BED file support

Hi there
Is it possible to implement support for BED files along with BAM and PAIR formats?

Best regards

How to improve allotetraploid scaffolding in haphic?

Hi, I used HapHiC for an allotetraploid genome (2n=4x=44). It produced 22 groups with huge variations in chromosome sizes, as given below and as can be seen in the Juicebox plot. Could you suggest how to improve this?

group1 335147534
group2 265712237
group3 259566223
group4 256679759
group5 252275747
group6 223506148
group7 216891361
group8 204149503
group9 178470265
group10 169460411
group11 165539612
group12 151015244
group13 150193252
group14 124757396
group15 106023119
group16 103438627
group17 102100592
group18 94194491
group19 71377479
group20 51412018
group21 41958356
group22 41183465

[image: Juicebox screenshot, 2024-04-17]

How to perform reassignment correctly?

Hello, Xiaofei. I'm having some small issues with HapHiC.
My species is an autotetraploid (2n=4x=44), and I used p_utg generated by hifiasm as the input file.
When clustering, the command is as follows: haphic cluster genome.final.fasta HiC.filtered.bam 44 --threads 80
The result is as follows:
2024-04-29 09:48:41 <HapHiC_cluster.py> [recommend_inflation] You could try inflation from 1.5 (length ratio = 0.75)
Entering the recommended inflation factor 1.5 folder:
-rw-rw-r-- 1 ps ps 152 4月 29 09:48 group10_32091997bp.txt
-rw-rw-r-- 1 ps ps 368 4月 29 09:48 group11_31248394bp.txt
-rw-rw-r-- 1 ps ps 153 4月 29 09:48 group12_29837997bp.txt
-rw-rw-r-- 1 ps ps 215 4月 29 09:48 group13_29756760bp.txt
-rw-rw-r-- 1 ps ps 185 4月 29 09:48 group1_41105304bp.txt
-rw-rw-r-- 1 ps ps 214 4月 29 09:48 group14_29336476bp.txt
-rw-rw-r-- 1 ps ps 119 4月 29 09:48 group15_27944639bp.txt
-rw-rw-r-- 1 ps ps 89 4月 29 09:48 group16_26507170bp.txt
-rw-rw-r-- 1 ps ps 182 4月 29 09:48 group17_26402146bp.txt
-rw-rw-r-- 1 ps ps 243 4月 29 09:48 group18_26392657bp.txt
-rw-rw-r-- 1 ps ps 151 4月 29 09:48 group19_26370085bp.txt
-rw-rw-r-- 1 ps ps 308 4月 29 09:48 group20_26103341bp.txt
-rw-rw-r-- 1 ps ps 213 4月 29 09:48 group21_25795553bp.txt
-rw-rw-r-- 1 ps ps 152 4月 29 09:48 group22_23983702bp.txt
-rw-rw-r-- 1 ps ps 119 4月 29 09:48 group23_23911665bp.txt
-rw-rw-r-- 1 ps ps 184 4月 29 09:48 group2_39247364bp.txt
-rw-rw-r-- 1 ps ps 121 4月 29 09:48 group24_23558558bp.txt
-rw-rw-r-- 1 ps ps 150 4月 29 09:48 group25_23164941bp.txt
-rw-rw-r-- 1 ps ps 208 4月 29 09:48 group26_22711677bp.txt
-rw-rw-r-- 1 ps ps 151 4月 29 09:48 group27_22662324bp.txt
-rw-rw-r-- 1 ps ps 244 4月 29 09:48 group28_22502537bp.txt
-rw-rw-r-- 1 ps ps 120 4月 29 09:48 group29_21592425bp.txt
-rw-rw-r-- 1 ps ps 211 4月 29 09:48 group30_19931306bp.txt
-rw-rw-r-- 1 ps ps 151 4月 29 09:48 group31_18740276bp.txt
-rw-rw-r-- 1 ps ps 119 4月 29 09:48 group32_18509058bp.txt
-rw-rw-r-- 1 ps ps 180 4月 29 09:48 group33_17780926bp.txt
-rw-rw-r-- 1 ps ps 152 4月 29 09:48 group3_38952077bp.txt
-rw-rw-r-- 1 ps ps 243 4月 29 09:48 group34_17501202bp.txt
-rw-rw-r-- 1 ps ps 152 4月 29 09:48 group35_17396856bp.txt
-rw-rw-r-- 1 ps ps 208 4月 29 09:48 group36_17321984bp.txt
-rw-rw-r-- 1 ps ps 88 4月 29 09:48 group37_16929067bp.txt
-rw-rw-r-- 1 ps ps 119 4月 29 09:48 group38_16284770bp.txt
-rw-rw-r-- 1 ps ps 119 4月 29 09:48 group39_14358708bp.txt
-rw-rw-r-- 1 ps ps 88 4月 29 09:48 group40_13554138bp.txt
-rw-rw-r-- 1 ps ps 270 4月 29 09:48 group41_12117333bp.txt
-rw-rw-r-- 1 ps ps 148 4月 29 09:48 group42_9552907bp.txt
-rw-rw-r-- 1 ps ps 247 4月 29 09:48 group4_37334910bp.txt
-rw-rw-r-- 1 ps ps 88 4月 29 09:48 group43_9118406bp.txt
-rw-rw-r-- 1 ps ps 88 4月 29 09:48 group44_8370283bp.txt
-rw-rw-r-- 1 ps ps 245 4月 29 09:48 group5_36935828bp.txt
-rw-rw-r-- 1 ps ps 216 4月 29 09:48 group6_34040223bp.txt
-rw-rw-r-- 1 ps ps 216 4月 29 09:48 group7_33466005bp.txt
-rw-rw-r-- 1 ps ps 152 4月 29 09:48 group8_33337447bp.txt
-rw-rw-r-- 1 ps ps 369 4月 29 09:48 group9_32236358bp.txt
I found that the contigs have been clustered into 44 groups. However, when I performed the second step of reassignment, I found that only 39 groups remain in the final_groups, which is not consistent with the expected 44 groups. How can I solve this problem?

Question for Work with hifiasm

Hello,

I have a question regarding the use of haphic for genome assembly and processing results with hifiasm.

The genome I'm working with is diploid, and the haploid chromosome number is 11. In the command:

haphic pipeline allhaps.fa HiC.filtered.bam nchrs --gfa hap1.p_ctg.gfa,hap2.p_ctg.gfa

should I set nchrs to 11 or 22?
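As an aside, allhaps.fa can be produced from the two GFA files before running the pipeline. A sketch using the GFA-to-FASTA conversion suggested in hifiasm's documentation:

# Convert each GFA to FASTA (hifiasm's awk one-liner), then merge.
awk '/^S/{print ">"$2; print $3}' hap1.p_ctg.gfa >  allhaps.fa
awk '/^S/{print ">"$2; print $3}' hap2.p_ctg.gfa >> allhaps.fa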

Additionally, what is the difference between this approach and directly using p_utg.fa? Will I obtain haplotype-phased assemblies if I run:

haphic pipeline p_utg.fa HiC.filtered.bam nchrs --gfa p_utg.gfa?

And in this case, should nchrs be set to 11 or 22?

Thank you!

TypeError: object of type 'NoneType' has no len()

Hi Xiaofei,

When I run HapHiC, I encounter an error which doesn't seem to be an issue with the program installation. Could you please guide me on how to solve this problem? I am combining the hap1 and hap2 outputs from hifiasm and assembling them at the chromosome level using Hi-C reads. I will upload my log.
HapHiC_cluster.log
[image: error screenshot]

Best regards,
Tuo

How does HapHiC process collapsed contigs?

Hi,
Thank you for developing HapHiC; it exhibits excellent performance in polyploid scaffolding. I have two questions and hope you can provide assistance:
Q1: How does HapHiC process collapsed contigs/unitigs? In your schematic diagram, it appears that collapsed contigs are ultimately assigned to one of the groups. If possible, could HapHiC provide information about the identified collapsed contigs, so that I can handle them separately?

Q2: If my input reference genome contains two completely identical (homozygous) contigs/unitigs, how will HapHiC allocate them? Will filter_bam filter out the alignment information on these contigs?

I would appreciate your assistance. Thank you so much!

Iven

pore-C data

Hello, @zengxiaofei

Can HapHiC take Pore-C data as input? If not, what adjustments do I need to make?

Thank you in advance for your response

How to set the chromosome number if I use p_ctg.fa from hifiasm when running haphic pipeline

Thank you for your excellent work!
I wonder whether I ran the haphic pipeline correctly. I have a genome of 4 Gb (a diploid, 2n=26), and I just used the p_ctg.fa file from hifiasm and the Hi-C data. When I ran the pipeline I set nchr to 13, and I got a good result. But in the issues I found that some users used a combined hap*.p_ctg file to run the pipeline and set nchr to double that (in my case, 26). So I wonder whether I ran it correctly; I just want a haploid scaffolded genome, and the Hi-C contact map of 13 chromosomes by the way.

I would appreciate your early reply!

correct and Clustering

Hello xiaofei,
My commands are as follows

Haphic=/data/Erick_Tong/software/HapHiC-main
ln -s ../01hifiasm_hfhc/XH01_asm.hic.p_utg.fa asm.fa
ln -s ../01hifiasm_hfhc/00data/XH01_all_hic_R1.fq.gz Hic_R1.fq.gz
ln -s ../01hifiasm_hfhc/00data/XH01_all_hic_R2.fq.gz Hic_R2.fq.gz

bash /data/Erick_Tong/05Analysis_script/remind.sh "20240310-1 haphic 01 start"
## 01 Align Hi-C data to the assembly, remove PCR duplicates and filter out secondary and supplementary alignments
bwa index asm.fa
bwa mem -t 36 -5SP asm.fa Hic_R1.fq.gz Hic_R2.fq.gz | samblaster | samtools view - -@ 24 -S -h -b -F 3340 -o HiC.bam

## 02 Filter the alignments with MAPQ 1 (mapping quality ≥ 1) and NM 3 (edit distance < 3)
$Haphic/utils/filter_bam HiC.bam 1 --nm 3 --threads 36 | samtools view - -b -@ 26 -o HiC.filtered.bam

## 03 Clustering
haphic cluster --threads 36 --correct_nrounds 2 --remove_allelic_links 3 asm.fa HiC.filtered.bam 51

The resulting log file is attached:
HapHiC_cluster.log
I can't find the "best" inflation recommendation in HapHiC_cluster.log, and the maximum number of clusters is 2; however, this is a triploid organism with approximately 51 chromosomes.
I would be very grateful if you could provide some help!

Correct way to run HapHiC with Hifiasm (HiC) data

Dear Developer,

Thank you for your effort on developing HapHiC.

I have hifiasm (Hi-C) data with both hap1.p_ctg.gfa and hap2.p_ctg.gfa. I noticed that in the section Work with hifiasm (experimental) there is a recommended command.

/path/to/HapHiC/haphic pipeline allhaps.fa HiC.filtered.bam nchrs --gfa "hap1.p_ctg.gfa,hap2.p_ctg.gfa"

The species I'm working on is diploid with 33 chromosomes. I merged both hap1.p_ctg.gfa and hap2.p_ctg.gfa into a single FASTA file as allhaps.fa, and used nchrs=66 (from #1).

I just would like to confirm if it is the correct way for handling the data from hifiasm (hic). Thanks a lot for your support!

Best wishes,
Runpeng Luo

the applicability of haphic

Dear Xiaofei,

I would like to seek your advice regarding the applicability of haphic. I am currently planning to assemble the genome of a self-pollinating plant with very low heterozygosity, similar in size to the rice genome. In your experience, would you recommend using haphic for chromosome-level assembly in this case?

Additionally, if the assembly results from hifiasm are already quite close to a complete genome (around 50%-60% of the chromosomes), would you suggest first breaking the contigs with ALLHiC before using HapHiC?

Thank you for your time and guidance.

Best regards,

high number of chromosomes not accepted?

Hi there,

This seems like a great tool and I am really keen to try it! I was trying to set it up for 42 chromosomes, but it doesn't seem to accept this. My organism is 2n=6x=42.

This is my error message:
RuntimeError: Pipeline Abortion: Inflation recommendation failed. It seems that some chromosomes were grouped together, or the maximum number of clusters is even less than the expected number of chromosomes. For more details, please check out the logs.
Traceback (most recent call last):
  File "/gpfs01/home/mbars7/HapHiC/./haphic", line 96, in <module>
    subprocess.run(commands, check=True)
  File "/gpfs01/home/mbars7/miniconda3/envs/haphic/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/gpfs01/home/mbars7/HapHiC/scripts/HapHiC_pipeline.py', '/gpfs01/home/mbars7/danica_2/hetero_hifiasm_HiC_HiFi/danica_hetero_HiC_hifi.hic.p_ctg.fa', '/gpfs01/home/mbars7/danica_2/hetero_hifiasm_HiC_HiFi/filtered_name.bam', '42']' returned non-zero exit status 1.

Is there something else I should be adding to this?

Thank you for any advice!

the final number of groups was smaller than the nchrs I set.

A problem I had when using HapHiC occurred with both my tetraploid wax apple data and a simulated triploid dataset. When I used the default parameters, the final number of groups was smaller than the nchrs I set. When I modified the inflation value to the one recommended in HapHiC_cluster.log, the number of groups was still smaller than nchrs, and even when I continued to increase the inflation value, the number of groups tended to decline. However, as the inflation value increases, the number of groups in the /inflation_X folder also increases.
My commands:

bwa index ../00.data/asm.fa
bwa mem -5SP -t 32 ../00.data/asm.fa ../00.data/mockTriploid_r1.fq.gz ../00.data/mockTriploid_r2.fq.gz | samblaster | samtools view - -@ 32 -S -h -b -F 3340 -o HiC.bam
/HapHiC/utils/filter_bam HiC.bam 1 --nm 3 --threads 32 | samtools view - -b -@ 32 -o HiC.filtered.bam

/HapHiC/haphic cluster ../../00.data/asm.fa ../HiC.filtered.bam 36 --max_inflation 10.0 --remove_allelic_links 3
/HapHiC/haphic reassign ../00.data/asm.fa full_links.pkl ./inflation_5.0/mcl_inflation_5.0.clusters.txt paired_links.clm --nclusters 36

Each group only contains one contig

I'm sorry to bother you. I encountered two issues while using it.
(1) In the clustering stage, the number of contigs participating in the clustering is very low, and in the final results there is only one contig in each group.
(2) During the clustering stage, contigs are fragmented into a large number of small segments, totaling over 70,000 fragments. Ultimately, this results in over 70,000 groups, with only one fragment per group.

Parameter for output directory

Hello, can you add a parameter for the output directory? Right now, HapHiC outputs everything in the current working directory. This will lead to problems if trying to run HapHiC on several samples from the same working directory. Thanks for your assistance.

Unable to install due to Intel channel issues

Hi, Xiaofei,

I am reaching out to ask whether it would be possible to provide a container for download. I am currently working on a cluster, and I am unable to use the Intel channel to download the three required packages (intel-openmp, mkl, tbb) through the yml file installation.

I am not certain about the impact of these three missing packages on the operation of HapHiC. It seems that others, including #38, have encountered a similar issue.

Thank you for your time and support.
Best regards!
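
For readers facing the same restriction, one possible workaround, assuming only the intel channel itself is blocked: delete the intel entry from the channels list of the provided environment .yml, create the environment from the edited copy, and then install the three packages from conda-forge, which repackages these Intel libraries as well. The file and environment names below are placeholders:

conda env create -f environment_edited.yml   # copy of the provided .yml with the intel channel removed
conda activate haphic
conda install -c conda-forge intel-openmp mkl tbb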

How to define nchrs for the pipeline

Hello HapHiC developers,

I am handling a species with ~46 chromosomes. According to our previous Hi-C results, this species seems to be triploid.
However, we also found that some clusters may be diploid on the Hi-C heatmap. We are not quite sure at this moment.
So, how should I define nchrs when running the HapHiC pipeline? Should I use 46*3=138?

Regards,
Zixi

Overrepresented / duplicated contigs as input

Hi,

Thank you for this impressive tool, I am very excited to try it!

I am not sure which set of contigs to choose as input. If I understand correctly, HapHiC takes care of collapsed contigs and chimeric ones. What happens if I feed it overrepresented sequences, such as an under-purged set of contigs?
Here is a Merqury plot of the input set I will try:
[Merqury copy-number spectra plot of the merged contig set]
I used hifiasm with Hi-C phasing to get four "haplotypes". The plot represents a merged set of all four haplotype assemblies.

With purging I always lose a lot of haplotigs and diplotigs, which will be a problem in the scaffolding stage. Do you think the above set is suitable for HapHiC?

Work with hifiasm using haplotype-specific data

Hello @zengxiaofei ,
I used hifiasm with HiFi and Hi-C data and generated haplotype-specific assemblies, including p_ctg.fa, p_utg.fa, r_utg.fa, and hap*.p_ctg.fa.
As you mentioned in the tutorial, if I want to run the script below:

/path/to/HapHiC/haphic pipeline allhaps.fa HiC.filtered.bam nchrs --gfa "hap1.p_ctg.gfa,hap2.p_ctg.gfa"

1) Which fa file should I use as <allhaps.fa>: p_utg, p_ctg, or the hap*.p_ctg.fa files concatenated into one file?
According to the previous post #16, maybe I should concatenate the hap*.p_ctg.fa files into one?

2) Which fa should I use for Hi-C read mapping to generate <HiC.filtered.bam>?

3) I tested both 46 and 46*3 for scaffolding, as I mentioned in #19. I used p_utg.fa as the only input, without any hap*.gfa.
I am a beginner in genome assembly. It seems that 46 is better for resolving the triploid genome?
But the survey of my species suggests 2n=3x=2.4 Gb, so the size of the 46*3 result looks better? I am really confused.

The heatmap below was generated with the command: haphic pipeline p_utg.fa HiC.filtered.bam 138 --gfa p_utg.gfa
[Hi-C contact heatmap of the 138-group run]

The heatmap below was generated with the command: haphic pipeline p_utg.fa HiC.filtered.bam 46 --gfa p_utg.gfa
[Hi-C contact heatmap of the 46-group run]

Regards,
Zixi

Juicebox / YaHS

Hi Xiaofei,
I used the Juicebox/YaHS workflow you provided to generate the .assembly and .hic files, and finally produced out_JBAT.assembly, out_JBAT.assembly.agp, and out_JBAT.hic. However, I found that the generated out_JBAT.assembly.agp file contains only one assembly and cannot be divided into multiple assemblies, even though the input file to this step could be divided into multiple groups. In addition, when the .assembly and .hic files are loaded into Juicebox, the corresponding contig and chromosome frames are not displayed. How can I solve this problem?

Unexpected keyword argument 'affinity'

I get the following error when I try to run haphic:

2024-04-27 13:16:53 <HapHiC_reassign.py> [run] A pickle file is input meanwhile --remove_allelic_links is set, allelic Hi-C links will NOT be treated specially in the reassignment step
2024-04-27 13:16:53 <HapHiC_reassign.py> [parse_pickle] Parsing input pickle file...
2024-04-27 13:16:53 <HapHiC_reassign.py> [parse_clusters] Parsing .clusters.txt file...
2024-04-27 13:16:53 <HapHiC_reassign.py> [run] File parsing and data preparation finished in 144.6104302406311s
2024-04-27 13:16:53 <HapHiC_reassign.py> [run_reassignment] Performing reassignment...
2024-04-27 13:16:53 <HapHiC_reassign.py> [run_reassignment] [result::round1] Total: 21953, consistent: 326, rescued: 237, reassigned: 13, not rescued: 21377
2024-04-27 13:16:53 <HapHiC_reassign.py> [run_reassignment] Performing reassignment...
2024-04-27 13:16:54 <HapHiC_reassign.py> [run_reassignment] [result::round2] Total: 21953, consistent: 538, rescued: 29, reassigned: 2, not rescued: 21384
2024-04-27 13:16:54 <HapHiC_reassign.py> [run_reassignment] Performing reassignment...
2024-04-27 13:16:54 <HapHiC_reassign.py> [run_reassignment] [result::round3] Total: 21953, consistent: 563, rescued: 0, reassigned: 0, not rescued: 21390
2024-04-27 13:16:54 <HapHiC_reassign.py> [run] [result::round3] Result has converged after 2 rounds of reassignment, break
2024-04-27 13:16:54 <HapHiC_reassign.py> [run_reassignment] Performing additional round of rescue...
2024-04-27 13:16:54 <HapHiC_reassign.py> [run_reassignment] [result::additional_rescue] Total: 21953, consistent: 653, rescued: 192, reassigned: 0, not rescued: 21108
2024-04-27 13:16:54 <HapHiC_reassign.py> [run] 3 round(s) of reassignment finished in 0.277385950088501s, average 0.09246198336283366s per round
2024-04-27 13:16:54 <HapHiC_reassign.py> [agglomerative_hierarchical_clustering] Performing additional agglomerative hierarchical clustering...
Traceback (most recent call last):
  File "HapHiC_pipeline.py", line 517, in <module>
    main()
  File "HapHiC/scripts/HapHiC_pipeline.py", line 502, in main
    haphic_reassign(args)
  File "HapHiC/scripts/HapHiC_pipeline.py", line 406, in haphic_reassign
    HapHiC_reassign.run(args, log_file=LOG_FILE)
  File "HapHiC/scripts/HapHiC_reassign.py", line 877, in run
    hc_cluster_dict = agglomerative_hierarchical_clustering(
  File "HapHiC/scripts/HapHiC_reassign.py", line 535, in agglomerative_hierarchical_clustering
    clust = AgglomerativeClustering(n_clusters=nclusters, affinity="precomputed", linkage="average", distance_threshold=None)
TypeError: AgglomerativeClustering.__init__() got an unexpected keyword argument 'affinity'
Traceback (most recent call last):
  File "HapHiC/haphic", line 96, in <module>
    subprocess.run(commands, check=True)
  File "/private/home/kkyriaki/micromamba/envs/haphic/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,

Probably affinity="precomputed" in this line should be changed to metric="precomputed" inside the HapHiC_reassign.py script:

clust = AgglomerativeClustering(n_clusters=nclusters, affinity="precomputed", linkage="average", distance_threshold=None)
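
This matches the scikit-learn 1.4 API change, in which the affinity keyword of AgglomerativeClustering was removed in favor of metric (it had been deprecated since 1.2). A minimal sketch of patching the file in place, with the path below a placeholder in the style of the traceback; pinning scikit-learn below 1.4 in the environment should also avoid the error:

sed -i 's/affinity="precomputed"/metric="precomputed"/' \
    /path/to/HapHiC/scripts/HapHiC_reassign.py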

ModuleNotFoundError: No module named 'pysam'

haphic pipeline --help
Traceback (most recent call last):
  File "/share/home/stu_chenhuiru/software/HapHiC-main/scripts/HapHiC_pipeline.py", line 17, in <module>
    import HapHiC_cluster
  File "/share/home/stu_chenhuiru/software/HapHiC-main/scripts/HapHiC_cluster.py", line 15, in <module>
    import pysam
ModuleNotFoundError: No module named 'pysam'
Traceback (most recent call last):
  File "/share/home/stu_chenhuiru/software/HapHiC-main/haphic", line 96, in <module>
    subprocess.run(commands, check=True)
  File "/share/home/stu_chenhuiru/miniconda3/lib/python3.11/subprocess.py", line 569, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/share/home/stu_chenhuiru/software/HapHiC-main/scripts/HapHiC_pipeline.py', '--help']' returned non-zero exit status 1.
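
pysam is one of the Python dependencies that the HapHiC scripts import, so the interpreter running them simply cannot find the module. Installing pysam into that same Python environment should clear the error; two common routes, either of which should work:

pip install pysam
# or, inside a conda environment:
conda install -c bioconda pysam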

Absence of correlation between Hi-C contacts and the generated assembly file for manual curation

Dear all,

Thank you for the nice tool. I am trying to scaffold a very complex polyploid genome (expected to be hexaploid, with around 66 chromosomes). Despite the challenging genome, HapHiC can group the expected number of scaffolds based on the Hi-C map; however, the generated agp and assembly files do not seem to match the Hi-C contacts observed. Now I am not sure whether there are a lot of misassemblies that need to be corrected or whether something is wrong with the HapHiC outputs. Please see the attached files:
2024.01.11.13.24.49.HiCImage.pdf
2024.01.11.13.23.45.HiCImage.pdf
2024.01.11.13.24.23.HiCImage.pdf
(A fourth attachment, 2024.01.11.13.26.22.HiCImage.pdf, did not finish uploading.)
