lh3 / pangene

Constructing a pangenome gene graph

pangene's Introduction

Getting Started

# Check prebuilt graphs at https://pangene.bioinweb.org

# Install pangene
git clone https://github.com/lh3/pangene
cd pangene && make

# Alternatively, download the precompiled binaries for arm-mac and x64-linux
curl -L https://github.com/lh3/pangene/releases/download/v1.1/pangene-1.1-bin.tar.bz2|tar jxf -

# The C4 example with provided alignment
./pangene test/C4/*.paf.gz > C4.gfa        # generate the graph
k8 pangene.js call C4.gfa > C4.bubble.txt  # identify bubbles

# Deploy the GFA server on the C4 example; require pangene-1.1-bin
cd pangene-1.1-bin                         # run gfa-server in this directory
bin_arm64-mac/gfa-server /path/to/C4.gfa   # or use bin_linux-x64 on x64-linux
# open "http://127.0.0.1:8000/view?gene=C4A,C4B" in a web browser

# Deploy the GFA server on the human graph
bin_arm64-mac/gfa-server -d html data/*.gfa.gz 2> server.log &
# open "http://127.0.0.1:8000" in a web browser

# Align proteins to each genome (general use cases; no example data)
miniprot --outs=0.97 --no-cs -Iut16 genome1.fna proteins.faa > genome1.paf
miniprot --outs=0.97 --no-cs -Iut16 genome2.fna proteins.faa > genome2.paf

# Construct a pangene graph
pangene genome1.paf genome2.paf > graph.gfa

# Check manpage
man ./pangene.1

Getting Started
Introduction
Graph Construction
Graph Visualization
Citation
Limitations

Introduction

Pangene is a command-line tool to construct a pangenome gene graph. In this graph, a node represents a marker gene and an edge between two genes indicates their genomic adjacency on input genomes. Pangene takes the miniprot alignment between a protein set and multiple genomes and produces a graph in the GFA format. It attempts to reduce the redundancy in the input proteins and filter spurious alignments while preserving close but non-identical paralogs. The output graph can be visualized in generic GFA viewers such as BandageNG or via a web interface. Users can explore local human subgraphs at a public server. Prebuilt pangene graphs can be found at DOI:10.5281/zenodo.8118576.
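The node/edge idea above can be illustrated with a toy sketch. This is not pangene's algorithm: pangene works from miniprot alignments and tracks gene orientation, both of which are ignored here. The function name `gene_adjacency` and the input layout (one line per genome: a genome name followed by its genes in genomic order) are assumptions made for illustration only.

```shell
# Toy illustration only (hypothetical helper, not part of pangene).
# Input file: one line per genome, "genome_name gene1 gene2 ..." in genomic order.
# Output: "geneA<TAB>geneB<TAB>number_of_genomes_supporting_the_adjacency"
gene_adjacency() {
    awk '{
        for (i = 2; i < NF; ++i) {
            a = $i; b = $(i + 1)
            key = (a < b) ? a "\t" b : b "\t" a   # treat the edge as undirected
            # count each genome at most once per edge
            if (!((key, $1) in seen)) { seen[key, $1] = 1; ++support[key] }
        }
    }
    END { for (key in support) print key "\t" support[key] }' "$1" | sort
}
```

Calling `gene_adjacency orders.txt` on such a file prints one line per adjacency together with the number of genomes in which the two genes are neighbors.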

Graph Construction

Pangene takes a list of protein-to-genome alignments as input. To generate these alignments, you need to align the same set of proteins to multiple genomes. Choosing the protein set can be tricky.

Preparing a protein set

For constructing a human pangene graph, the simplest choice is to use annotated genes in GRCh38. It is highly recommended to name a protein sequence like RGPD6:ENSP00000512633.1 where RGPD6 is the gene name and ENSP00000512633.1 is the unique protein identifier. In the output GFA, nodes are named after genes, so you would want to use human-readable gene names for visualization later. You may use the following command line to extract protein sequences from Ensembl or GenCode annotation:

k8 pangene.js getaa gene-anno.gtf protein-seq.faa > proteins.faa
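Before aligning, it may help to confirm that the protein names follow the GENE:PROTEIN_ID convention described above. The following is a minimal sketch; `check_protein_names` is a hypothetical helper, not part of pangene, and it assumes that neither the gene name nor the protein identifier contains a colon.

```shell
# Hypothetical helper (not part of pangene): print FASTA sequence names that
# do not look like GENE:PROTEIN_ID, e.g. RGPD6:ENSP00000512633.1.
check_protein_names() {
    awk '/^>/ {
        name = substr($1, 2)            # first word of the header, without ">"
        if (name !~ /^[^:]+:[^:]+$/)    # expect exactly one ":" with text on both sides
            print name
    }' "$1"
}
```

Running `check_protein_names proteins.faa` prints nothing when every header matches the convention.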

With pangene, different isoforms or diverged alleles of the same gene can be present in the protein set, though in practice, we find that selecting the canonical isoform per gene tends to give a cleaner graph, possibly due to annotation errors among rare isoforms. For the GenCode annotation, use getaa -c to extract canonical isoforms only.

For a new species without good gene annotation, you may use protein annotations from a closely related species. You may pool proteins from multiple closely related species as well. Pangene aims to work with such input, but this use case has not been carefully evaluated. Given a bacterial pangenome of the same species, you may predict genes with existing tools, cluster them with CD-HIT or MMseqs2 and feed the representative protein in each cluster to pangene.

Aligning proteins to genomes

Pangene currently only works with miniprot's PAF output. We usually use the following command line:

miniprot --outs=0.97 --no-cs -Iut16 genomeX.fna proteins.faa > genomeX.paf

For bacterial data, add -S to disable splicing.
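With many genomes, the same miniprot command is repeated once per assembly. The sketch below only prints the commands (a dry run), using the options shown above; `align_all` is a hypothetical helper, and the `.fna` suffix and the output naming are assumptions.

```shell
# Hypothetical helper: print one miniprot command per genome (dry run).
# Usage: align_all proteins.faa genome1.fna genome2.fna ...
# Assumes paths without spaces; output is named after the genome file.
align_all() {
    proteins=$1
    shift
    for genome in "$@"; do
        base=$(basename "$genome" .fna)   # genome1.fna -> genome1
        echo "miniprot --outs=0.97 --no-cs -Iut16 $genome $proteins > $base.paf"
    done
}
```

After inspecting the printed commands, they can be executed by piping the output to a shell, for example `align_all proteins.faa *.fna | sh`.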

Constructing a pangene graph

The following command line constructs a pangene graph:

pangene *.paf > graph.gfa

If the output graph is cluttered in the Bandage viewer, you may add option -a2 to filter out edges supported by a single genome. By default, pangene filters out genes occurring in less than 5% of the genomes after deredundancy. If you want to retain low-frequency genes, add -p0 to disable the filter.

Analyzing a graph

The GFA file is the master output. You can extract various information from this file. You may find local gene-level variations with

k8 pangene.js call graph.gfa > bubble.txt

or get the presence/absence of each gene with

k8 pangene.js gfa2matrix graph.gfa > gene_presence_absence.Rtab
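The presence/absence table can then be summarized per gene. The sketch below assumes a Roary-style Rtab layout, which is an assumption and should be checked against the actual file: a tab-delimited header row (gene column followed by one column per genome), then one row per gene with a number per genome, where a value above zero means the gene is present. `gene_frequency` is a hypothetical helper.

```shell
# Hypothetical helper: for each gene, print the number of genomes carrying it
# and the total number of genomes.
# Assumes a tab-delimited table: header row first, gene name in column 1,
# one numeric column per genome (value > 0 means present).
gene_frequency() {
    awk -F'\t' 'NR > 1 {
        n = 0
        for (i = 2; i <= NF; ++i)
            if ($i > 0) ++n
        printf "%s\t%d\t%d\n", $1, n, NF - 1
    }' "$1"
}
```

For example, `gene_frequency gene_presence_absence.Rtab | awk '$2 == $3'` would list genes present in every genome under these assumptions.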

Graph Visualization

You can look at the entire graph in the Bandage GFA viewer. If you are interested in a particular gene, it is best to set up gfa-server, which is part of gfatools. Here is a public server for human genes. You can deploy this server on your machine with

curl -L https://github.com/lh3/pangene/releases/download/v1.1/pangene-1.1-bin.tar.bz2|tar jxf -
cd pangene-1.1-bin
bin_arm64-mac/gfa-server -d html data/*.gfa.gz 2> server.log # for Mac

Then you can open http://127.0.0.1:8000/ in your browser, type gene names and visualize a local subgraph around the desired genes.

Citation

If you use pangene in your work, please consider citing:

H Li, M Marin, MR Farhat (2024) Exploring gene content with pangenome gene graphs, arXiv:2402.16185

Limitations

Pangene only works with miniprot's PAF output.
In the output graph, arcs on W-lines may be absent from L-lines.
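To see how the second limitation affects a particular graph, one can list the arcs implied by the walks that have no matching link. This is a minimal sketch, assuming standard GFA 1.1 columns (the walk in column 7 of W-lines; from, orientation, to, orientation in columns 2 to 5 of L-lines); `missing_walk_arcs` is a hypothetical helper, not part of pangene.

```shell
# Hypothetical helper: print arcs that appear in W-line walks but not in
# L-lines. Each output line is "from<orient><TAB>to<orient>", e.g. "B+<TAB>C+".
missing_walk_arcs() {
    awk -F'\t' '
    function flip(o) { return o == "+" ? "-" : "+" }
    $1 == "L" {
        has[$2 $3 "," $4 $5] = 1                # arc as written
        has[$4 flip($5) "," $2 flip($3)] = 1    # its reverse complement
    }
    $1 == "W" { walk[++nw] = $7 }               # keep walks until all L-lines are read
    END {
        for (k = 1; k <= nw; ++k) {
            s = walk[k]; prev = ""
            while (match(s, /[<>][^<>]+/)) {    # next step, e.g. ">C4A" or "<C4B"
                step = substr(s, RSTART, RLENGTH)
                cur = substr(step, 2) (substr(step, 1, 1) == ">" ? "+" : "-")
                key = prev "," cur
                if (prev != "" && !(key in has) && !(key in seen)) {
                    seen[key] = 1               # report each missing arc once
                    print prev "\t" cur
                }
                prev = cur
                s = substr(s, RSTART + RLENGTH)
            }
        }
    }' "$1"
}
```

Running `missing_walk_arcs graph.gfa` on an uncompressed GFA prints nothing if every walk arc has a corresponding L-line.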

pangene's Issues

Complex tangles on gene graphs

Hi, @lh3

I ran pangene on an Arabidopsis pangenome to identify complex gene clusters. But I couldn't really understand why the overall graph is a hairball (this is also the case for the HPRC graph on Zenodo). In some cases, it seems everything is connected together.
Do you have any suggestions for this? It could be annotation errors (misannotated genes or transposable elements). I will manually check those nodes and add some filtering steps before building gene graphs.

But the human and Arabidopsis annotations should be good enough. Would it be possible for pangene to output those "hotspots" for further validation?

Self-mapping to reduce the paralog or TE influence on the local subgraph

Hi, @lh3

I saw some weird subgraphs in A. thaliana with hundreds of long-read genomes. AT1G31270 looks normal in the reference annotation, but a new annotation shows that it should be an LTR fragment, and it lacks support from dozens of long-read RNA-seq datasets. As a result, the mapping position of this gene varies between genomes (different chromosomes, very distant locations). Synteny in A. thaliana should be quite good, without too many translocations. Is there any way to filter out these unreliable mappings?

For species with recent or ancient duplications (fish, flowering plants), miniprot may map these paralogs to different positions. Can we use self-mapping to filter these genes first? When I map the reference annotation to the reference genome with miniprot, 344 genes do not map back to their original chromosomes, and these genes break the local subgraph structure.

If you want, I can share the graph and the list of weird genes with you.

(Attached images: subgraph1, subgraph2)

Aligner effect for local graph inference

Hi, @lh3

Does the aligner affect pangene inference, or does pangene have some internal control? I am trying to infer the duplication number of non-reference individuals with A. thaliana HiFi assemblies. I use miniprot, liftoff and minimap2 -x splice for mapping CDS or protein sequences. Here are the dot plots. Since the overall synteny is quite good in A. thaliana, miniprot looks like it has more false hits. How does this affect the final graph? Could we also use minimap2 -x splice for CDS mapping to build the graph?

miniprot --outc=0.9 --outs=0.9 --no-cs -Iu
minimap2 -x splice 
liftoff -p {threads} -sc 0.9 -copies -g {input.gff} -u {output.umap} -o {params.gff} -dir {params.tmpdir} -cds -polish

Certain questions about pangene

Hello, Prof. Li! I have carefully read the pangene preprint and the documentation, but I still have some questions I would like to ask you.

Can pangene handle species that have experienced chromosomal breakage and fusion, such as in plants where the fusion of a genus's species is very common? In this case, can pangene provide the range of breakpoints, or will it simply be lost? As I know, in the cactus-minigraph workflow based on genome alignment, this part will be directly removed.
For species that have undergone different whole-genome duplication events, is there a corresponding method to construct the gene set of this clade? If so, this would be a very nice idea for understanding the gene loss and gain in species.

Something wrong when I use pangene's code

Hello Professor Li, I have some problems when running with the code you presented. The problems are as follows:

$ k8 ~/software/pangene-1.1/pangene.js getaa 00.data/genomic.gtf 00.data/protein.faa > 01.pro_align/proteins.faa
/storage/public/home/2022060223/software/pangene-1.1/pangene.js:9: SyntaxError: Unexpected identifier
for (let j = i; j < this.length - 1; ++j)
^
SyntaxError: Unexpected identifier

Is there a problem with the file pangene.js? To check whether the k8 version was the problem, I ran the script with node, and the results are as follows:

$ node ~/software/pangene-1.1/pangene.js getaa 00.data/genomic.gtf 00.data/protein.faa > 01.pro_align/proteins.faa
/storage/public/home/2022060223/software/pangene-1.1/pangene.js:1349
var cmd = args.shift();
^

TypeError: args.shift is not a function
at main (/storage/public/home/2022060223/software/pangene-1.1/pangene.js:1349:17)
at Object.<anonymous> (/storage/public/home/2022060223/software/pangene-1.1/pangene.js:1363:1)
at Module._compile (node:internal/modules/cjs/loader:1369:14)
at Module._extensions..js (node:internal/modules/cjs/loader:1427:10)
at Module.load (node:internal/modules/cjs/loader:1206:32)
at Module._load (node:internal/modules/cjs/loader:1022:12)
at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:135:12)
at node:internal/main/run_main_module:28:49

Node.js v20.12.2

I'm not sure why, but it looks like there's something wrong with the pangene-1.1 script, but I'm not sure.

Hope to get your reply, thank you!

Empty pangraph.gfa output

Hi Li,
Thanks for this awesome and super fast tool.
I ran pangene on my 50 plant genomes (allotetraploid genomes) and could create the .paf files.
However, the output pangraph.gfa is empty.

Any help will be highly appreciated,

---summary of my run.... check the bold letters

[M::pg_post_process::30.932*0.91] genome 49: 8490 pseudo, 149337 shadow
**[M::pg_post_process::30.996*0.91] genome 50: 7639 pseudo, 111360 shadow**
[M::pg_gen_vtx::36.318*0.92] selected 0 vertices out of 7646947 genes
[M::pg_graph_gen::38.641*0.93] round-1 graph: 0 genes and 0 arcs
[M::pg_graph_flt_high_occ::38.641*0.93] 0 high-occurrence segments
[M::pg_graph_flt_high_occ::38.641*0.93] 0 high-degree segments additionally
[M::pg_graph_gen::40.906*0.93] round-2 graph: 0 genes and 0 arcs
[M::pg_mark_branch_flt_arc::40.906*0.93] marked 0 diverged branches
[M::pg_mark_branch_flt_hit::42.786*0.93] marked 0 diverged hits
[M::pg_mark_branch_flt_arc::44.821*0.94] marked 0 diverged branches
[M::pg_mark_branch_flt_hit::46.693*0.94] marked 0 diverged hits
[M::pg_mark_branch_flt_arc::48.723*0.94] marked 0 diverged branches
[M::pg_mark_branch_flt_hit::50.593*0.94] marked 0 diverged hits
[M::pg_graph_cut_low_arc::52.621*0.95] filtered 0 low-occurrence arcs
[M::pg_graph_gen::52.621*0.95] round-3 graph: 0 genes and 0 arcs
[M::main] Version: 0.0-r87-dirty

Below is my command.

`pangene -a2 NAM00.paf NAM01_pan.paf NAM04_pan.paf NAM05_pan.paf NAM08_pan.paf NAM10_pan.paf NAM12_pan.paf NAM13_pan.paf NAM14_pan.paf NAM15_pan.paf NAM17_pan.paf NAM23_pan.paf NAM25_pan.paf NAM26_pan.paf NAM28_pan.paf NAM29_pan.paf NAM30_pan.paf NAM31_pan.paf NAM32_pan.paf NAM33_pan.paf NAM34_pan.paf NAM36_pan.paf NAM37_pan.paf NAM38_pan.paf NAM39_pan.paf NAM40_pan.paf NAM42_pan.paf NAM43_pan.paf NAM45_pan.paf NAM46_pan.paf NAM47_pan.paf NAM51_pan.paf NAM53_pan.paf NAM56_pan.paf NAM57_pan.paf NAM65_pan.paf NAM66_pan.paf NAM68_pan.paf NAM71_pan.paf NAM72_pan.paf NAM73_pan.paf NAM75_pan.paf NAM76_pan.paf NAM78_pan.paf NAM79_pan.paf NAM82_pan.paf NAM83_pan.paf NAM85_pan.paf NAM86_pan.paf NAM87_pan.paf NAM88_pan.paf

[M::main] Real time: 58.202 sec; CPU: 55.320 sec; Peak RSS: 4.586 GB`

Thanks
sam

More general graph building

Hi, @lh3
Proteins are conserved across the tree of life, but they may not provide enough synteny in complex repeat regions or complex genomes. Would it be possible to use other pre-aligned markers (e.g. panTE annotation from RepeatMasker, centromere annotation from SRF)? It may be out of the scope of pangene, but if it takes PAF and markers to build a graph, maybe we can also feed it other pre-aligned markers.
