GithubHelp home page GithubHelp logo

gark's Introduction

GARK

Check genome assembly repeat content using kmers

It is now common to compare genome assembly and read kmer content to verify assembly completness with kat comp (fig 1. of https://www.biorxiv.org/content/10.1101/064733v2.full.pdf). This gives a quick overview of missing kmers in the homozygote or heterozygote sections of the genome.

Hereunder we present an extension of this analysis for repeated parts of the genome. The idea emerged while comparing ONT and PacBio HiFi assemblies. HiFi assemblies were larger than ONT assemblies with close to the same busco score. HiFi assembly lengths were also closer to the expected genome size infered from kmer content using genomescope2 (http://qb.cshl.edu/genomescope/genomescope2.0/).

kat comp compares kmer counts in the genome versus in the reads. Zero counts correspond to kmers missing in the genome but present in the reads. In a haploid genome assembly, counts of 1 correspond to homozygous or heterozogous sections. In a diploid genome assembly, kmer counts of 1 correspond to heterozygous sections and counts of 2 to homozygous sections. Counts above 2 correspond to section which are several times in the genome and can arise due to polyploidy or non resolved ancient duplications or recent duplications or local duplications.

For repeats, the idea is first to threshold genome kmers. We could for example decide that we call a "repeated kmer" a kmer present at least 50 times in the genome. Then in the reads this kmer should be present 50 * coverage (homozygous coverage peak of genomescope2) times. If this is the case the ratio (read kmer count)/(genome kmer count) should be equal to the coverage for all repeated kmers. Kmers having a ratio far from the expected coverage show repeat collapse or expension events in the genome compared to the reads.

The procedure chains seven steps :

  • setting the genome kmer count threshold
  • calculating the read kmer count threshold
  • creating a kmer database from the genome with kmc using the thresold
  • creating a kmer database from the reads with kmc using the thresold
  • comparing both databases and creating an instersection database
  • extraction the counts from genome and reads for intersecting kmers and calculating the ratio
  • ploting the ratio histogram

We are going to compare two genome assemblies performed with ONT and HiFi reads with the HiFi reads.

  • ont_genome.fasta is the ONT genome fasta file
  • hifi_genome.fasta is the HiFi genome fasta file
  • @hifi.list contains the urls of the HiFi read fasta files

Here are the corresponding command lines

mkdir tmp
# -ci2000 corresponds to 20 time de repeat threshold valuer, here you should chose a value close to coverage * repead threshold
kmc -v -k21 -m60 -t4 -ci2000 -cs1000001 @hifi.list  hifi_read.k21 tmp
# -ci2000 is the repeat threshold 
kmc -v -k21 -m60 -t4 -ci100 -cs1000001 -fm ont_genome.fasta ont_genome.k21 tmp
kmc -v -k21 -m60 -t4 -ci100 -cs1000001 -fm hifi_genome.fasta hifi_genome.k21 tmp
kmc_tools simple ont_genome.k21 hifi_read.k21 intersect common_hifi_read_ont_genome.k21 -ocleft
kmc_tools simple hifi_genome.k21 hifi_read.k21 intersect common_hifi_read_hifi_genome.k21 -ocleft
kmc_tools transform common_hifi_read_ont_genome.k21 dump common_hifi_read_ont_genome.k21.txt
kmc_tools transform common_hifi_read_hifi_genome.k21 dump common_hifi_read_hifi_genome.k21.txt
kmc_tools transform hifi_read.k21 dump hifi_read.k21.txt

sort -k1,1 common_hifi_read_ont_genome.k21.txt > common_hifi_read_ont_genome.k21.txt.sort
sort -k1,1 common_hifi_read_hifi_genome.k21.txt > common_hifi_read_hifi_genome.k21.txt.sort
sort -k1,1 hifi_read.k21.txt > hifi_read.k21.txt.sort

join -1 1 -2 1 common_hifi_read_ont_genome.k21.txt.sort hifi_read.k21.txt.sort | awk '{print $1"\t"$2"\t"$3"\t"($3/$2)"\t"(int(log($3/$2)/log(10)));}' > common_hifi_read_ont_genome.k21.txt.sort.ratio
join -1 1 -2 1 common_hifi_read_hifi_genome.k21.txt.sort hifi_read.k21.txt.sort | awk '{print $1"\t"$2"\t"$3"\t"($3/$2)"\t"(int(log($3/$2)/log(10)));}' > common_hifi_read_hifi_genome.k21.txt.sort.ratio

Once you have the ratio files you can load them into R and produce the plots.

library(ggplot2)
hifi <- read.table("common_hifi_read_hifi_genome.k21.txt.sort.ratio")
ont <- read.table("common_hifi_read_ont_genome.k21.txt.sort.ratio")
names(hifi) <- c("kmer","genome","reads","ratio","class")
names(ont) <- c("kmer","genome","reads","ratio","class")
hifi$origin <- "hifi"
ont$origin <- "ont"
all <- rbind(hifi,ont)
ggplot(all[all$origin == "ont" & all$reads != 1000001,] , aes(ratio, fill = origin)) + geom_histogram(binwidth =1) 
ggplot(all[all$origin == "hifi" & all$reads != 1000001,] , aes(ratio, fill = origin)) + geom_histogram(binwidth =1)

If the genome contains all the repeated kmers of the reads the histogram should look like :

HiFi image

This is the ONT histogram :

ONT image

The ONT histogram shows kmers having ratios out of the awaited coverage. Right hand kmers are more present in the reads than in the genome (repeat compression) and left hand kmers are more present in the genome than in the reads (repeat expension).

ONT repeated kmers are less abundant than HiFi repeated kmers hinting at kmers which are not seen as repeated in ONT because they are very lowly present in the genome.

gark's People

Contributors

chklopp avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.