Alignment mode for RRBS about gembs HOT 10 OPEN

heathsc commented on August 13, 2024

Alignment mode for RRBS

from gembs.

Comments (10)

heathsc commented on August 13, 2024

Hi Bekir, We would be quite happy to add this, but I need a bit more information. How would you propose that the prioritization should work - through resolving multimaps ? With RRBS do we expect each read fragment to start strictly at a restriction site? Cheers, Simon

…

On 28 Nov 2018, at 09:50, berguner ***@***.***> wrote: Hello, RRBS protocol relies on digestion based enrichment using certain restriction enzymes such as Taq1 and MspI. These enzymes recognize and cut at certain DNA patterns (T-CGA, C-CGG) so we expect most of the reads to be mapped on corresponding genomic regions. Is it possible to implement a feature in the gem-mapper to prioritize these regions? I think it is crucial for gemBS to have such feature to support the claim that it is supporting RRBS analysis. Best, Bekir — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#40>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADHPd7HjwzAVAIPwgf3yExkkuX7vq07oks5uzk5lgaJpZM4Y3H-q>.

from gembs.

berguner commented on August 13, 2024

Hi Simon,

I am not really an expert on this topic but I will try to explain a bit more. We have been using BSMAP which basically restricts the genome into the regions starting with restriction patterns. Therefore it does not map outside these regions and assigns the reads which were not enriched as unmapped. This is not really the ideal solution because it does not report mappings outside of restrictions sites and also does not report whether reads are multimaps or not. However it has better performance because it only searches a fraction of the genome and we can evaluate how well had the enrichment worked based on the amount of reads aligned onto the restriction sites.

In my opinion a good solution should be able to:
1- Differentiate reads starting with the restriction patterns and map them accordingly,
2- Also map un-enriched (non-restricted) reads,
3- Correctly score/resolve multimaps,
4- Report enrichment rates based on the amount of reads conforming to enrichment criteria (starting with restriction pattern and also mapping onto the restriction sites of the genome).

We can imagine the reads starting with the restriction pattern (MspI:CGG/TGG, Taq1:CGA/TGA) are carrying an extra base at the beginning (MspI:C-CGG/C-TGG, Taq1:T-CGA/T-TGA) and map them together with this imaginary base. These patterns should be given as program arguments. I believe adding this imaginary base should help prioritize the restriction sites as correct mapping regions and also should not disrupt the resolving of multimaps.

In our facility we primarily use MspI so most of the reads carry CGG/TGG pattern at their start position. In a good sample where the source DNA was not degraded more than %95 of the reads carry this pattern, so we can assume that they come from digested DNA fragments. You can see this on the FastQC per-base sequence content plot:

I hope this helps a little.

Best, Bekir

from gembs.

heathsc commented on August 13, 2024

Hi Bekir, OK, I have some ideas on this and I will try them out with a few RRBS datasets. We can clearly add mapping statistics counting the number of reads mappings to a restriction site and those mapping elsewhere. We can also modify the MAPQ calculation so that if a read map equally well to two locations, one of which starts with the restriction site, the location with the restriction site would be output as the primary mapping with a MAPQ of somewhere around 10-15. This could be justified if we say that we expect ~95% of reads to be in the enriched fraction. It’s not clear exactly what BSMAP is doing. The documentation is sparse and from the code their way of handling the RRBS appears to be very adhoc so I wouldn’t want to follow their example. When you say that BSMAP has better performance do you mean that it finds more uniquely mapping reads? Simon

…

On 28 Nov 2018, at 11:34, berguner ***@***.***> wrote: Hi Simon, I am not really an expert on this topic but I will try to explain a bit more. We have been using BSMAP which basically restricts the genome into the regions starting with restriction patterns. Therefore it does not map outside these regions and assigns the reads which were not enriched as unmapped. This is not really the ideal solution because it does not report mappings outside of restrictions sites and also does not report whether reads are multimaps or not. However it has better performance because it only searches a fraction of the genome and we can evaluate how well did the enrichment worked based on the amount of reads aligned onto the restriction sites. In my opinion a good solution should be able to: 1- Differentiate reads starting with the restriction patterns and map them accordingly, 2- Also map un-enriched (non-restricted) reads, 3- Correctly score multimaps, 4- Report enrichment rates based on the amount of reads conforming to enrichment criteria (starts with restriction pattern and also maps onto the genome the restriction pattern). We can imagine the reads starting with the restriction pattern (MspI:CGG/TGG, Taq1:CGA/TGA) are carrying an extra base at the beginning (MspI:C-CGG/C-TGG, Taq1:T-CGA/T-TGA) and map them together with this imaginary base. These patterns should be given as program arguments. I believe adding this imaginary base should help prioritize the restriction sites as correct mapping regions and also should not disrupt the resolving of multimaps. In our facility we primarily use MspI so most of the reads carry CGG/TGG pattern at their start position. In a good sample where the source DNA was not degraded more than %95 of the reads carry this pattern, so we can assume that they come from digested DNA fragments. You can see this on the FastQC per-base sequence content plot: <https://user-images.githubusercontent.com/23213155/49142710-bc9d3d00-f2f9-11e8-8b93-d8aa49a64b09.png> I hope this helps a little. Best, Bekir — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#40 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADHPdxnQFx3YpdGFlPQnNI58XeaoUlFHks5uzmaagaJpZM4Y3H-q>.

from gembs.

berguner commented on August 13, 2024

Hello Simon,

Thank you for looking into this, I think your solution will solve most of our problems.

BSMAP requires less CPU time and less memory when it is run in digestion/restriction mode in comparison to running in whole genome mode. In my experience, gem3-mapper was faster than BSMAP even if it was running in digestion mode. gem3-mapper requires more memory for searching whole genome, however this is not really a problem in our computing infrastructure. I would much prefer whole genome mapping capability rather than smaller memory footprint. So, I am not really concerned about the performance of gemBS.

Best,
Bekir

from gembs.

berguner commented on August 13, 2024

Hello Simon,

Did you get the chance to implement this feature?

from gembs.

heathsc commented on August 13, 2024

It is one of the projects I am working on at the moment. I will reply to this thread when I have something to test. Simon

…

On 27 Mar 2019, at 10:02, berguner ***@***.***> wrote: Hello Simon, Did you get the chance to implement this feature? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#40 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADHPd6QCFW01Jib3rme2l4R4X2r9DTy5ks5vazOjgaJpZM4Y3H-q>.

from gembs.

berguner commented on August 13, 2024

That's great! If you need more data for testing I can always provide.

Best,
Bekir

from gembs.

heathsc commented on August 13, 2024

If you have some testing datasets that would be great. It need not be an entire dataset - just 10 millions reads or so - and ideally one single end and one paired end. Simon

…

On 27 Mar 2019, at 10:07, berguner ***@***.***> wrote: That's great! If you need more data for testing I can always provide. Best, Bekir — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#40 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADHPd3ZeTAJMN3imaP8C-QQnlVEPnfrZks5vazTLgaJpZM4Y3H-q>.

from gembs.

berguner commented on August 13, 2024

You can use the data here:
test_data
Paired reads are 75bp and the single end reads are 50bp. The digestion enzyme is MspI so the digestion pattern is C-CGG.

Best, Bekir

from gembs.

berguner commented on August 13, 2024

Hi,

Is there any progress for this feature request? There are some large RRBS projects coming up and I would love to ditch our old pipeline and switch to gemBS for good.

Thanks again for looking into this,
Cheers,
Bekir

from gembs.

Alignment mode for RRBS about gembs HOT 10 OPEN

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs