soedinglab / bammmotif2

Bayesian Markov Model motif discovery tool version 2 - An expectation maximization algorithm for the de novo discovery of enriched motifs as modelled by higher-order Markov models.

Home Page: https://bammmotif.mpibpc.mpg.de/

License: GNU General Public License v3.0

CMake 1.11% R 28.86% C++ 63.51% Shell 0.10% Python 6.22% C 0.19%
motif-discovery motif-analysis chip-seq bioinformatics ngs-analysis

bammmotif2's People

Contributors

anjakiesel, matthias-siebert, meiermark, mwess, wge11

bammmotif2's Issues

bug in joint probability calculation

There are two issues with the homogeneous Markov models stored in the hbp and hbpc files:

  • the precision is only four digits, which makes the probabilities not sum to 1
  • the joint probabilities in the hbp file sum to a value greater than one. It seems that the first entry of orders >1 is way too large.
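
The first point can be illustrated with a small sketch. The helper names below (`roundTo4`, `sumsToOne`, `renormalize`) are hypothetical and not taken from the BaMMmotif sources. The sketch only shows how rounding each probability to four decimals can move the total away from 1 and how a tolerance check would flag it. It does not address the oversized first entry in the joint probabilities.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical helper: round a probability to four decimal places,
// mimicking the precision reported for the hbp/hbpc files.
double roundTo4(double p) {
    return std::round(p * 1e4) / 1e4;
}

// Hypothetical helper: true if the probabilities sum to 1 within tol.
bool sumsToOne(const std::vector<double>& probs, double tol = 1e-6) {
    double total = std::accumulate(probs.begin(), probs.end(), 0.0);
    return std::fabs(total - 1.0) <= tol;
}

// Hypothetical helper: rescale the entries so that they sum to 1 again.
// Assumes the current sum is positive.
std::vector<double> renormalize(std::vector<double> probs) {
    double total = std::accumulate(probs.begin(), probs.end(), 0.0);
    for (double& p : probs) p /= total;
    return probs;
}
```

For example, three probabilities of 1/3 each become 0.3333 after rounding and then sum to 0.9999, so `sumsToOne` returns false until the values are renormalized.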

Additional info about how to use the software

Hi, I have been using the server version for some time for motif discovery and I wanted to add the shell version to my pipeline. However, when I run:
BaMMmotif DIRPATH FILEPATH

I get the following message:

Error: No initial model is provided.

Am I missing some necessary step/required file?

boost in cmake

We should check for the availability of Boost in the CMake file so that we can output a proper error message during the build phase.

File format specifications

If I am not mistaken, we currently do not have specifications for some of our own file formats, such as occurrence files.

Similar to MMseqs, I think we should start using the wiki for expanding the documentation step by step.

plotbamm: weird information calculation?

Is there a rationale behind not defining the information content as max_entropy - entropy here?

https://github.com/soedinglab/bamm-private/blob/master/R/plotBaMM.R#L259-L261

informationContent <- function( x, base=2 ){
    ifelse( all( x > 0 ), 2 + sum( x * log( x, base ) ), 0 )
}

This sets the information to zero if at least one nucleotide occurs with a relative frequency of zero, leading to very odd behavior:

> informationContent(c(0.25, 0.25, 0.25, 0.25))
0
> informationContent( c(0.01, 0.01, 0.01, 0.97))
1.75805926714679
> informationContent( c(0.0001, 0.0001, 0.0001, 0.9997))
1.99558094270164
> informationContent(c(0, 0, 0, 1))
0
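
For comparison, here is a minimal sketch of the max_entropy - entropy definition mentioned above, written in C++ rather than R. It assumes the common convention that a zero frequency contributes nothing to the entropy (0 * log 0 = 0). The function mirrors the name used in plotBaMM.R but is not the project's code.

```cpp
#include <cmath>
#include <vector>

// Sketch: information content in bits as max_entropy - entropy.
// Zero frequencies are skipped (convention 0 * log(0) = 0) instead of
// forcing the whole result to 0.
double informationContent(const std::vector<double>& freqs) {
    double entropy = 0.0;
    for (double p : freqs) {
        if (p > 0.0) entropy -= p * std::log2(p);
    }
    // Maximum entropy of a distribution over freqs.size() symbols.
    double maxEntropy = std::log2(static_cast<double>(freqs.size()));
    return maxEntropy - entropy;
}
```

With this definition a uniform column still yields 0 bits, while a fully conserved column such as (0, 0, 0, 1) yields 2 bits instead of 0.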

bug: vector allocation fails because of short sequences

command call:
BaMMmotif ./mcf7_GATA_narrow /home/mmeier/git/PEnG-motif/scripts/mcf7_GATA_narrow/mcf7_GATA_narrow.fasta --PWMFile /home/mmeier/git/PEnG-motif/scripts/mcf7_GATA_narrow/mcf7_GATA_narrow.tmp.out --FDR --savePvalues -K 2 --zoops

Result:
terminate called after throwing an instance of 'std::length_error'
what(): vector::_M_default_append

Backtrace with gdb:
#10 Motif::initFromPWM (this=0x51546f0, PWM=PWM@entry=0x66b680, asize=4, count=) at /home/mmeier/git/bamm-private/src/bamm/Motif.cpp:255

when printing LW1:
LW1 = -7

There seems to be a very short sequence in the dataset:

chr8:93080412-93080414
AC

Can you catch this?
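
One possible guard is sketched below. It assumes that `LW1` is the number of valid motif start positions, computed as sequence length minus motif width plus one. That formula is an assumption, not confirmed by the backtrace, and both function names are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: a sequence of length L offers L - W + 1 start positions
// for a motif of width W, and none when L < W. A negative count passed
// to a vector resize is converted to a huge unsigned size, which would
// be consistent with the std::length_error reported above.
std::size_t startPositions(std::size_t seqLength, std::size_t motifWidth) {
    if (seqLength < motifWidth) return 0;
    return seqLength - motifWidth + 1;
}

// Hypothetical filter: keep only sequences long enough to hold the motif.
std::vector<std::string> dropShortSequences(
        const std::vector<std::string>& seqs, std::size_t motifWidth) {
    std::vector<std::string> kept;
    for (const std::string& s : seqs) {
        if (startPositions(s.size(), motifWidth) > 0) kept.push_back(s);
    }
    return kept;
}
```

Whether short sequences should be dropped, padded, or reported with a warning is a design decision; the sketch only shows the length check itself.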

build failure

Current build failure on my Mac:

/Users/ch/repo/bamm-private/src/shared/SequenceSet.cpp:357:50: error: cannot take the address of an rvalue of type 'std::__1::basic_ostringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> >'
                                                header = static_cast<std::ostringstream*>( &( std::ostringstream() << ( N+1 ) ) )->str();
                                                                                           ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

http://pastebin.com/AfXfdZwH
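
The error arises because the code takes the address of a temporary `std::ostringstream`. Below is a minimal sketch of two alternatives that avoid this. It assumes `N` is an integer counter (its type is not shown in the report), and both function names are hypothetical.

```cpp
#include <sstream>
#include <string>

// Option 1 (C++11 and later): std::to_string avoids the stream entirely.
std::string headerFromIndex(int N) {
    return std::to_string(N + 1);
}

// Option 2 (also valid before C++11): write to a named stream object
// instead of taking the address of a temporary.
std::string headerFromIndexStream(int N) {
    std::ostringstream oss;
    oss << (N + 1);
    return oss.str();
}
```

Either variant could replace the `static_cast` expression on the reported line, e.g. `header = headerFromIndex( N );`.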

install all R files as command line scripts

We will need to call the R scripts from various other scripts. It'd be best to install them automatically in the cmake file. It should be configured such that make install moves the script to the bin folder of the given installation prefix.

For this to work properly, all R files should use /usr/bin/env Rscript in the shebang and avoid hardcoding the path to Rscript.

adjustments to the server

  • skip N's while reading in sequences;
  • adjust occur2bed.py so that it automatically converts the occurrence file to a BED file when header information is provided in the input FASTA file;
  • read in multiple BaMM files at the same time;
  • look into why EM optimization takes longer when given BaMM models as seeds than when given MEME models as seeds;
  • look into why the log-odds distribution of the positives is shifted to the left of that of the negatives;
  • investigate why the TGACTCA motif is everywhere;
  • write out the hyper-parameters to the BaMM model file, e.g. k, alpha;
  • check if the BED file positions are correct;
  • decide how to treat short sequences fairly;
  • v[k] does not sum up to Y[k+1];
  • ignore the usual end-of-line char in the input files;
  • write unit tests for tracing the changes.
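
As an illustration of the end-of-line item, here is a minimal sketch that assumes the goal is to read files with Unix (LF) and Windows (CRLF) line endings identically. The helper name is hypothetical and not part of the code base.

```cpp
#include <string>

// Hypothetical helper: remove trailing '\r' and '\n' characters so that
// a line read from a CRLF file compares equal to the same line read
// from an LF file.
std::string stripLineEnding(std::string line) {
    while (!line.empty() && (line.back() == '\r' || line.back() == '\n')) {
        line.pop_back();
    }
    return line;
}
```

Applied to each line returned by `std::getline`, this would leave a stray carriage return out of sequence headers and sequence data.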
