
fanglab / mbin


mBin: a methylation-based binning framework for metagenomic SMRT sequencing reads

License: Other

Makefile 1.09% Python 98.20% Awk 0.11% Dockerfile 0.60%

mbin's People

Contributors

fanggang, jbeaulaurier


mbin's Issues

error message : h5py.h5py_warnings.H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead.

  • mbin version: 1.1.1
  • Python version:2.7.8
  • Operating System: CentOS

Description

While trying to extract control IPDs with buildcontrols, I get this error message: h5py.h5py_warnings.H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead.
I don't know how to deal with it. I suspect it is, again, some kind of version problem, but I can't pin down which package is responsible.
I searched for this error message but found nothing useful.

What I Did

(venv) [yx@localhost data]$ buildcontrols -i --procs=4 --control_pkl_name=control.pkl aligned_reads.cmp.h5
2019-02-01 16:58:55 [INFO] Initiating dictionary of all possible motifs...
2019-02-01 16:58:55 [INFO]   - Adding 256 4-mer motifs...
2019-02-01 16:58:55 [INFO] Done: 256 possible contiguous motifs

2019-02-01 16:58:55 [INFO]   - Adding 1024 5-mer motifs...
2019-02-01 16:58:55 [INFO] Done: 1536 possible contiguous motifs

2019-02-01 16:58:55 [INFO]   - Adding 4096 6-mer motifs...
2019-02-01 16:58:55 [INFO] Done: 7680 possible contiguous motifs

2019-02-01 16:58:55 [INFO]   - Adding bipartite motifs to search space...
2019-02-01 16:58:56 [INFO] Done: 194560 possible bipartite motifs

2019-02-01 16:58:56 [INFO] 
2019-02-01 16:58:56 [INFO] Preparing to create new control data in ctrl_tmp
Traceback (most recent call last):
  File "/data2/Software/virtualenv_for_mbin/venv/bin/buildcontrols", line 10, in <module>
    sys.exit(launch())
  File "/data2/Software/virtualenv_for_mbin/venv/lib/python2.7/site-packages/mbin/controls.py", line 20, in launch
    extract_controls(opts, control_aln_fn)
  File "/data2/Software/virtualenv_for_mbin/venv/lib/python2.7/site-packages/mbin/controls.py", line 40, in extract_controls
    opts           = controls.scan_WGA_aligns()
  File "/data2/Software/virtualenv_for_mbin/venv/lib/python2.7/site-packages/mbin/controls.py", line 352, in scan_WGA_aligns
    reader = openIndexedAlignmentFile(self.control_aln_fn)
  File "/data2/Software/virtualenv_for_mbin/venv/lib/python2.7/site-packages/pbcore/io/opener.py", line 52, in openIndexedAlignmentFile
    return CmpH5Reader(fname, sharedIndex=sharedIndex)
  File "/data2/Software/virtualenv_for_mbin/venv/lib/python2.7/site-packages/pbcore/io/align/CmpH5IO.py", line 729, in __init__
    self._loadAlignmentInfo(sharedIndex)
  File "/data2/Software/virtualenv_for_mbin/venv/lib/python2.7/site-packages/pbcore/io/align/CmpH5IO.py", line 745, in _loadAlignmentInfo
    rawAlignmentIndex = self.file["/AlnInfo/AlnIndex"].value
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/data2/Software/virtualenv_for_mbin/venv/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 313, in value
    "Use dataset[()] instead.", H5pyDeprecationWarning)
h5py.h5py_warnings.H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead.
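
The last frame is a `warn(...)` call inside h5py, yet it surfaces as an exception. That normally happens only when a warnings filter set to "error" is active somewhere in the process. A minimal stdlib-only sketch of the mechanism follows; `FakeDeprecationWarning`, `read_value` and `call_with_filter` are hypothetical stand-ins (for h5py's warning class and the deprecated `.value` accessor), and where the filter is actually installed in this setup is not something the traceback shows.

```python
import warnings


class FakeDeprecationWarning(UserWarning):
    """Hypothetical stand-in for h5py's H5pyDeprecationWarning."""


def read_value():
    # Hypothetical stand-in for the deprecated `dataset.value` accessor:
    # it warns first, then returns the data anyway.
    warnings.warn(
        "dataset.value has been deprecated. Use dataset[()] instead.",
        FakeDeprecationWarning,
    )
    return [1, 2, 3]


def call_with_filter(action):
    """Call read_value() under a temporary warnings filter."""
    with warnings.catch_warnings():
        warnings.simplefilter(action)
        try:
            return ("ok", read_value())
        except FakeDeprecationWarning as exc:
            return ("raised", str(exc))


# With an "error" filter the warning becomes an exception, as in the traceback.
strict = call_with_filter("error")
# With "ignore" the same call simply returns the data.
relaxed = call_with_filter("ignore")
```

If this is indeed what is happening, two directions seem worth trying: relaxing the filter for that warning category, or pinning h5py to a release from before `.value` was deprecated. Both are assumptions about what would work here, not fixes confirmed in this thread.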

Bug report

  • mbin version: 1.1.0
  • Python version: 2.7.3
  • Operating System: Ubuntu 12.04 LTS

Description

Dear sir,
I used a virtual environment and installed these packages:
backports.functools-lru-cache (1.5)
biopython (1.70)
cycler (0.10.0)
Cython (0.28)
h5py (2.7.1)
kiwisolver (1.0.1)
matplotlib (2.2.0)
mbin (1.1.0)
numpy (1.13.3)
pbcore (1.2.10)
pip (9.0.2)
pyparsing (2.2.0)
pysam (0.14)
python-dateutil (2.7.0)
pytz (2018.3)
scipy (1.0.0)
setuptools (38.6.0)
six (1.11.0)
subprocess32 (3.2.7)
wheel (0.30.0)

I want to use buildcontrols, and I used the data from the GitHub repository, but something goes wrong. Could you help me? Thank you.

What I Did

(venv) zoubinserver@ubuntuZ:/mbin/mbin/data$ buildcontrols -i aligned_reads.cmp.h5
Traceback (most recent call last):
  File "/home/zoubinserver/venv/bin/buildcontrols", line 11, in <module>
    sys.exit(launch())
  File "/home/zoubinserver/venv/local/lib/python2.7/site-packages/mbin/controls.py", line 17, in launch
    opts,control_aln_fn = __parseArgs()
  File "/home/zoubinserver/venv/local/lib/python2.7/site-packages/mbin/controls.py", line 206, in __parseArgs
    opts.ref = os.path.abspath(opts.ref)
  File "/home/zoubinserver/venv/lib/python2.7/posixpath.py", line 343, in abspath
    if not isabs(path):
  File "/home/zoubinserver/venv/lib/python2.7/posixpath.py", line 53, in isabs
    return s.startswith('/')
AttributeError: 'NoneType' object has no attribute 'startswith'
(venv) zoubinserver@ubuntuZ:/mbin/mbin/data$ buildcontrols m140905_042212_sidney_c100564852550000001823085912221377_s1_X0.aligned_subreads.cmp.h5
Traceback (most recent call last):
  File "/home/zoubinserver/venv/bin/buildcontrols", line 11, in <module>
    sys.exit(launch())
  File "/home/zoubinserver/venv/local/lib/python2.7/site-packages/mbin/controls.py", line 17, in launch
    opts,control_aln_fn = __parseArgs()
  File "/home/zoubinserver/venv/local/lib/python2.7/site-packages/mbin/controls.py", line 206, in __parseArgs
    opts.ref = os.path.abspath(opts.ref)
  File "/home/zoubinserver/venv/lib/python2.7/posixpath.py", line 343, in abspath
    if not isabs(path):
  File "/home/zoubinserver/venv/lib/python2.7/posixpath.py", line 53, in isabs
    return s.startswith('/')
AttributeError: 'NoneType' object has no attribute 'startswith'
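
Both tracebacks stop at `os.path.abspath(opts.ref)` with `opts.ref` being `None`, which is consistent with the `--ref` option (used in a later report as `--ref=bacterial_refs_concat.fa`) not having been supplied. A minimal sketch of a guard that would turn this into a readable message; `resolve_ref` is a hypothetical helper, not mbin's code, and whether `--ref` is actually mandatory in this mode is an assumption.

```python
import os


def resolve_ref(ref):
    """Return an absolute path for ref, or raise a readable error if unset.

    Hypothetical helper: `ref` plays the role of `opts.ref` in the traceback.
    """
    if ref is None:
        raise ValueError("no reference FASTA given; pass it with --ref=<path>")
    return os.path.abspath(ref)


# Unset reference: a clear message instead of a failure deep inside posixpath.
try:
    resolve_ref(None)
    message = None
except ValueError as exc:
    message = str(exc)

# A relative path (placeholder file name) is resolved against the working directory.
resolved = resolve_ref("reference.fasta")
```

Until something like this exists in the tool, rerunning the same command with `--ref=<path to the reference FASTA>` added may be enough to get past this point.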

buildcontrols fails while reading the bam file

I'm using the current version of mbin, on Linux & Python 2.7.17 (Suse Linux version 4.12.14-150.63-default)

I want to use mbin on a metagenome to assign plasmids to genomes and improve binning.

I installed mbin as described in the documentation. I obtained a WGA dataset from PacBio to use as the IPD control. This is a mock metagenome containing bacteria as well as two yeast species. I mapped the reads to a concatenated reference of only the bacterial species using the pbmm2 aligner v1.2.0 from PacBio.

I ran buildcontrols on the aligned BAM file generated by pbmm2:

buildcontrols --procs=10 --ref=bacterial_refs_concat.fa aligned.bam

buildcontrols fails with this output:

2021-02-26 14:20:43 [INFO] Initiating dictionary of all possible motifs...
2021-02-26 14:20:43 [INFO] - Adding 256 4-mer motifs...
2021-02-26 14:20:43 [INFO] Done: 256 possible contiguous motifs

2021-02-26 14:20:43 [INFO] - Adding 1024 5-mer motifs...
2021-02-26 14:20:43 [INFO] Done: 1536 possible contiguous motifs

2021-02-26 14:20:43 [INFO] - Adding 4096 6-mer motifs...
2021-02-26 14:20:43 [INFO] Done: 7680 possible contiguous motifs

2021-02-26 14:20:43 [INFO] - Adding bipartite motifs to search space...
2021-02-26 14:20:44 [INFO] Done: 194560 possible bipartite motifs

2021-02-26 14:20:44 [INFO]
2021-02-26 14:20:44 [INFO] Preparing to create new control data in ctrl_tmp
Traceback (most recent call last):
  File "/global/cscratch1/sd/vsevim/software/my_p27/bin/buildcontrols", line 8, in <module>
    sys.exit(launch())
  File "/global/cscratch1/sd/vsevim/software/my_p27/lib/python2.7/site-packages/mbin/controls.py", line 20, in launch
    extract_controls(opts, control_aln_fn)
  File "/global/cscratch1/sd/vsevim/software/my_p27/lib/python2.7/site-packages/mbin/controls.py", line 40, in extract_controls
    opts = controls.scan_WGA_aligns()
  File "/global/cscratch1/sd/vsevim/software/my_p27/lib/python2.7/site-packages/mbin/controls.py", line 352, in scan_WGA_aligns
    reader = openIndexedAlignmentFile(self.control_aln_fn)
  File "/global/cscratch1/sd/vsevim/software/my_p27/lib/python2.7/site-packages/pbcore/io/opener.py", line 54, in openIndexedAlignmentFile
    return IndexedBamReader(fname, referenceFastaFname=referenceFastaFname, sharedIndex=sharedIndex)
  File "/global/cscratch1/sd/vsevim/software/my_p27/lib/python2.7/site-packages/pbcore/io/align/BamIO.py", line 385, in __init__
    super(IndexedBamReader, self).__init__(fname, referenceFastaFname)
  File "/global/cscratch1/sd/vsevim/software/my_p27/lib/python2.7/site-packages/pbcore/io/align/BamIO.py", line 198, in __init__
    self._loadReferenceInfo()
  File "/global/cscratch1/sd/vsevim/software/my_p27/lib/python2.7/site-packages/pbcore/io/align/BamIO.py", line 73, in _loadReferenceInfo
    refMD5s = [r["M5"] for r in refRecords]
KeyError: 'M5'

It seems the BAM reader is looking for the 'M5' field in the file, but I can confirm that there is no such field in the BAM header.

Do you have any suggestions on how to solve this issue?

Thanks!
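
In the SAM header, `M5` is an optional tag on `@SQ` lines holding the MD5 checksum of the reference sequence (upper-cased, with whitespace and other non-printing characters removed), and the failing line reads `r["M5"]` for every reference record. A minimal stdlib-only sketch for listing which `@SQ` records lack the tag and for computing the checksum from a sequence; `sq_missing_m5` and `sequence_md5` are hypothetical helper names, the header text is made up, and whether adding `M5` tags is enough for pbcore to accept a pbmm2 BAM is not confirmed here.

```python
import hashlib


def sq_missing_m5(header_text):
    """Return the SN names of @SQ header lines that carry no M5 tag."""
    missing = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        if "M5" not in tags:
            missing.append(tags.get("SN", "?"))
    return missing


def sequence_md5(sequence):
    """MD5 of a sequence, prepared the way the SAM spec describes for M5."""
    # Keep only printable ASCII (codes 33..126) and upper-case before hashing.
    cleaned = "".join(ch for ch in sequence if 33 <= ord(ch) <= 126).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


# Made-up header: the second reference has no M5 tag.
header = (
    "@HD\tVN:1.5\tSO:coordinate\n"
    "@SQ\tSN:contig_1\tLN:8\tM5:" + sequence_md5("ACGTACGT") + "\n"
    "@SQ\tSN:contig_2\tLN:8\n"
)
missing = sq_missing_m5(header)
```

The real header text can be dumped with `samtools view -H aligned.bam` and passed to `sq_missing_m5` to see which references are affected.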

Nanopore

Dear mBin team,

Is it possible to run mBin on Nanopore data? The paper mentions that it would be possible, but all the usage examples are for SMRT data. Can ONT data be used?

Thanks,
Luc

Add small test data set for installation validation

  • mbin version: 1.0
  • Python version: 2.7

Description

Need to identify a good (small) test set of aligned and unaligned reads to be included in the package for testing purposes.

Details

There are some good data sets included in the pbcore data:

  1. BAM of aligned reads:
    m140905_042212_sidney_c100564852550000001823085912221377_s1_X0.aligned_subreads.bam
    here

  2. cmp.h5 of aligned reads:
    m140905_042212_sidney_c100564852550000001823085912221377_s1_X0.aligned_subreads.cmp.h5
    here

  3. bax.h5 of unaligned E. coli K12 reads:
    m140912_020930_00114_c100702482550000001823141103261590_s1_p0
    movie.1.bax.h5
    movie.2.bax.h5
    movie.3.bax.h5
    bas.h5

  4. BAM of unaligned reads:
    possibly here
    otherwise the bax2bam output of m140912_020930_00114_c100702482550000001823141103261590_s1_p0 above

Add support for in silico control data in filtermotifs

  • mbin version: v1.0

Currently the buildcontrols program extracts control IPD information from whole-genome amplified (WGA) sequencing data, then uses these control IPD values for motif filtering. However, it would be useful to be able to get the control IPD values not from WGA data (which is not always available), but instead from the PacBio in silico IPD models that exist for each SMRT sequencing chemistry release.

Potential dependency conflicts between mbin and numpy

Hi, as shown in the full dependency graph of mbin below, mbin requires numpy >=1.7.1,<1.14 and matplotlib >=1.5.0 (matplotlib 3.2.1 will be installed, i.e., the newest version satisfying the constraint), and the direct dependency matplotlib 3.2.1 transitively introduces numpy >=1.11.

So there are multiple version constraints on numpy in this project. According to pip's “first found wins” installation strategy, numpy 1.13.3 (i.e., the newest version satisfying >=1.7.1,<1.14) is the version actually installed.

Although numpy 1.13.3 also satisfies the other constraint (numpy >=1.11, from matplotlib 3.2.1), it sits very close to the upper bound of mbin's own constraint (<1.14).

Once matplotlib releases an upgrade, its newest version will be installed. This can easily cause a dependency conflict (build failure) if the upgraded matplotlib requires a higher version of numpy that violates the other constraint, >=1.7.1,<1.14.

According to matplotlib's release history, it habitually raises its numpy requirement in new releases. For instance, matplotlib #15645 raised the numpy constraint from >=1.11 to >=1.12, and matplotlib #15698 raised it from >=1.12 to >=1.15.

As such, this is a friendly warning of a potential dependency conflict for mbin.

Dependency tree

mbin - 1.1.1
| +- biopython(install version:1.76 version range:>=1.6.1)
| | +- numpy(install version:1.13.3 version range:*)
| +- h5py(install version:2.10.0 version range:>=2.0.1)
| +- matplotlib(install version:3.2.1 version range:>=1.5.0)
| | +- cycler(install version:0.10.0 version range:>=0.10)
| | | +- six(install version:1.14.0 version range:*)
| | +- kiwisolver(install version:1.2.0 version range:>=1.0.1)
| | +- numpy(install version:1.13.3 version range:>=1.11)
| | +- pyparsing(install version:3.0.0a1 version range:>=2.0.1)
| | +- python-dateutil(install version:2.8.1 version range:>=2.1)
| +- numpy(install version:1.13.3 version range:>=1.7.1,<1.14)
| +- pbcore(install version:1.2.10 version range:>=0.9.4)
| | +- cython(install version:3.0a1 version range:*)
| | +- h5py(install version:2.10.0 version range:>=2.0.1)
| | +- numpy(install version:1.13.3 version range:>=1.7.1)
| | +- pysam(install version:0.15.4 version range:>=0.9.0)
| | | +- cython(install version:3.0a1 version range:>=0.29.12)
| +- pysam(install version:0.15.4 version range:>=0.10.0)
| | +- cython(install version:3.0a1 version range:>=0.29.12)
| +- scipy(install version:1.2.3 version range:>=0.12.0)

Thanks for your help.
Best,
Neolith
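
The conflict described in this report can be checked mechanically. The sketch below is a deliberately simplified comparator, not pip's resolver and not a full PEP 440 implementation (pre-releases such as `3.0a1` are not handled); `parse_version` and `satisfies` are hypothetical helpers, the constraint strings are the ones quoted in the report, and the candidate list is only an illustrative handful of numpy releases.

```python
import re

_OPS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    "==": lambda a, b: a == b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}


def parse_version(text, width=3):
    """Turn '1.13.3' into (1, 13, 3), padding short versions with zeros."""
    parts = [int(p) for p in text.strip().split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(version, spec):
    """True if version meets every comma-separated clause in spec."""
    v = parse_version(version)
    for clause in spec.split(","):
        op, bound = re.match(r"\s*(>=|<=|==|>|<)\s*(\S+)", clause).groups()
        if not _OPS[op](v, parse_version(bound)):
            return False
    return True


MBIN_SPEC = ">=1.7.1,<1.14"   # mbin's constraint, from the report
CURRENT_MPL_SPEC = ">=1.11"   # matplotlib 3.2.1's constraint, from the report
FUTURE_MPL_SPEC = ">=1.15"    # constraint after matplotlib #15698, from the report

candidates = ["1.7.1", "1.11.0", "1.13.3", "1.14.0", "1.15.0", "1.18.1"]
ok_today = [c for c in candidates
            if satisfies(c, MBIN_SPEC) and satisfies(c, CURRENT_MPL_SPEC)]
ok_future = [c for c in candidates
             if satisfies(c, MBIN_SPEC) and satisfies(c, FUTURE_MPL_SPEC)]
```

Here `ok_today` still contains 1.13.3, while `ok_future` is empty: no numpy version can be both below 1.14 and at least 1.15, which is the build failure the report warns about.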

Switch to an open source license

The current license is CC BY-NC-SA 4.0, which is not an open source software license.

From the Creative Commons FAQ:

We recommend against using Creative Commons licenses for software. Instead, we strongly encourage you to use one of the very good software licenses which are already available. We recommend considering licenses made available by the Free Software Foundation or listed as “open source” by the Open Source Initiative.

Unlike software-specific licenses, CC licenses do not contain specific terms about the distribution of source code, which is often important to ensuring the free reuse and modifiability of software. Many software licenses also address patent rights, which are important to software but may not be applicable to other copyrightable works. Additionally, our licenses are currently not compatible with the major software licenses, so it would be difficult to integrate CC-licensed work with other free software. Existing software licenses were designed specifically for use with software and offer a similar set of rights to the Creative Commons licenses.

The paper mentions funding from R01-GM114472-01, so I'm assuming this work is intended to be reusable by the public!

A list of recommended open source software licenses is available from the OSI. Happy to answer any questions to the best of my abilities on the implications of different licenses.

About the input file format

Hi,

My PacBio data is in bam and bam.pbi format, but the input file format of mbin is cmp.h5. Is mbin compatible with bam? If not, is there a way to convert a bam file to cmp.h5?

Thanks.

Add --rebase_motifs option for speeding up filtermotifs step

  • mbin version: 1.0
  • Python version: 2.7

Feature request

It would be very nice to have an alternative to completely de novo motif detection. Instead of querying a massive space of all possible k-mers and bipartite motif configurations (several hundred thousand), it would be nice to be able to gather IPD scores only for the ~1000 motifs that have been previously identified in bacterial methylomes and are present in the REBASE database. Applying this option would greatly speed up the filtermotifs pipeline.
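
To make the size difference concrete: the buildcontrols logs earlier on this page report 256, 1024 and 4096 motifs being added for k = 4, 5 and 6, i.e. 4^k sequences per length. A minimal sketch of that enumeration and of restricting it to a list of known motifs; `contiguous_kmers` and `restrict_to_known` are hypothetical helpers rather than mbin functions, the two motifs in `known_motifs` are placeholders standing in for a REBASE export, and bipartite or degenerate motifs are not covered.

```python
from itertools import product

BASES = "ACGT"


def contiguous_kmers(k):
    """All 4**k contiguous motifs of length k over A, C, G, T."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def restrict_to_known(motifs, known):
    """Keep only motifs that appear in a set of previously reported motifs."""
    known = set(known)
    return [m for m in motifs if m in known]


# De novo search space of contiguous motifs for k = 4, 5, 6.
search_space = [m for k in (4, 5, 6) for m in contiguous_kmers(k)]

# Placeholder whitelist standing in for motifs exported from REBASE.
known_motifs = ["GATC", "CTGCAG"]
reduced = restrict_to_known(search_space, known_motifs)
```

With a whitelist of roughly a thousand entries, IPD scores would only need to be gathered for the motifs in `reduced` instead of the full `search_space`.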

file format mismatch

  • mbin version:1.1.1
  • Python version:2.7
  • Operating System:Linux Ubuntu

Description

Hello, I would like to ask a question.
After reading your article 'Metagenomic binning and association of plasmids with bacterial host genomes using DNA methylation', I want to run the same analysis on metagenomic data from Nanopore sequencing. I read the mBin documentation on GitHub, but your analysis is carried out on a PacBio dataset. You use the wga_aligned_reads.cmp.h5 file to obtain a set of control IPD values, but this file is specific to PacBio sequencing, while Nanopore sequencing only produces fast5 files. What should I do to get the equivalent files for Nanopore data?

What I Did

Can you give me some ideas about this problem?
Thank you very much; I look forward to your reply.


buildcontrols -i --procs=4 --control_pkl_name=control_means.pkl wga_aligned_reads.cmp.h5
