snayfach / microbecensus Goto Github PK

41.0 41.0 16.0 182.13 MB

MicrobeCensus estimates the average genome size of microbial communities from metagenomic data

Home Page: http://genomebiology.com/2015/16/1/51

License: GNU General Public License v3.0

Python 95.71% R 4.29%

microbecensus's People

Contributors

Stargazers

Watchers

Forkers

lowks tarah28 palc python3pkg boulund haroon123 nigiord aekazakov liaohu1231 elenacabelloyeves xiangyang1984 anyihu hj1994412 blindner6

microbecensus's Issues

Handling errors with command line parameters

When I accidentally gave an incorrect path (i.e. the file did not exist) as input to microbe_census, the error printed suggests that I have supplied 'Incorrect number of command line arguments'. Any chance the actual error message could be printed out, or handled specifically?

Thanks!

Feature Request: Add support for DIAMOND

I've been using Huson's DIAMOND sequence similarity tool, and have been pleased with its performance. Please consider adding support.

question: calculation of RPKG

Hi @snayfach ,

I've been going back and forth between normalization methods for my data, and wasn't sure whether to trust the microbecensus results for some of my samples (very large AGS estimates, but also probably a high number of viruses). I finally decided to move forward with it, but have a question concerning the calculations:

I'm trying to use your suggested RPKG normalization for my KO abundance table. I'm not sure how to best do step 2 of this equation though. Is there a list of gene lengths for each KO? If so, where can I find it?
"The RPKG of a KO in a metagenome was computed by: 1) counting the number of reads mapped to the KO; 2) dividing (1) by the length of the KO in kilobase pairs; and 3) dividing the result of (2) by the number of sequenced genome equivalents."

Thanks a lot!
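
For reference, the three quoted steps translate into a few lines of arithmetic. This is only a sketch: the function name `rpkg` and the example numbers are made up, and the gene length for a KO has to come from whatever reference the reads were mapped against.

```python
def rpkg(read_count, gene_length_bp, genome_equivalents):
    """Reads per kilobase per genome equivalent, following the three quoted steps."""
    if gene_length_bp <= 0 or genome_equivalents <= 0:
        raise ValueError("gene length and genome equivalents must be positive")
    # Steps 1 and 2: reads mapped to the KO, divided by its length in kilobases.
    reads_per_kb = read_count / (gene_length_bp / 1000.0)
    # Step 3: divide by the number of sequenced genome equivalents.
    return reads_per_kb / genome_equivalents


# Hypothetical values, purely for illustration.
print(rpkg(read_count=500, gene_length_bp=1000, genome_equivalents=250.0))
```

The `genome_equivalents` argument would be the value reported in the MicrobeCensus output for the same sample.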

making sense of results...

Hi, thanks for this tool! Very easy to use and great documentation. Still, I'm struggling with the output of my run. I'm working on a set of environmental samples that I would like to use for comparative analysis. However, the two types of sample sets differ in the contribution of eukaryotes (mainly diatoms), as well as viruses. As far as I understood from your paper, AGS may not be very reliable then, but RPKG may still be meaningful? I don't have a lot of examples to compare to, but I get the feeling that my estimates are very high, e.g.
Sample with very low expected contribution by eukaryotes:
Parameters
metagenome: CBmetaG_2.fastq.gz
reads_sampled: 2000000
trimmed_length: 150
min_quality: -5
mean_quality: -5
filter_dups: False
max_unknown: 100

Results
average_genome_size: 18141036.055869997
total_bases: 46309620851
genome_equivalents: 2552.7550195246613

And sample with very large expected contribution by eukaryotes:
Parameters
metagenome: SBmetaG_2.fastq.gz
reads_sampled: 2000000
trimmed_length: 150
min_quality: -5
mean_quality: -5
filter_dups: False
max_unknown: 100

Results
average_genome_size: 10211877.522966972
total_bases: 78896001286
genome_equivalents: 7725.905555423999

My gene prediction and counts were done on assembled reads; as input for MicrobeCensus, however, I'm using the unassembled reads that went into the assembly.
Am I doing this right?

Thanks for your help!
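
As a sanity check on the two reports, the numbers are consistent with genome_equivalents being total_bases divided by average_genome_size. The sketch below only re-does that division on the values pasted above; the helper name is made up and nothing here says anything about whether the AGS estimates themselves are reliable for eukaryote-rich samples.

```python
import math


def genome_equivalents(total_bases, average_genome_size):
    """Total sequenced bases expressed in units of the average genome size."""
    return total_bases / float(average_genome_size)


# (sample, total_bases, average_genome_size, genome_equivalents as reported)
reported = [
    ("CBmetaG_2", 46309620851, 18141036.055869997, 2552.7550195246613),
    ("SBmetaG_2", 78896001286, 10211877.522966972, 7725.905555423999),
]

for name, total_bases, ags, reported_ge in reported:
    ge = genome_equivalents(total_bases, ags)
    # Compare the recomputed value with the one in the output file.
    print(name, round(ge, 3), math.isclose(ge, reported_ge, rel_tol=1e-6))
```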

temp files when program exits of error

I have noticed that when MicrobeCensus errors out or execution is halted that the temp files created by the processes remain. You might want to consider altering the code to remove these files if the program exits prematurely.
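
One common way to get this behaviour is to wrap the work in `try`/`finally`. The sketch below is not taken from the MicrobeCensus code; `run_with_cleanup` and the `work` callback are hypothetical names used only to show the pattern.

```python
import os
import tempfile


def run_with_cleanup(work):
    """Call work(path) with a scratch file that is removed even if work raises."""
    fd, path = tempfile.mkstemp(suffix=".tmp")
    os.close(fd)
    try:
        return work(path)
    finally:
        # Runs on normal return, on exceptions, and on Ctrl-C (KeyboardInterrupt).
        if os.path.exists(path):
            os.remove(path)
```

This does not help when the process is killed outright (for example with SIGKILL), so a scheduler that hard-kills jobs would still leave files behind.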

Could not import module 'pkg_resources'

Following the install instructions, get this error:
Could not import module 'pkg_resources'

Linux version 3.10-3-amd64 ([email protected]) (gcc version 4.7.2 (Debian 4.7.2-5) ) #1 SMP Debian 3.10.11-1 (2013-09-10)

Python 2.7.3

MicrobeCensus should use requested # of cores

Even when specifying that MC use 16 cores, by monitoring with 'top' I see that it never utilizes more than two or three cores at a time. It took 3+ hours to process a 2.3 Gbyte read file, sampling all reads. It would have been much faster if it were able to fully exploit the available horsepower.

TypeError: cannot unpack non-iterable NoneType object in run_microbe_census.py

Hi,

I hit a TypeError with the conda build of MicrobeCensus v1.1.1. Could you give me any suggestions? Many thanks!

My running code:
run_microbe_census.py -t 100 -n 100000000 SK0108G_A_1.fastq,SK0108G_A_2.fastq microbecensus/SK0108G_A_out

Error:
a bytes-like object is required, not 'str'
Traceback (most recent call last):
File "run_microbe_census.py", line 62, in <module>
est_ags, args = microbe_census.run_pipeline(args)
^^^^^^^^^^^^^
TypeError: cannot unpack non-iterable NoneType object
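
The shape of this output (an error message printed first, then a TypeError about unpacking None) looks consistent with a function that catches an exception, prints it, and then falls through without returning anything. That is a guess from the traceback, not something checked against the source. A minimal illustration of the pattern, with made-up function names:

```python
def run_pipeline_swallowing(args):
    """Print the error and fall through, so the caller receives None."""
    try:
        raise ValueError("simulated failure inside the pipeline")
    except ValueError as exc:
        print(exc)


def run_pipeline_raising(args):
    """Let the original exception propagate so the real cause stays visible."""
    raise ValueError("simulated failure inside the pipeline")


try:
    est_ags, args = run_pipeline_swallowing({})
except TypeError as exc:
    # The unpacking error hides the original ValueError.
    masked = str(exc)
    print(masked)
```

With the second variant the traceback would point at the line that actually failed, which is usually what you need to debug the first message.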

installation on a cluster

Hi,
I am trying to install MicrobeCensus on the abel computing cluster here in Oslo, Norway. The cluster runs 64-bit Linux (CentOS 6).

Since I lack sufficient permissions, I followed your alternative installation suggestion.
I can load python2 (v2.7.10) since that has the biopython and numpy packages.

I have tested if I can import numpy and Bio.SeqIO on the python commandline.
I can.
These packages are installed in the folder:

/cluster/software/VERSIONS/python_packages-2.7_3/cluster/software/VERSIONS/python2-2.7.9/lib/python2.7/site-packages/

This folder is found in my $PYTHONPATH after loading python2.

Next I added microbecensus to the $PYTHONPATH
and I added the scripts to the $PATH.

When I then call: run_microbe_census.py -h
I get:

Could not import module 'numpy'

So I believe I have followed the correct way of setting this up, so I am surprised numpy is not found.

Do you have any idea on how I can solve this?
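
A quick way to narrow this down is to run the import with the same interpreter the script uses and print the full traceback, since a short "Could not import module" message does not show the underlying cause. The helper below is a generic sketch, not part of MicrobeCensus.

```python
import importlib
import sys
import traceback


def diagnose_import(module_name):
    """Try to import a module; return (ok, location or full traceback)."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        return False, traceback.format_exc()
    return True, getattr(module, "__file__", "<built-in>")


# Which interpreter is running, and where does it look for packages?
print(sys.executable)
print(sys.path)
for name in ("numpy", "Bio"):
    ok, details = diagnose_import(name)
    print(name, ok, details)
```

If the interpreter printed here is not the python2 module that was loaded, the script is probably being started by a different Python than the one tested on the command line.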

MC should use the Unix exit code convention

MC reports "Error! No reads remaining after filtering!"

But the exit code is 0. Unix convention is to have a non-zero, non-negative exit code when there's an error.

This makes it harder to detect errors when running batches of MC runs via a script.
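
To illustrate why the exit code matters for batch runs, here is a small sketch. The child commands are stand-ins written for this example (one exits with status 1 after printing the error, one succeeds); they are not real MicrobeCensus invocations.

```python
import subprocess
import sys

# Stand-in for a run that reports an error and exits with a non-zero status.
FAILING = ('import sys; '
           'sys.stderr.write("Error! No reads remaining after filtering!\\n"); '
           'sys.exit(1)')


def run_batch(commands):
    """Run each command and return the ones whose exit status was non-zero."""
    failed = []
    for cmd in commands:
        if subprocess.call(cmd) != 0:
            failed.append(cmd)
    return failed


failed = run_batch([
    [sys.executable, "-c", FAILING],
    [sys.executable, "-c", "pass"],
])
print(len(failed))
```

If the failing run exited with 0 instead, `run_batch` would have no way to tell it apart from the successful one without parsing the output text.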

You're living in the past, man

Update copyright in -v output to 2015, before everyone pounces on your public domain code.

run_microbe_census.py crashes when input file not found

It should check whether the file exists before attempting to open it, and present a clean error message to the user.
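
Something along these lines would do. It is only a sketch with a made-up helper name, and it assumes the input paths are already available as a list.

```python
import os


def check_input_files(paths):
    """Return a list of problems; an empty list means every input looks usable."""
    problems = []
    for path in paths:
        if not os.path.isfile(path):
            problems.append("input file not found: %s" % path)
        elif not os.access(path, os.R_OK):
            problems.append("input file not readable: %s" % path)
    return problems


for message in check_input_files(["/no/such/dir/reads.fastq"]):
    print(message)
```

The caller can then print the messages and exit with a non-zero status before any processing starts.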

differences with other normalization methods

Hi again

Thank you for your wonderful software. I love the fact that I am able to normalize using the estimated number of genomes in each sample.
Out of curiosity, I tried to compare the MicrobeCensus results (from a group of 50 samples) to a previous normalization attempt using the sum of 16S copies in the same samples. However, the two normalization methods did not agree at all.
Would you say that this is mostly because 16S is a multicopy gene while MicrobeCensus uses single-copy genes?
Just wanted to hear your take on this.

Thanks

Pip install error from official 1.1.0 tarball

Just created a clean conda environment with Python 3.6 and tried to install MicrobeCensus into it using pip, as instructed by the installation manual. Note that the conda environment is active when I tried the commands below, but I removed the bash prefix for brevity here.

I got this:

$ wget https://github.com/snayfach/MicrobeCensus/archive/v1.1.0.tar.gz -O MicrobeCensus-v1.1.0.tar.gz
$ pip install MicrobeCensus-v1.1.0.tar.gz
Processing ./MicrobeCensus-v1.1.0.tar.gz                                               
    Complete output from command python setup.py egg_info:                             
    Traceback (most recent call last):                                                 
      File "<string>", line 1, in <module>                                             
      File "/tmp/pip-gxty91hz-build/setup.py", line 12                                 
        mode = ((os.stat(fn).st_mode) | 0555) & 07777                                  
                                           ^                                           
    SyntaxError: invalid token                                                         
                                                                                       
    ----------------------------------------                                           
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-gxty91hz-build/

I see the same thing trying to install from pypi, but then it downloads an older version:

$ pip install MicrobeCensus
Collecting MicrobeCensus                                                                             
  Downloading MicrobeCensus-1.0.7.tar.gz (22.0MB)                                                    
    100% |████████████████████████████████| 22.0MB 31.7MB/s                                          
    Complete output from command python setup.py egg_info:                                           
    Traceback (most recent call last):                                                               
      File "<string>", line 1, in <module>                                                           
      File "/tmp/pip-build-v02fdl_2/MicrobeCensus/setup.py", line 12                                 
        mode = ((os.stat(fn).st_mode) | 0555) & 07777                                                
                                           ^                                                         
    SyntaxError: invalid token                                                                       
                                                                                                     
    ----------------------------------------                                                         
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-v02fdl_2/MicrobeCensus/
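
The SyntaxError comes from the old-style octal literals: Python 3 rejects `0555` and `07777`, while the `0o` prefix is accepted by both Python 2.6+ and Python 3. A sketch of the same expression with the newer literals (the function name and the demo file are made up for this example):

```python
import os
import tempfile


def make_executable_mode(path):
    """Return the file's permission bits with read and execute added for everyone."""
    # 0o555 and 0o7777 parse on Python 2.6+ and Python 3;
    # the bare 0555 and 07777 forms are a SyntaxError on Python 3.
    return (os.stat(path).st_mode | 0o555) & 0o7777


# Demo on a throwaway file that starts out as rw------- (0o600).
fd, demo_path = tempfile.mkstemp()
os.close(fd)
os.chmod(demo_path, 0o600)
print(oct(make_executable_mode(demo_path)))
os.remove(demo_path)
```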

Test failed

Hi, just installed and ran the test in the "tests" directory with the following:

`.E.

ERROR: test_ags_estimation (__main__.Pipeline)

Traceback (most recent call last):
File "test_microbe_census.py", line 21, in setUp
self.observed = microbe_census.run_pipeline({'seqfiles':[self.infile]})[0]
File "/usr/lib/python2.7/site-packages/microbe_census/microbe_census.py", line 583, in run_pipeline
paths = get_relative_paths(args)
File "/usr/lib/python2.7/site-packages/microbe_census/microbe_census.py", line 109, in get_relative_paths
if args['rapsearch']:
KeyError: 'rapsearch'

Ran 3 tests in 0.066s

FAILED (errors=1)`
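
The KeyError suggests the test passes a dictionary that lacks a key the pipeline expects. One generic way to make that robust is to merge the caller's arguments over a set of defaults before using them. The default keys and values below are hypothetical, not the real MicrobeCensus defaults.

```python
# Hypothetical defaults; the real option names and values may differ.
DEFAULTS = {"rapsearch": None, "threads": 1}


def with_defaults(args):
    """Return a copy of args in which every expected key is present."""
    merged = dict(DEFAULTS)
    merged.update(args)
    return merged


args = with_defaults({"seqfiles": ["metagenome.fa.gz"]})
# The lookup that failed in the traceback now finds a value.
if args["rapsearch"]:
    print("using custom rapsearch binary")
else:
    print("using bundled rapsearch binary")
```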

could not convert string to float: FIRM56_P641593085

Hello,

I was running MicrobeCensus on my pair end reads and it threw the error:

could not convert string to float: FIRM56_P641593085
Traceback (most recent call last):
File "/home/USR/miniconda2/bin/run_microbe_census.py", line 62, in <module>
est_ags, args = microbe_census.run_pipeline(args)
TypeError: 'NoneType' object is not iterable

Is there any way of fixing this? Thanks!

Something wrong when put 'python setup.py install'

/envs/ham/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
setuptools.SetuptoolsDeprecationWarning,
/envs/ham/lib/python3.7/site-packages/setuptools/command/easy_install.py:147: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
EasyInstallDeprecationWarning,
/envs/ham/lib/python3.7/site-packages/pkg_resources/__init__.py:125: PkgResourcesDeprecationWarning: 4.0.0-unsupported is an invalid version and will not be supported in a future release
PkgResourcesDeprecationWarning,
running bdist_egg
running egg_info
creating MicrobeCensus.egg-info
writing MicrobeCensus.egg-info/PKG-INFO
writing dependency_links to MicrobeCensus.egg-info/dependency_links.txt
writing requirements to MicrobeCensus.egg-info/requires.txt
writing top-level names to MicrobeCensus.egg-info/top_level.txt
writing manifest file 'MicrobeCensus.egg-info/SOURCES.txt'
reading manifest file 'MicrobeCensus.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching 'training/README.txt'
adding license file 'LICENSE.txt'
writing manifest file 'MicrobeCensus.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/microbe_census
copying microbe_census/microbe_census.py -> build/lib/microbe_census
copying microbe_census/__init__.py -> build/lib/microbe_census
creating build/lib/tests
copying tests/test_microbe_census.py -> build/lib/tests
copying tests/__init__.py -> build/lib/tests
creating build/lib/microbe_census/data
copying microbe_census/data/pars.map -> build/lib/microbe_census/data
copying microbe_census/data/rapdb_2.15 -> build/lib/microbe_census/data
copying microbe_census/data/read_len.map -> build/lib/microbe_census/data
copying microbe_census/data/gene_fam.map -> build/lib/microbe_census/data
copying microbe_census/data/coefficients.map -> build/lib/microbe_census/data
copying microbe_census/data/rapdb_2.15.info -> build/lib/microbe_census/data
copying microbe_census/data/gene_len.map -> build/lib/microbe_census/data
copying microbe_census/data/weights.map -> build/lib/microbe_census/data
creating build/lib/microbe_census/bin
copying microbe_census/bin/rapsearch_Darwin_2.15 -> build/lib/microbe_census/bin
copying microbe_census/bin/rapsearch_Linux_2.15 -> build/lib/microbe_census/bin
copying microbe_census/bin/prerapsearch_Darwin_2.15 -> build/lib/microbe_census/bin
copying microbe_census/bin/prerapsearch_Linux_2.15 -> build/lib/microbe_census/bin
creating build/lib/microbe_census/example
copying microbe_census/example/example.fa.gz -> build/lib/microbe_census/example
copying microbe_census/example/example.fq.gz -> build/lib/microbe_census/example
creating build/lib/tests/data
copying tests/data/metagenome.fa.gz -> build/lib/tests/data
copying tests/data/community.txt -> build/lib/tests/data
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/microbe_census
creating build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/pars.map -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/rapdb_2.15 -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/read_len.map -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/gene_fam.map -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/coefficients.map -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/rapdb_2.15.info -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/gene_len.map -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/weights.map -> build/bdist.linux-x86_64/egg/microbe_census/data
creating build/bdist.linux-x86_64/egg/microbe_census/example
copying build/lib/microbe_census/example/example.fa.gz -> build/bdist.linux-x86_64/egg/microbe_census/example
copying build/lib/microbe_census/example/example.fq.gz -> build/bdist.linux-x86_64/egg/microbe_census/example
copying build/lib/microbe_census/microbe_census.py -> build/bdist.linux-x86_64/egg/microbe_census
creating build/bdist.linux-x86_64/egg/microbe_census/bin
copying build/lib/microbe_census/bin/rapsearch_Darwin_2.15 -> build/bdist.linux-x86_64/egg/microbe_census/bin
copying build/lib/microbe_census/bin/rapsearch_Linux_2.15 -> build/bdist.linux-x86_64/egg/microbe_census/bin
copying build/lib/microbe_census/bin/prerapsearch_Darwin_2.15 -> build/bdist.linux-x86_64/egg/microbe_census/bin
copying build/lib/microbe_census/bin/prerapsearch_Linux_2.15 -> build/bdist.linux-x86_64/egg/microbe_census/bin
copying build/lib/microbe_census/__init__.py -> build/bdist.linux-x86_64/egg/microbe_census
creating build/bdist.linux-x86_64/egg/tests
creating build/bdist.linux-x86_64/egg/tests/data
copying build/lib/tests/data/metagenome.fa.gz -> build/bdist.linux-x86_64/egg/tests/data
copying build/lib/tests/data/community.txt -> build/bdist.linux-x86_64/egg/tests/data
copying build/lib/tests/test_microbe_census.py -> build/bdist.linux-x86_64/egg/tests
copying build/lib/tests/__init__.py -> build/bdist.linux-x86_64/egg/tests
byte-compiling build/bdist.linux-x86_64/egg/microbe_census/microbe_census.py to microbe_census.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/microbe_census/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/test_microbe_census.py to test_microbe_census.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/__init__.py to __init__.cpython-37.pyc
changing mode of build/bdist.linux-x86_64/egg/microbe_census/bin/rapsearch_Darwin_2.15 to 1141
changing mode of build/bdist.linux-x86_64/egg/microbe_census/bin/rapsearch_Linux_2.15 to 1141
changing mode of build/bdist.linux-x86_64/egg/microbe_census/bin/prerapsearch_Darwin_2.15 to 1041
changing mode of build/bdist.linux-x86_64/egg/microbe_census/bin/prerapsearch_Linux_2.15 to 1141
creating build/bdist.linux-x86_64/egg/EGG-INFO
installing scripts to build/bdist.linux-x86_64/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/scripts-3.7
copying and adjusting scripts/run_microbe_census.py -> build/scripts-3.7
changing mode of build/scripts-3.7/run_microbe_census.py from 664 to 775
creating build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/run_microbe_census.py -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/run_microbe_census.py to 775
copying MicrobeCensus.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MicrobeCensus.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MicrobeCensus.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MicrobeCensus.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MicrobeCensus.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
microbe_census.__pycache__.microbe_census.cpython-37: module references __file__
tests.__pycache__.test_microbe_census.cpython-37: module references __file__
creating dist
creating 'dist/MicrobeCensus-1.1.0-py3.7.egg' and adding 'build/bdist.linux-x86_64/egg' to it
error: [Errno 13] Permission denied: 'build/bdist.linux-x86_64/egg/microbe_census/bin/prerapsearch_Darwin_2.15'

Then I tried running "MicrobeCensus/build/scripts-3.7/run_microbe_census.py" directly, and it failed with "ModuleNotFoundError: No module named 'microbe_cencus'".

P.S. I can't sudo

problem reading multiple files

Hi thanks for the useful software.

Everything works fine when running with a single read file, but when trying to pass multiple files and/or paired read files I get errors like:

The following error was encountered when parsing sequence#2 in the input file: I/O operation on closed file

The weird thing is that the sequence # changes anywhere from 2 to 12. The other weird thing is that I got it to run once on test paired reads where I modified the header somehow but I can't remember what I did 😧

python version 2.7.14
MicrobeCensus version 1.1

best,
-shane

on my machine I can get a reproducible error using the two files:

test_1.fq
@NS500496_94_HCKNKBGXX:1:11101:14811:4036#GGACTCATAGAG/1
TTCGTAACCGATGTGAAGATCAGTTGCGGCACCAGAATAGTCTCCATCTGGATAAGAGATATTGCTCTCAACATTCACATATGGACCAGCAAAAGCTGCACCAGCGAGTAGGAATGGAGATGCTGCAACAGCAGCGATTGTTGATTTGAT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEE<EEAEEEEEEEEEEEEEEEEEEEEEEEEEAE
@NS500496_94_HCKNKBGXX:1:11101:15952:6689#GGACTCATAGAG/1
CAGTAAGACCAGTGCCAGTACCCCTGACATTAGTAGTAGCTACATCAGTTCCATTAGCAACATCAATGAGTACTTGGTTCCTTAGAATCCTAAGATTAGTAAAGTTCTCAATAGAGTCAGCAGGTACATCAAAGGTGGTAGTTCTATTGG
+
AAAAAEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA
@NS500496_94_HCKNKBGXX:1:11101:12994:17022#GGACTCATAGAG/1
GCCGATGTTGAGTGGTTTTTATATACAAGTTCAGTTGGAGTATATCATCCAGCAGAAGTATTTAAAGAAGACGATGTGTGGAAAACATTTCCATCTGAACACGATTGGGAAGCAGGTTGGGCTAAAAGGATTGGTGAAATGAATGTTCAA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

test_2.fq
@NS500496_94_HCKNKBGXX:1:11101:14811:4036#GGACTCATAGAG/2
GCACACACCTAGTGTGACAGTTGTATAAATAACTTCATACAAAGGACTCGAAAGAATCGTAACCCTGCGTTGATGTAAACGGTTCCCCATGTCGGGGGAATTATCATCCGCAAGGGATTTTTTATTCTTGCGAGATACTTAACAAAC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEE<EEEEE<EEEEEEEEEEEEEEEA<AAEEEA
@NS500496_94_HCKNKBGXX:1:11101:15952:6689#GGACTCATAGAG/2
ATATCATGTCATTATGCGTGACACTTATTATGCTGTCACACGTATTGATGATGAGAATAAGATCCTTGCATTTGATATTAAGGAGACAGAAAACACTGCTCTAATTAATAATGACTATAGGGTTCACCTTGACAACCGTGTTAATATCGA
+
AAAAAEEEEEEEEEEEEEEE/EEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAAEEEEE<EEEEEEEEEEEEEEE<EE/<EAEEEEEEEAEEEAEEEEEEEEEEEEEEEEEEEEAEAEEEEEE<<EAEEEAEEEEEEEA
@NS500496_94_HCKNKBGXX:1:11101:12994:17022#GGACTCATAGAG/2
CCACACTTCTAATTTATCATTATCTACAGCTTTTTTAATCAATGATGGTATAACCATTGACCACTTACCAAAGTCATCATCTATCCCATATACATTAGCAGGTCTAACAATTGAACAACAATTCCAATTGTGTTCTTTCATATAAGCTT
+
AAAAAEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEAEEEEEEEEEEEEEEAEEEEEEEE<<EEEEEAEEE/EEEEAEEEEEEEEEEEEEEE/AEEEEA<E<AEAEEEEEE/EEEEEEEEEEEE

training.py: Unused test set?

Hi Stephen,
I just came across the evaluation line, which reads:
error += test_error(train_genomes, train_rates, prop_constant, genome2size)
Is it intentional that train_genomes and train_rates are passed here, rather than the test set?
Thanks

MicrobeCensus is not using all the reads when set to a large number

I have a test set of 10 million reads:

(microbecensus_env) -bash-4.1$ grep -c "^>" test.fasta
10 000 000

I ran the following command:

(microbecensus_env) -bash-4.1$ run_microbe_census.py -n 100000000 -t 4 ./test.fasta testing/microbeconsensus.txt

Here are the results. There are fewer reads than the initial input. (10000000 - 7576069 = 2423931):

(microbecensus_env) -bash-4.1$ cat testing/microbeconsensus.txt
Parameters
metagenome:	./test.fasta
reads_sampled:	7576069
trimmed_length:	200
min_quality:	-5
mean_quality:	-5
filter_dups:	False
max_unknown:	100

Results
average_genome_size:	9730897.86879
total_bases:	1960038815
genome_equivalents:	201.424251023

Do you know what could be happening here? Is it possible to set the value to -1 or something to not subsample at all? Also, if subsampling is done is it possible to add a seed argument so we can get reproducible results?
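
To make the seed request concrete, here is a sketch of seeded reservoir sampling. It is a generic technique, not a description of how MicrobeCensus currently picks reads, and the function name and `seed` parameter are invented for this example.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Pick k items uniformly at random in one pass; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


first = reservoir_sample(range(1000), 10, seed=42)
second = reservoir_sample(range(1000), 10, seed=42)
print(first == second)
```

When `k` is at least the number of items, everything is kept in input order, which would cover the "do not subsample at all" case without a special value such as -1.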

Doc request: TMPDIR

Hey Stephen,

Took your software for a nice spin again! This time, I hit performance issues when running it in parallel on a 64-core EC2 instance with almost half a Terabyte of RAM. I traced the issue to my flock of MicrobeCensus instances putting all of their temporary files on the slow EBS disk, rather than the nice RAMDISK I had provisioned for the running directory.

From poking around the Python code and RTFM, it seems that the choice of the mkstemp() location can be controlled using environment variables like TMPDIR:
https://docs.python.org/2/library/tempfile.html

I would ask if you could add some mention of this in the MicrobeCensus docs, in the hope that it might help the next person wondering how to optimize the disk I/O.

Thanks again for your quality software!
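
For anyone who wants to see the effect in isolation, this sketch shows that Python's `tempfile.mkstemp()` follows `TMPDIR`. The helper name is made up; in practice one would simply export `TMPDIR` in the shell before starting the run, and whether every temporary file MicrobeCensus writes goes through `tempfile` is an assumption here.

```python
import os
import tempfile


def scratch_file_in(directory):
    """Create a temp file while TMPDIR points at directory, then restore the old setting."""
    previous = os.environ.get("TMPDIR")
    os.environ["TMPDIR"] = directory
    tempfile.tempdir = None  # drop the cached location so TMPDIR is read again
    try:
        fd, path = tempfile.mkstemp()
        os.close(fd)
        return path
    finally:
        if previous is None:
            del os.environ["TMPDIR"]
        else:
            os.environ["TMPDIR"] = previous
        tempfile.tempdir = None


# Stand-in for a fast scratch location such as a RAM disk mount point.
fast_dir = tempfile.mkdtemp()
path = scratch_file_in(fast_dir)
print(os.path.dirname(path) == fast_dir)
os.remove(path)
os.rmdir(fast_dir)
```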

"Could not import module 'numpy'"

Hi there,

Just trying out MicrobeCensus but I cannot get it to work. For both the test and run_microbe_census.py I get:
Could not import module 'numpy'

I did the install using setup.py and have updated my numpy but I still get this error. Thoughts?

-Sheryl

does genome equivalents mean bacteria cell numbers

Installation on a cluster redux

I reviewed the closed issue regarding installing on a cluster, but unfortunately it didn't help me much. I'm having the same issue: although the numpy package is installed in a folder that is in my $PYTHONPATH, I still get the

Could not import module 'numpy'

error.

I don't have pip available on my cluster, so I used python setup.py install --user, which seemed to work. I then followed Thomieh73's advice, but I'm still getting the numpy error. Any help? I'm trying to use the tool to analyze a bunch of large files and I don't have the space to do it on my home machine, even though it runs it fine.

Thanks in advance.

UPDATE: Turns out we do have pip, but I can't install via pip because it requires administrator privileges (which I don't have), so I can't use that option.

Python2 dependency in conda

Hi and thank you for this tool,

I've been trying to use MicrobeCensus in a conda environment. The README mentions:

Python version 2 or 3

but the conda package seems to enforce the use of python2.

microbecensus 1.1.1 0

file name : microbecensus-1.1.1-0.tar.bz2
name : microbecensus
version : 1.1.1
build string: 0
build number: 0
channel : https://conda.anaconda.org/bioconda/linux-64
size : 20.7 MB
arch : x86_64
constrains : ()
license : GNU General Public License v3 or later (GPLv3+)
license_family: GPL3
md5 : d36aac7d4a96c824fd82c073d30db207
platform : linux
subdir : linux-64
url : https://conda.anaconda.org/bioconda/linux-64/microbecensus-1.1.1-0.tar.bz2
dependencies:
biopython
numpy
python >=2.7,<3

This is a problem for me since I would like to use MicrobeCensus in a snakemake pipeline, but the packages are in conflict because of this dependency.

UnsatisfiableError: The following specifications were found to be in conflict:
 - microbecensus
 - snakemake

Which one is right? Is conda right to enforce the use of python2 with MicrobeCensus?

Best,
Nils

TypeError: 'NoneType' object is not iterable

Most of the time it works well, but I have found a problem:

xmixu@bm3:~/PATH/TO/DIR$ run_microbe_census.py home/PATH/TO/DIR/SRR7280924.extendedFrags.fastq /home/PATH/TO/DIR/SRR7280924.flash_mc.txt -t 16 -n 100000000
integer division or modulo by zero
Traceback (most recent call last):
File "/share/apps/bio/bio/bin/run_microbe_census.py", line 62, in <module>
est_ags, args = microbe_census.run_pipeline(args)
TypeError: 'NoneType' object is not iterable

I looked into the code and traced the error to the function estimate_average_genome_size(args, paths, agg_hits). There, sum_weights equals 0, which causes the division error. Can you please help me figure out what is going on? Thank you in advance!
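
A guard of the following kind would at least turn the crash into a readable message. This is a generic weighted-mean sketch: only the name `sum_weights` comes from the report above, and the rest is not the actual estimator.

```python
def weighted_mean(values, weights):
    """Weighted mean that fails with a clear message when all weights are zero."""
    sum_weights = sum(weights)
    if sum_weights == 0:
        # Without this check the division raises ZeroDivisionError.
        raise ValueError("sum of weights is zero; no usable hits to average over")
    return sum(v * w for v, w in zip(values, weights)) / float(sum_weights)


print(weighted_mean([2.0, 4.0], [1, 1]))
```

A zero sum would presumably mean that no reads hit any of the weighted gene families, so the input itself (file path, read length, read type) is worth checking too.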

Best,
Xinming

MicrobeCensus crashes, "ValueError: Sequence and quality captions differ."

I let MC decide the file type and the FASTQ quality score encoding, so not sure how this happened.

Any help in figuring this out would be great. Thanks!

taltman1@corn02:/dev/shm/taltman1_tmp/MicrobeCensus$ time run_microbe_census.py -n 40711 -l 500 -t 16 my.fastq test.out
Traceback (most recent call last):
File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/scripts/run_microbe_census.py", line 48, in <module>
est_ags, args = microbe_census.run_pipeline(args)
File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/microbe_census/microbe_census.py", line 480, in run_pipeline
process_seqfile(args, paths)
File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/microbe_census/microbe_census.py", line 273, in process_seqfile
for rec in parse(open_file(args['seqfile']), args['fastq_format'] if args['file_type'] == 'fastq' else 'fasta'):
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 582, in parse
for r in i:
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 1033, in FastqPhredIterator
for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 922, in FastqGeneralIterator
raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ.

real 0m9.063s
user 0m8.930s
sys 0m0.169s
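
Biopython raises this error when the "+" line of a record carries a caption that does not match the "@" title line. A small checker like the one below can locate the first offending record. It is a sketch that assumes strict four-line FASTQ records (no wrapped sequences), and the function name is made up.

```python
def find_caption_mismatch(lines):
    """Return (record_number, title, plus_line) for the first bad record, else None."""
    record = 0
    it = iter(lines)
    for title in it:
        title = title.rstrip("\r\n")
        next(it, "")                      # sequence line, not needed here
        plus = next(it, "").rstrip("\r\n")
        next(it, "")                      # quality line, not needed here
        record += 1
        if not title.startswith("@") or not plus.startswith("+"):
            return record, title, plus
        caption = plus[1:]
        # An empty caption after "+" is fine; a non-empty one must repeat the title.
        if caption and caption != title[1:]:
            return record, title, plus
    return None


example = ["@r1\n", "ACGT\n", "+\n", "IIII\n",
           "@r2\n", "ACGT\n", "+r3\n", "IIII\n"]
print(find_caption_mismatch(example))
```

To scan a real file, pass an open file handle (for example `open("my.fastq")`) instead of the list.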

Feature Request: command line parameter to control subsampling

I want to sample more of my reads to see if it adjusts MC's results. Considering that it's already so fast (processing all of my data in a minute), I'm willing to splurge and sample more to see if the accuracy improves slightly. For example, in Additional Figure 4 from the paper, the GI Tract has more divergence than its peers at 500k reads sampled. In my runs, it is only sampling 344k reads, and it is totally unclear to me where that number comes from.

On a related note, if -n is not an option controlling sampling, then I'm not quite sure what it is for. If I simply want to limit my input to N entries, I'd use head or awk. Is it the first N, or is it a sampling of the full input?

I think that this should be better documented.

Gene Sequences in Fasta

Hello,

Any chance you have the universal genes in fasta format somewhere?

How would it perform on metatranscriptomes?

Hi Stephen,

This is an awesome program that helped me a lot when normalizing gene abundance across different samples. I just wonder whether this software can also be used for metatranscriptomic data?

Cheers,
Heyu
