snayfach / microbecensus

MicrobeCensus estimates the average genome size of microbial communities from metagenomic data

Home Page: http://genomebiology.com/2015/16/1/51

License: GNU General Public License v3.0

Python 95.71% R 4.29%

microbecensus's Introduction

MicrobeCensus

MicrobeCensus is a fast and easy to use pipeline for estimating the average genome size (AGS) of a microbial community from metagenomic data.

In short, AGS is estimated by aligning reads to a set of universal single-copy gene families present in nearly all cellular microbes (Bacteria, Archaea, Fungi). Because these genes occur once per genome, the average genome size of a microbial community is inversely proportional to the fraction of reads which hit these genes.

Once AGS is obtained, it becomes possible to obtain the total coverage of microbial genomes present in a sample (genome equivalents = total bp sequenced/AGS in bp), which can be useful for normalizing gene abundances.
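
As a minimal sketch of the relationship above, the snippet below computes genome equivalents from a total base count and an AGS estimate. The function name and the example numbers are illustrative assumptions, not part of MicrobeCensus:

```python
def genome_equivalents(total_bases, average_genome_size):
    """Genome equivalents = total bp sequenced / AGS in bp."""
    if average_genome_size <= 0:
        raise ValueError("average_genome_size must be positive")
    return float(total_bases) / average_genome_size


# Hypothetical library: 4 Gbp sequenced from a community with a 4 Mbp AGS.
print(genome_equivalents(4000000000, 4000000))
```

For a fixed sequencing depth, a community with a larger AGS yields fewer genome equivalents, which is why gene abundances from communities with different AGS are not directly comparable without this normalization.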

Requirements

Python dependencies (installed via setup.py): Numpy, BioPython
Supported platforms: Mac OSX, Unix/Linux; Windows not currently supported
Python version 2 or 3

Installation

Clone the repo:
git clone https://github.com/snayfach/MicrobeCensus

Or, download the latest release from: https://github.com/snayfach/MicrobeCensus/releases

Unpack the project as necessary and navigate to the installation directory:
cd /path/to/MicrobeCensus

Run setup.py. This will install any dependencies:
python setup.py install or
sudo python setup.py install to install as a superuser

Alternatively, MicrobeCensus can be installed using pip (may not be latest version):
pip install MicrobeCensus or
sudo pip install MicrobeCensus to install as a superuser, or
pip install --user MicrobeCensus to install in your home directory

You can also install using conda (may not be latest version):
conda install -c bioconda microbecensus

Using MicrobeCensus without installing

Although this is not recommended, users may wish to run MicrobeCensus without running setup.py.

BioPython and Numpy must already be installed. You should be able to enter the following commands in the Python interpreter without getting an error:
>>> import Bio.SeqIO
>>> import numpy
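
The same check can be scripted, which is convenient on clusters where several Python installations coexist. This is a small sketch; the helper name missing_modules is hypothetical and not part of MicrobeCensus:

```python
import importlib


def missing_modules(names=("Bio.SeqIO", "numpy")):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means both dependencies are importable by this interpreter.
    print(missing_modules())
```

Run it with the same interpreter that will execute run_microbe_census.py; a module that imports under one Python may still be missing from another.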

Next, add the MicrobeCensus module to your PYTHONPATH environment variable:
export PYTHONPATH=$PYTHONPATH:/path/to/MicrobeCensus or
echo -e "\nexport PYTHONPATH=\$PYTHONPATH:/path/to/MicrobeCensus" >> ~/.bash_profile to avoid entering the command in the future

Finally, add the scripts directory to your PATH environment variable:
export PATH=$PATH:/path/to/MicrobeCensus/scripts or
echo -e "\nexport PATH=\$PATH:/path/to/MicrobeCensus/scripts" >> ~/.bash_profile to avoid entering the command in the future

Now, you should be able to enter the following command in your terminal without getting an error:
run_microbe_census.py -h

Testing the software

After installing MicrobeCensus, we recommend testing the software:
cd /path/to/MicrobeCensus/tests
python test_microbe_census.py

Running MicrobeCensus

MicrobeCensus can be run either as a command-line script or imported into Python as a module.

Command-line usage

run_microbe_census.py [-options] seqfiles outfile

Input/Output (required):

SEQFILES
path to input metagenome(s)
for paired-end metagenomes use commas to specify each file (ex: read_1.fq.gz,read_2.fq.gz)
can be FASTQ/FASTA
can be gzip (.gz) or bzip2 (.bz2) compressed
OUTFILE
path to output file containing AGS estimate

Pipeline throughput (optional):

-n NREADS
number of reads to sample from seqfile and use for AGS estimation.
to use all reads set to large number, like 100000000
(default = 2000000)
-t THREADS
number of threads to use for database search (default = 1)
-e
quit after average genome size is obtained and do not estimate the number of genome equivalents in SEQFILES.
useful in combination with -n for quick tests (default = False)

Quality control (optional):

-l {50,60,70,80,90,100,110,120,130,140,150,175,200,225,250,300,350,400,450,500}
all reads trimmed to this length; reads shorter than this are discarded
(default = median read length)
-q MIN_QUALITY
minimum base-level PHRED quality score (default = -5)
-m MEAN_QUALITY
minimum read-level PHRED quality score (default = -5)
-d
filter duplicate reads (default = False)
-u MAX_UNKNOWN
max percent of unknown bases per read (default = 100)

Misc options:

-h, --help: show this help message and exit
-v: print program's progress to stdout (default = False)
-V, --version: show program's version number and exit
-r RAPSEARCH
path to external RAPsearch2 v2.15 binary.
useful if precompiled RAPsearch2 v2.15 binary included with MicrobeCensus does not work on your system
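
When MicrobeCensus is launched from another script, the command line can be assembled programmatically. The sketch below uses only flags documented above (-n, -t, -l, -e, -v); the helper name build_command and its keyword names are assumptions made for illustration:

```python
def build_command(seqfiles, outfile, nreads=None, threads=None,
                  read_length=None, ags_only=False, verbose=False):
    """Assemble an argument list for run_microbe_census.py."""
    cmd = ["run_microbe_census.py"]
    if nreads is not None:
        cmd += ["-n", str(nreads)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if read_length is not None:
        cmd += ["-l", str(read_length)]
    if ags_only:
        cmd.append("-e")  # stop after AGS, skip genome equivalents
    if verbose:
        cmd.append("-v")
    # Paired-end files are passed as one comma-separated argument.
    cmd += [",".join(seqfiles), outfile]
    return cmd


print(" ".join(build_command(["read_1.fq.gz", "read_2.fq.gz"], "out.txt",
                             nreads=100000, threads=4, verbose=True)))
```

The returned list can be handed to subprocess.run(cmd, check=True), which avoids shell quoting problems with unusual file names.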

Module usage

First, import the module:
>>> from microbe_census import microbe_census

Next, set up your options and arguments, formatted as a dictionary. The path to your metagenome is the only requirement (default values will be used for all other options):
>>> args = {'seqfiles':['MicrobeCensus/microbe_census/example/example.fq.gz']}

If you have paired-end libraries, list each file as a separate element:
>>> args = {'seqfiles':['seqfile_1.fq.gz', 'seqfile_2.fq.gz']}

Alternatively, other options can be specified:

>>> args = {
  'seqfiles':['MicrobeCensus/microbe_census/example/example.fq.gz'],
  'nreads':100000,
  'read_length':100,
  'threads':1,
  'min_quality':10,
  'mean_quality':10,
  'filter_dups':False,
  'max_unknown':0,
  'verbose':True}

Finally, the entire pipeline can be run by passing your arguments to the run_pipeline function. MicrobeCensus returns the estimated AGS of your metagenome, along with a dictionary of used arguments:
>>> average_genome_size, args = microbe_census.run_pipeline(args)

For normalization, you can also estimate the number of genome equivalents in your metagenome:
count_bases = microbe_census.count_bases(args['seqfiles'])
genome_equivalents = count_bases/average_genome_size
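
The arguments dictionary can be checked before the pipeline starts. The sketch below is a hypothetical helper (build_args is not part of MicrobeCensus); it assumes the module accepts the same read lengths that the -l command-line option lists:

```python
# Read lengths listed for the -l command-line option; assumed to apply to
# the 'read_length' key of the module interface as well.
ALLOWED_READ_LENGTHS = (50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150,
                        175, 200, 225, 250, 300, 350, 400, 450, 500)


def build_args(seqfiles, **options):
    """Assemble the dictionary passed to microbe_census.run_pipeline."""
    if isinstance(seqfiles, str):
        # Accept the comma-separated form used on the command line.
        seqfiles = seqfiles.split(",")
    read_length = options.get("read_length")
    if read_length is not None and read_length not in ALLOWED_READ_LENGTHS:
        raise ValueError("unsupported read_length: %r" % (read_length,))
    args = {"seqfiles": list(seqfiles)}
    args.update(options)
    return args


print(build_args("seqfile_1.fq.gz,seqfile_2.fq.gz", nreads=100000))
```

The result can then be passed to microbe_census.run_pipeline as shown above.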

Recommended options

When in doubt, use default parameters! In most cases, MicrobeCensus tries to pick the best parameters for you.
For more accurate estimates of AGS, use -n to increase the number of reads sampled. The default value of 2,000,000 should give good results, but more reads may result in slightly more accurate estimates, particularly when AGS is very large.
Don't use quality filtering options (-q, -m, -d, -u) if you plan on using MicrobeCensus for normalization. In this case, MicrobeCensus should be directly run on the metagenome you used for estimating gene-family abundances.
Use -v/--verbose to print program progress.

Temporary files

MicrobeCensus writes several temporary files to disk. The location where temporary files are written is determined by the environment variable TMPDIR. You can change this location as follows:
export TMPDIR=/new/location/for/temporary/files
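
To confirm which directory will actually be used, you can ask Python's standard tempfile module, which consults TMPDIR. That MicrobeCensus creates its temporary files through tempfile is an assumption here, consistent with the behavior described above; the helper name is hypothetical:

```python
import os
import tempfile


def effective_temp_dir():
    """Return the directory tempfile will use in this environment."""
    # tempfile caches its choice; clear the cache so the current value of
    # TMPDIR is taken into account.
    tempfile.tempdir = None
    return tempfile.gettempdir()


print("TMPDIR =", os.environ.get("TMPDIR"))
print("temporary files go to:", effective_temp_dir())
```

If TMPDIR points to a directory that does not exist or is not writable, tempfile silently falls back to another location such as /tmp, so it is worth checking the printed path before launching many jobs in parallel.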

Output format

Parameters
metagenome: path to your metagenome(s)
reads_sampled: the number of reads sampled from the metagenome to estimate AGS
trimmed_length: reads were trimmed to this length to estimate AGS
min_quality: minimum per-base quality
mean_quality: minimum average-base quality
filter_dups: filter exact duplicate reads
max_unknown: filter reads where the % of Ns is greater than this

Results
average_genome_size: the average genome size (in bp) of your input metagenome
total_bases: the total number of base-pairs of your input metagenome
genome_equivalents: the total coverage of microbial genomes in your input metagenome
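
The output file is plain text with one "name: value" pair per line under the Parameters and Results headings. A minimal parsing sketch follows; the function name is hypothetical and the layout is assumed to match the description above:

```python
def parse_microbe_census_output(text):
    """Split MicrobeCensus output text into (parameters, results) dicts."""
    parameters, results = {}, {}
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "Parameters":
            section = parameters
        elif line == "Results":
            section = results
        elif section is not None and ":" in line:
            key, value = line.split(":", 1)
            section[key.strip()] = value.strip()
    # All values in the Results section are numeric.
    results = {key: float(value) for key, value in results.items()}
    return parameters, results


example = """Parameters
metagenome:\texample.fq.gz
reads_sampled:\t2000000

Results
average_genome_size:\t4000000.0
total_bases:\t8000000000
genome_equivalents:\t2000.0
"""
params, res = parse_microbe_census_output(example)
print(params["metagenome"], res["genome_equivalents"])
```

Parameter values are left as strings because they mix paths, numbers, and booleans.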

Normalization

The number of genome equivalents can be used to normalize count data obtained from metagenomes using the statistic RPKG (reads per kb per genome equivalent). This is similar to the commonly used statistic RPKM, but instead of dividing by the number of total mapped reads, we divide by the number of genome equivalents:

RPKG = (reads mapped to gene)/(gene length in kb)/(genome equivalents)

Use case: We have two metagenomic libraries, L1 and L2, and we use MicrobeCensus to estimate the number of genome equivalents in each:

GE_L1 = 40
GE_L2 = 20

Next, we map reads from each library to a reference database which contains a gene of interest G. G is 1000 bp long. We get 100 reads mapped to gene G from each library:

LENGTH_G = 1,000 bp
MAPPED_READS_G_L1 = 100
MAPPED_READS_G_L2 = 100

Finally, we quantify RPKG for gene G in each library:

RPKG for G in L1 = (100 mapped reads)/(1 kb)/(40 GE) = 2.5
RPKG for G in L2 = (100 mapped reads)/(1 kb)/(20 GE) = 5.0
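
The same calculation as a short sketch, using the numbers from the use case above (the function name rpkg is illustrative):

```python
def rpkg(mapped_reads, gene_length_bp, genome_equivalents):
    """Reads per kb of gene per genome equivalent."""
    gene_length_kb = gene_length_bp / 1000.0
    return mapped_reads / gene_length_kb / genome_equivalents


LENGTH_G = 1000  # bp
print(rpkg(100, LENGTH_G, 40))  # library L1
print(rpkg(100, LENGTH_G, 20))  # library L2
```

Although both libraries have the same number of reads mapped to G, the gene is twice as abundant per genome in L2, because L2 contains half as many genome equivalents.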

Software speed

Run times are for a 150 bp library. Expect longer/shorter runtimes depending on read length.

Threads (-t)	Reads/Second
1	830
2	1,300
4	1,800
8	2,000
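
These rates can be turned into a rough runtime estimate. The sketch below assumes the rates in the table apply unchanged to your data (150 bp reads, comparable hardware), so treat the result as an order-of-magnitude guide only:

```python
# Reads per second for each thread count, copied from the table above.
READS_PER_SECOND = {1: 830, 2: 1300, 4: 1800, 8: 2000}


def estimated_minutes(nreads, threads):
    """Approximate runtime in minutes for `nreads` sampled reads."""
    if threads not in READS_PER_SECOND:
        raise ValueError("no benchmark for %d threads" % threads)
    return nreads / float(READS_PER_SECOND[threads]) / 60.0


for t in sorted(READS_PER_SECOND):
    print(t, "threads:", round(estimated_minutes(2000000, t), 1), "min")
```

With the default of 2,000,000 reads this gives roughly 40 minutes on one thread and about 17 minutes on eight, which also shows that throughput scales less than linearly with thread count.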

Training

We have included scripts and documentation for retraining MicrobeCensus, using user-supplied training genomes and gene families. Documentation and scripts can be found under: MicrobeCensus/training

Citing

If you use MicrobeCensus, please cite:

Nayfach, S. and Pollard, K.S. Average genome size estimation improves comparative metagenomics and sheds light on the functional ecology of the human microbiome. Genome biology 2015;16(1):51.


microbecensus's Issues

Pip install error from official 1.1.0 tarball

Just created a clean conda environment with Python 3.6 and tried to install MicrobeCensus into it using pip, as instructed by the installation manual. Note that the conda environment is active when I tried the commands below, but I removed the bash prefix for brevity here.

I got this:

$ wget https://github.com/snayfach/MicrobeCensus/archive/v1.1.0.tar.gz -O MicrobeCensus-v1.1.0.tar.gz
$ pip install MicrobeCensus-v1.1.0.tar.gz
Processing ./MicrobeCensus-v1.1.0.tar.gz                                               
    Complete output from command python setup.py egg_info:                             
    Traceback (most recent call last):                                                 
      File "<string>", line 1, in <module>                                             
      File "/tmp/pip-gxty91hz-build/setup.py", line 12                                 
        mode = ((os.stat(fn).st_mode) | 0555) & 07777                                  
                                           ^                                           
    SyntaxError: invalid token                                                         
                                                                                       
    ----------------------------------------                                           
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-gxty91hz-build/

I see the same thing trying to install from pypi, but then it downloads an older version:

$ pip install MicrobeCensus
Collecting MicrobeCensus                                                                             
  Downloading MicrobeCensus-1.0.7.tar.gz (22.0MB)                                                    
    100% |████████████████████████████████| 22.0MB 31.7MB/s                                          
    Complete output from command python setup.py egg_info:                                           
    Traceback (most recent call last):                                                               
      File "<string>", line 1, in <module>                                                           
      File "/tmp/pip-build-v02fdl_2/MicrobeCensus/setup.py", line 12                                 
        mode = ((os.stat(fn).st_mode) | 0555) & 07777                                                
                                           ^                                                         
    SyntaxError: invalid token                                                                       
                                                                                                     
    ----------------------------------------                                                         
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-v02fdl_2/MicrobeCensus/
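
For context on the traceback above: 0555 and 07777 are Python 2 octal literals, which Python 3 rejects with a SyntaxError. The 0o prefix is accepted by both Python 2.6+ and Python 3. A sketch of the flagged expression in portable form follows; the wrapping function is hypothetical and only illustrates the literal syntax:

```python
import os


def add_read_execute_bits(fn):
    """Return the permission bits of `fn` with r-x added for all users."""
    # Same expression as the line in the traceback, with 0o-style literals.
    mode = (os.stat(fn).st_mode | 0o555) & 0o7777
    return mode
```

For a file with permissions 600 on a Unix system, this returns 0o755.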

could not convert string to float: FIRM56_P641593085

Hello,

I was running MicrobeCensus on my pair end reads and it threw the error:

could not convert string to float: FIRM56_P641593085
Traceback (most recent call last):
File "/home/USR/miniconda2/bin/run_microbe_census.py", line 62, in <module>
est_ags, args = microbe_census.run_pipeline(args)
TypeError: 'NoneType' object is not iterable

Is there any way of fixing this? Thanks!

MicrobeCensus crashes, "ValueError: Sequence and quality captions differ."

I let MC decide the file type and the FASTQ quality score encoding, so not sure how this happened.

Any help in figuring this out would be great. Thanks!

taltman1@corn02:/dev/shm/taltman1_tmp/MicrobeCensus$ time run_microbe_census.py -n 40711 -l 500 -t 16 my.fastq test.out
Traceback (most recent call last):
File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/scripts/run_microbe_census.py", line 48, in <module>
est_ags, args = microbe_census.run_pipeline(args)
File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/microbe_census/microbe_census.py", line 480, in run_pipeline
process_seqfile(args, paths)
File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/microbe_census/microbe_census.py", line 273, in process_seqfile
for rec in parse(open_file(args['seqfile']), args['fastq_format'] if args['file_type'] == 'fastq' else 'fasta'):
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 582, in parse
for r in i:
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 1033, in FastqPhredIterator
for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 922, in FastqGeneralIterator
raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ.

real 0m9.063s
user 0m8.930s
sys 0m0.169s

problem reading multiple files

Hi thanks for the useful software.

Everything works fine when running with a single read file, but when trying to pass multiple files and/or paired read files I get errors like:

The following error was encountered when parsing sequence#2 in the input file: I/O operation on closed file

The weird thing is that the sequence # changes anywhere from 2 to 12. The other weird thing is that I got it to run once on test paired reads where I modified the header somehow but I can't remember what I did 😧

python version 2.7.14
MicrobeCensus version 1.1

best,
-shane

on my machine I can get a reproducible error using the two files:

test_1.fq
@NS500496_94_HCKNKBGXX:1:11101:14811:4036#GGACTCATAGAG/1
TTCGTAACCGATGTGAAGATCAGTTGCGGCACCAGAATAGTCTCCATCTGGATAAGAGATATTGCTCTCAACATTCACATATGGACCAGCAAAAGCTGCACCAGCGAGTAGGAATGGAGATGCTGCAACAGCAGCGATTGTTGATTTGAT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEE<EEAEEEEEEEEEEEEEEEEEEEEEEEEEAE
@NS500496_94_HCKNKBGXX:1:11101:15952:6689#GGACTCATAGAG/1
CAGTAAGACCAGTGCCAGTACCCCTGACATTAGTAGTAGCTACATCAGTTCCATTAGCAACATCAATGAGTACTTGGTTCCTTAGAATCCTAAGATTAGTAAAGTTCTCAATAGAGTCAGCAGGTACATCAAAGGTGGTAGTTCTATTGG
+
AAAAAEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA
@NS500496_94_HCKNKBGXX:1:11101:12994:17022#GGACTCATAGAG/1
GCCGATGTTGAGTGGTTTTTATATACAAGTTCAGTTGGAGTATATCATCCAGCAGAAGTATTTAAAGAAGACGATGTGTGGAAAACATTTCCATCTGAACACGATTGGGAAGCAGGTTGGGCTAAAAGGATTGGTGAAATGAATGTTCAA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

test_2.fq
@NS500496_94_HCKNKBGXX:1:11101:14811:4036#GGACTCATAGAG/2
GCACACACCTAGTGTGACAGTTGTATAAATAACTTCATACAAAGGACTCGAAAGAATCGTAACCCTGCGTTGATGTAAACGGTTCCCCATGTCGGGGGAATTATCATCCGCAAGGGATTTTTTATTCTTGCGAGATACTTAACAAAC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEE<EEEEE<EEEEEEEEEEEEEEEA<AAEEEA
@NS500496_94_HCKNKBGXX:1:11101:15952:6689#GGACTCATAGAG/2
ATATCATGTCATTATGCGTGACACTTATTATGCTGTCACACGTATTGATGATGAGAATAAGATCCTTGCATTTGATATTAAGGAGACAGAAAACACTGCTCTAATTAATAATGACTATAGGGTTCACCTTGACAACCGTGTTAATATCGA
+
AAAAAEEEEEEEEEEEEEEE/EEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAAEEEEE<EEEEEEEEEEEEEEE<EE/<EAEEEEEEEAEEEAEEEEEEEEEEEEEEEEEEEEAEAEEEEEE<<EAEEEAEEEEEEEA
@NS500496_94_HCKNKBGXX:1:11101:12994:17022#GGACTCATAGAG/2
CCACACTTCTAATTTATCATTATCTACAGCTTTTTTAATCAATGATGGTATAACCATTGACCACTTACCAAAGTCATCATCTATCCCATATACATTAGCAGGTCTAACAATTGAACAACAATTCCAATTGTGTTCTTTCATATAAGCTT
+
AAAAAEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEAEEEEEEEEEEEEEEAEEEEEEEE<<EEEEEAEEE/EEEEAEEEEEEEEEEEEEEE/AEEEEA<E<AEAEEEEEE/EEEEEEEEEEEE

MicrobeCensus is not using all the reads when set to a large number

I have a test set of 10 million reads:

(microbecensus_env) -bash-4.1$ grep -c "^>" test.fasta
10 000 000

I ran the following command:

(microbecensus_env) -bash-4.1$ run_microbe_census.py -n 100000000 -t 4 ./test.fasta testing/microbeconsensus.txt

Here are the results. There are fewer reads than the initial input. (10000000 - 7576069 = 2423931):

(microbecensus_env) -bash-4.1$ cat testing/microbeconsensus.txt
Parameters
metagenome:	./test.fasta
reads_sampled:	7576069
trimmed_length:	200
min_quality:	-5
mean_quality:	-5
filter_dups:	False
max_unknown:	100

Results
average_genome_size:	9730897.86879
total_bases:	1960038815
genome_equivalents:	201.424251023

Do you know what could be happening here? Is it possible to set the value to -1 or something to not subsample at all? Also, if subsampling is done is it possible to add a seed argument so we can get reproducible results?

Python2 dependency in conda

Hi and thank you for this tool,

I've been trying to use MicrobeCensus in a conda environment. The README mentions:

Python version 2 or 3

but the conda package seems to enforce the use of python2.

microbecensus 1.1.1 0

file name : microbecensus-1.1.1-0.tar.bz2
name : microbecensus
version : 1.1.1
build string: 0
build number: 0
channel : https://conda.anaconda.org/bioconda/linux-64
size : 20.7 MB
arch : x86_64
constrains : ()
license : GNU General Public License v3 or later (GPLv3+)
license_family: GPL3
md5 : d36aac7d4a96c824fd82c073d30db207
platform : linux
subdir : linux-64
url : https://conda.anaconda.org/bioconda/linux-64/microbecensus-1.1.1-0.tar.bz2
dependencies:
biopython
numpy
python >=2.7,<3

This is a problem for me since I would like to use MicrobeCensus in a snakemake pipeline, but the packages are in conflict because of this dependency.

UnsatisfiableError: The following specifications were found to be in conflict:
 - microbecensus
 - snakemake

Which one is right? Is conda right to enforce the use of python2 with MicrobeCensus?

Best,
Nils

Could not import module 'pkg_resources'

Following the install instructions, get this error:
Could not import module 'pkg_resources'

Linux version 3.10-3-amd64 ([email protected]) (gcc version 4.7.2 (Debian 4.7.2-5) ) #1 SMP Debian 3.10.11-1 (2013-09-10)

Python 2.7.3

making sense of results...

Hi, thanks for this tool! Very easy to use and great documentation. Still, I'm struggling with the output of my run. I'm working on a set of environmental samples that I would like to use for comparative analysis. However, the two types of sample sets differ in the contribution of eukaryotes (mainly diatoms), as well as viruses. As far as I understood from your paper, AGS may not be very reliable then, but RPKG may still be meaningful? I don't have a lot of examples to compare to, but I get the feeling that my estimates are very high, e.g.
Sample with very low expected contribution by eukaryotes:
Parameters
metagenome: CBmetaG_2.fastq.gz
reads_sampled: 2000000
trimmed_length: 150
min_quality: -5
mean_quality: -5
filter_dups: False
max_unknown: 100

Results
average_genome_size: 18141036.055869997
total_bases: 46309620851
genome_equivalents: 2552.7550195246613

And sample with very large expected contribution by eukaryotes:
Parameters
metagenome: SBmetaG_2.fastq.gz
reads_sampled: 2000000
trimmed_length: 150
min_quality: -5
mean_quality: -5
filter_dups: False
max_unknown: 100

Results
average_genome_size: 10211877.522966972
total_bases: 78896001286
genome_equivalents: 7725.905555423999

My gene prediction/counts were done on assembled reads; however, as input for MicrobeCensus, I'm using the unassembled reads that were used as assembly input.
Am I doing this right?

Thanks for your help!

run_microbe_census.py crashes when input file not found

It should check whether the file exists before attempting to open it, and present a clean error message to the user.

temp files when program exits of error

I have noticed that when MicrobeCensus errors out or execution is halted that the temp files created by the processes remain. You might want to consider altering the code to remove these files if the program exits prematurely.

Something wrong when put 'python setup.py install'

/envs/ham/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
setuptools.SetuptoolsDeprecationWarning,
/envs/ham/lib/python3.7/site-packages/setuptools/command/easy_install.py:147: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
EasyInstallDeprecationWarning,
/envs/ham/lib/python3.7/site-packages/pkg_resources/__init__.py:125: PkgResourcesDeprecationWarning: 4.0.0-unsupported is an invalid version and will not be supported in a future release
PkgResourcesDeprecationWarning,
running bdist_egg
running egg_info
creating MicrobeCensus.egg-info
writing MicrobeCensus.egg-info/PKG-INFO
writing dependency_links to MicrobeCensus.egg-info/dependency_links.txt
writing requirements to MicrobeCensus.egg-info/requires.txt
writing top-level names to MicrobeCensus.egg-info/top_level.txt
writing manifest file 'MicrobeCensus.egg-info/SOURCES.txt'
reading manifest file 'MicrobeCensus.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching 'training/README.txt'
adding license file 'LICENSE.txt'
writing manifest file 'MicrobeCensus.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/microbe_census
copying microbe_census/microbe_census.py -> build/lib/microbe_census
copying microbe_census/__init__.py -> build/lib/microbe_census
creating build/lib/tests
copying tests/test_microbe_census.py -> build/lib/tests
copying tests/__init__.py -> build/lib/tests
creating build/lib/microbe_census/data
copying microbe_census/data/pars.map -> build/lib/microbe_census/data
copying microbe_census/data/rapdb_2.15 -> build/lib/microbe_census/data
copying microbe_census/data/read_len.map -> build/lib/microbe_census/data
copying microbe_census/data/gene_fam.map -> build/lib/microbe_census/data
copying microbe_census/data/coefficients.map -> build/lib/microbe_census/data
copying microbe_census/data/rapdb_2.15.info -> build/lib/microbe_census/data
copying microbe_census/data/gene_len.map -> build/lib/microbe_census/data
copying microbe_census/data/weights.map -> build/lib/microbe_census/data
creating build/lib/microbe_census/bin
copying microbe_census/bin/rapsearch_Darwin_2.15 -> build/lib/microbe_census/bin
copying microbe_census/bin/rapsearch_Linux_2.15 -> build/lib/microbe_census/bin
copying microbe_census/bin/prerapsearch_Darwin_2.15 -> build/lib/microbe_census/bin
copying microbe_census/bin/prerapsearch_Linux_2.15 -> build/lib/microbe_census/bin
creating build/lib/microbe_census/example
copying microbe_census/example/example.fa.gz -> build/lib/microbe_census/example
copying microbe_census/example/example.fq.gz -> build/lib/microbe_census/example
creating build/lib/tests/data
copying tests/data/metagenome.fa.gz -> build/lib/tests/data
copying tests/data/community.txt -> build/lib/tests/data
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/microbe_census
creating build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/pars.map -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/rapdb_2.15 -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/read_len.map -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/gene_fam.map -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/coefficients.map -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/rapdb_2.15.info -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/gene_len.map -> build/bdist.linux-x86_64/egg/microbe_census/data
copying build/lib/microbe_census/data/weights.map -> build/bdist.linux-x86_64/egg/microbe_census/data
creating build/bdist.linux-x86_64/egg/microbe_census/example
copying build/lib/microbe_census/example/example.fa.gz -> build/bdist.linux-x86_64/egg/microbe_census/example
copying build/lib/microbe_census/example/example.fq.gz -> build/bdist.linux-x86_64/egg/microbe_census/example
copying build/lib/microbe_census/microbe_census.py -> build/bdist.linux-x86_64/egg/microbe_census
creating build/bdist.linux-x86_64/egg/microbe_census/bin
copying build/lib/microbe_census/bin/rapsearch_Darwin_2.15 -> build/bdist.linux-x86_64/egg/microbe_census/bin
copying build/lib/microbe_census/bin/rapsearch_Linux_2.15 -> build/bdist.linux-x86_64/egg/microbe_census/bin
copying build/lib/microbe_census/bin/prerapsearch_Darwin_2.15 -> build/bdist.linux-x86_64/egg/microbe_census/bin
copying build/lib/microbe_census/bin/prerapsearch_Linux_2.15 -> build/bdist.linux-x86_64/egg/microbe_census/bin
copying build/lib/microbe_census/__init__.py -> build/bdist.linux-x86_64/egg/microbe_census
creating build/bdist.linux-x86_64/egg/tests
creating build/bdist.linux-x86_64/egg/tests/data
copying build/lib/tests/data/metagenome.fa.gz -> build/bdist.linux-x86_64/egg/tests/data
copying build/lib/tests/data/community.txt -> build/bdist.linux-x86_64/egg/tests/data
copying build/lib/tests/test_microbe_census.py -> build/bdist.linux-x86_64/egg/tests
copying build/lib/tests/__init__.py -> build/bdist.linux-x86_64/egg/tests
byte-compiling build/bdist.linux-x86_64/egg/microbe_census/microbe_census.py to microbe_census.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/microbe_census/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/test_microbe_census.py to test_microbe_census.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/tests/__init__.py to __init__.cpython-37.pyc
changing mode of build/bdist.linux-x86_64/egg/microbe_census/bin/rapsearch_Darwin_2.15 to 1141
changing mode of build/bdist.linux-x86_64/egg/microbe_census/bin/rapsearch_Linux_2.15 to 1141
changing mode of build/bdist.linux-x86_64/egg/microbe_census/bin/prerapsearch_Darwin_2.15 to 1041
changing mode of build/bdist.linux-x86_64/egg/microbe_census/bin/prerapsearch_Linux_2.15 to 1141
creating build/bdist.linux-x86_64/egg/EGG-INFO
installing scripts to build/bdist.linux-x86_64/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/scripts-3.7
copying and adjusting scripts/run_microbe_census.py -> build/scripts-3.7
changing mode of build/scripts-3.7/run_microbe_census.py from 664 to 775
creating build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-3.7/run_microbe_census.py -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/run_microbe_census.py to 775
copying MicrobeCensus.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MicrobeCensus.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MicrobeCensus.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MicrobeCensus.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MicrobeCensus.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
microbe_census.__pycache__.microbe_census.cpython-37: module references __file__
tests.__pycache__.test_microbe_census.cpython-37: module references __file__
creating dist
creating 'dist/MicrobeCensus-1.1.0-py3.7.egg' and adding 'build/bdist.linux-x86_64/egg' to it
error: [Errno 13] Permission denied: 'build/bdist.linux-x86_64/egg/microbe_census/bin/prerapsearch_Darwin_2.15'

Then I used "MicrobeCensus/build/scripts-3.7/run_microbe_census.py", it prompted me "ModuleNotFoundError: No module named 'microbe_cencus'"

P.S. I can't sudo

training.py: Unused test set?

Hi Stephen,
I just came across the line for evaluation which reads:
error += test_error(train_genomes, train_rates, prop_constant, genome2size)
Does it mean to use train_genomes, train_rates as the parameters?
Thanks

does genome equivalents mean bacteria cell numbers

question: calculation of RPKG

Hi @snayfach ,

I've been going back and forth between normalization methods for my data, and wasn't sure whether to trust the microbecensus results for some of my samples (very large AGS estimates, but also probably a high number of viruses). I finally decided to move forward with it, but have a question concerning the calculations:

I'm trying to use your suggested RPKG normalization for my KO abundance table. I'm not sure how to best do step 2 of this equation though. Is there a list of gene lengths for each KO? If so, where can I find it?
"The RPKG of a KO in a metagenome was computed by: 1) counting the number of reads mapped to the KO; 2) dividing (1) by the length of the KO in kilobase pairs; and 3) dividing the result of (2) by the number of sequenced genome equivalents."

Thanks a lot!

Feature Request: Add support for DIAMOND

I've been using Huson's DIAMOND sequence similarity tool, and have been pleased with its performance. Please consider adding support.

Handling errors with command line parameters

When I accidentally gave an incorrect path (i.e. the file did not exist) as input to microbe_census, the error printed suggests that I have supplied 'Incorrect number of command line arguments'. Any chance the actual error message could be printed out, or handled specifically?

Thanks!

"Could not import module 'numpy'"

Hi there,

Just trying out MicrobeCensus but I cannot get it to work. For both the test and run_microbe_census.py I get:
Could not import module 'numpy'

I did the install using setup.py and have updated my numpy but I still get this error. Thoughts?

-Sheryl

MicrobeCensus should use requested # of cores

Even when specifying that MC use 16 cores, by monitoring with 'top' I see that it never utilizes more than two or three cores at a time. It took 3+ hours to process a 2.3 Gbyte read file, sampling all reads. It would have been much faster if it were able to fully exploit the available horsepower.

TypeError: cannot unpack non-iterable NoneType object in run_microbe_census.py

Hi,

I ran into a TypeError in the conda build of MicrobeCensus v1.1.1. Could you give me any suggestions? Many thanks!

My running code:
run_microbe_census.py -t 100 -n 100000000 SK0108G_A_1.fastq,SK0108G_A_2.fastq microbecensus/SK0108G_A_out

Error:
a bytes-like object is required, not 'str'
Traceback (most recent call last):
File "run_microbe_census.py", line 62, in <module>
est_ags, args = microbe_census.run_pipeline(args)
^^^^^^^^^^^^^
TypeError: cannot unpack non-iterable NoneType object

You're living in the past, man

Update copyright in -v output to 2015, before everyone pounces on your public domain code.

installation on a cluster

Hi,
I am trying to install MicrobeCensus on the Abel computing cluster here in Oslo, Norway. The cluster runs 64-bit Linux (CentOS 6).

Since I lack sufficient permissions, I followed your alternative installation suggestion.
I can load python2 (v2.7.10) since that has the biopython and numpy packages.

I have tested if I can import numpy and Bio.SeqIO on the python commandline.
I can.
These packages are installed in the folder:

/cluster/software/VERSIONS/python_packages-2.7_3/cluster/software/VERSIONS/python2-2.7.9/lib/python2.7/site-packages/

This folder is found in my $PYTHONPATH after loading python2.

Next I added microbecensus to the $PYTHONPATH
and I added the scripts to the $PATH.

When I then call: run_microbe_census.py -h
I get:

Could not import module 'numpy'

I believe I have set this up correctly, so I am surprised that numpy is not found.

Do you have any idea on how I can solve this?
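
A small diagnostic that may help narrow this down: run it with the same interpreter that launches run_microbe_census.py and compare the output with the interactive session where the import works. The helper name `try_import` is made up; the script uses only the standard library and runs on Python 2 and 3.

```python
import sys


def try_import(name):
    """Return (module file, None) on success or (None, error text) on failure."""
    try:
        module = __import__(name)
    except ImportError as error:
        return None, str(error)
    return getattr(module, "__file__", None), None


print("interpreter: %s" % sys.executable)
for entry in sys.path:
    print("search path: %s" % entry)

location, problem = try_import("numpy")
if problem is None:
    print("numpy loaded from: %s" % location)
else:
    # This is the underlying ImportError text, which a generic
    # "Could not import module" message would hide.
    print("numpy import failed: %s" % problem)
```

If the interpreter path differs from the one loaded by the module system, the script's shebang line is probably picking up a different Python than the one that has numpy.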

Doc request: TMPDIR

Hey Stephen,

Took your software for a nice spin again! This time, I hit performance issues when running it in parallel on a 64-core EC2 instance with almost half a Terabyte of RAM. I traced the issue to my flock of MicrobeCensus instances putting all of their temporary files on the slow EBS disk, rather than the nice RAMDISK I had provisioned for the running directory.

From poking around the Python code and RTFM, it seems that the choice of the mkstemp() location can be controlled using environment variables like TMPDIR:
https://docs.python.org/2/library/tempfile.html

I would ask if you could add some mention of this in the MicrobeCensus docs, in the hope that it might help the next person wondering how to optimize the disk I/O.
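
For anyone else reading, here is a minimal sketch of the behaviour in plain Python: the tempfile module checks TMPDIR first when choosing its default directory. The helper name `make_temp_in` is made up, and resetting `tempfile.tempdir` is only needed because the default is cached once it has been computed within a process.

```python
import os
import tempfile


def make_temp_in(directory):
    """Create a temporary file under `directory` by way of TMPDIR."""
    os.environ["TMPDIR"] = directory
    tempfile.tempdir = None  # clear the cached default so TMPDIR is re-read
    handle, path = tempfile.mkstemp()
    os.close(handle)
    return path
```

In practice it should be enough to set the variable in the shell before launching the tool, for example `TMPDIR=/path/to/ramdisk` in front of the usual command, since a fresh process reads it on first use.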

Thanks again for your quality software!

Test failed

Hi, just installed and ran the test in the "tests" directory with the following:

`.E.

ERROR: test_ags_estimation (main.Pipeline)

Traceback (most recent call last):
File "test_microbe_census.py", line 21, in setUp
self.observed = microbe_census.run_pipeline({'seqfiles':[self.infile]})[0]
File "/usr/lib/python2.7/site-packages/microbe_census/microbe_census.py", line 583, in run_pipeline
paths = get_relative_paths(args)
File "/usr/lib/python2.7/site-packages/microbe_census/microbe_census.py", line 109, in get_relative_paths
if args['rapsearch']:
KeyError: 'rapsearch'

Ran 3 tests in 0.066s

FAILED (errors=1)`
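
The traceback shows the test passing a dictionary with only 'seqfiles', while get_relative_paths indexes args['rapsearch'] directly. One possible fix, sketched below with made-up names (`DEFAULTS`, `with_defaults`), is to fill in missing options before the pipeline reads them. Only the 'rapsearch' and 'seqfiles' keys come from the traceback; the default value of None is an assumption.

```python
# Hypothetical defaults table; the real option set is larger.
DEFAULTS = {"rapsearch": None}


def with_defaults(args):
    """Return a copy of args in which every missing option has its default."""
    merged = dict(DEFAULTS)
    merged.update(args)
    return merged


args = with_defaults({"seqfiles": ["reads.fastq"]})
if args["rapsearch"]:          # no KeyError now
    print("using custom rapsearch binary")
```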

How would MicrobeCensus perform on metatranscriptomes?

Hi Stephen,

This is an awesome program that helped me a lot with normalizing gene abundance across different samples. I just wonder whether it can also be used for metatranscriptomic data?

Cheers,
Heyu

TypeError: 'NoneType' object is not iterable

Most of the time it works well, but I have found a problem:

xmixu@bm3:~/PATH/TO/DIR$ run_microbe_census.py home/PATH/TO/DIR/SRR7280924.extendedFrags.fastq /home/PATH/TO/DIR/SRR7280924.flash_mc.txt -t 16 -n 100000000
integer division or modulo by zero
Traceback (most recent call last):
File "/share/apps/bio/bio/bin/run_microbe_census.py", line 62, in <module>
est_ags, args = microbe_census.run_pipeline(args)
TypeError: 'NoneType' object is not iterable

Looking into the code, the failure is in the function estimate_average_genome_size(args, paths, agg_hits): sum_weights equals 0 there, which causes the error. Could you please help me figure out what is going on? Thank you in advance!
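
To illustrate the kind of guard that would turn this into a readable error: the sketch below computes a weighted mean and refuses to divide when the weights sum to zero. It is not the actual body of estimate_average_genome_size; the function name `weighted_mean` and the wording of the message are assumptions.

```python
def weighted_mean(estimates, weights):
    """Weighted average that fails loudly when there is nothing to average."""
    total = sum(weights)
    if total == 0:
        # Assumed cause: no reads hit any marker gene family.
        raise ValueError("sum of weights is 0; cannot estimate AGS")
    weighted_sum = sum(e * w for e, w in zip(estimates, weights))
    return weighted_sum / float(total)
```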

Best,
Xinming

MC should use the Unix exit code convention

MC reports "Error! No reads remaining after filtering!"

But the exit code is 0. Unix convention is to have a non-zero, non-negative exit code when there's an error.

This makes it harder to detect errors when running batches of MC runs via a script.
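
A minimal sketch of the convention, using a made-up function name (`report_filtering`): return a status from the reporting code and hand it to sys.exit at the entry point.

```python
import sys


def report_filtering(reads_remaining):
    """Return a Unix-style exit status: 0 on success, 1 on error."""
    if reads_remaining == 0:
        sys.stderr.write("Error! No reads remaining after filtering!\n")
        return 1
    return 0


# A script's entry point would then end with something like:
#     sys.exit(report_filtering(reads_remaining))
```

A batch script can then test the exit status of each run instead of parsing the printed output.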

Installation on a cluster redux

I reviewed the closed issue regarding installing on a cluster, but unfortunately it didn't help me much. I'm having the same issue: although the numpy package is installed in a folder that is in my $PYTHONPATH, I still get the

Could not import module 'numpy'

error.

I don't have pip available on my cluster, so I used python setup.py install --user, which seemed to work. I then followed Thomieh73's advice, but I'm still getting the numpy error. Any help? I'm trying to use the tool to analyze a bunch of large files, and I don't have the space to do that on my home machine, even though the tool runs fine there.

Thanks in advance.

UPDATE: Turns out we do have pip, but I can't install via pip because it requires administrator privileges (which I don't have), so I can't use that option.

Feature Request: command line parameter to control subsampling

I want to sample more of my reads to see if it adjusts MC's results. Considering that it's already so fast (processing all of my data in a minute), I'm willing to splurge and sample more to see if the accuracy improves slightly. For example, in Additional Figure 4 from the paper, the GI Tract has more divergence than its peers at 500k reads sampled. In my runs, it is only sampling 344k reads, and it is totally unclear to me where that number comes from.

On a related note, if -n is not an option controlling sampling, then I'm not quite sure what it is for. If I simply want to limit my input to N entries, I'd use head or awk. Is it the first N, or is it a sampling of the full input?

I think that this should be better documented.
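
To make the distinction concrete, here is a sketch of the two behaviours being asked about: taking the first N records versus drawing a uniform random sample of N from the whole input. This does not claim to show which one -n actually does; the function names are made up, and the random version is standard reservoir sampling (Algorithm R).

```python
import random
from itertools import islice


def first_n(records, n):
    """Keep the first n records, like `head`."""
    return list(islice(records, n))


def reservoir_sample(records, n, seed=None):
    """Uniform random sample of n records from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < n:
            sample.append(record)          # fill the reservoir first
        else:
            slot = rng.randint(0, index)   # inclusive on both ends
            if slot < n:
                sample[slot] = record      # replace with probability n/(index+1)
    return sample
```

The first approach is biased toward whatever order the reads were written in; the second reads the whole file once but gives every read the same chance of being kept.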

differences with other normalization methods

Hi again

Thank you for your wonderful software. I love the fact that I am able to normalize using the estimated number of genomes in each sample.
Out of curiosity, I compared the MicrobeCensus results (from a group of 50 samples) to a previous normalization attempt using the sum of 16S copies in the same samples. However, the two normalization methods did not agree at all.
Would you say that this is mostly because 16S is a multicopy gene while MicrobeCensus uses single-copy genes?
I just wanted to hear your take on this.

Thanks

Gene Sequences in Fasta