
snayfach / midas

An integrated pipeline for estimating strain-level genomic variation from metagenomic data

Home Page: http://dx.doi.org/10.1101/gr.201863.115

License: GNU General Public License v3.0

Python 96.05% Perl 3.95%

midas's Introduction

Metagenomic Intra-Species Diversity Analysis System (MIDAS)

MIDAS is an integrated pipeline that leverages >30,000 reference genomes to estimate bacterial species abundance and strain-level genomic variation, including gene content and SNPs, from shotgun metagenomes.

Table of Contents

  1. Getting started
  1. Reference database
  1. Run MIDAS on a single sample:
  1. Merge MIDAS results across samples:
  1. Example scripts for analyzing gene content and SNPs:
  1. Citing

midas's People

Contributors

jbrodriguezmueller, mingzhi, snayfach

midas's Issues

test fails

Banging my head against the wall for awhile... python test_midas.py fails with the following message:
| => python test_midas.py
.....F....

FAIL: test_help_text (main.MergeSNPs)

Traceback (most recent call last):
File "test_midas.py", line 143, in test_help_text
self.assertTrue(sum(self.retcodes)==0, msg=error)
AssertionError:

Failed to execute the command: merge_midas.py snps


Ran 10 tests in 288.881s

FAILED (failures=1)

I'm using a brew-installed Python, and the packages are current (numpy, pysam, biopython, pandas), on Mac OS X Sierra 10.12.5. Any ideas?

Missing species in MIDAS

Hi,
First of all, thank you for publishing MIDAS.
I was wondering why Methanobrevibacter smithii has no representative genomes.
It appears in genome_taxonomy.txt but not in genome_taxonomy.txt
Thanks for your help.

Failed to execute the command: merge_midas.py genes error

I just installed MIDAS and ran python test_midas.py -f
I got the following error:
python test_midas.py -f
./home/qiime/anaconda3/lib/python3.5/site-packages/boto/plugin.py:40: PendingDeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp

.F

FAIL: test_help_text (main.MergeGenes)

Traceback (most recent call last):
File "test_midas.py", line 92, in test_help_text
self.assertTrue(sum(self.retcodes)==0, msg=error)
AssertionError: False is not true :

Failed to execute the command: merge_midas.py genes


Ran 3 tests in 1.984s

FAILED (failures=1)

No output or errors in snps and genes modes

Hello, dear colleagues! Thank you for this amazing tool!

However, I'm having an issue of not getting any output when I run snps and genes modes. In both snps and genes folders there are empty output folders and the size of snps/temp/pangenomes.bam is 0 bytes. Species module works just fine - we obtained species_profile.txt on every sample.

Below you will find a log I got while running genes module on one of the samples. Thanks in advance!

MIDAS: Metagenomic Intra-species Diversity Analysis System
version 1.3.0; github.com/snayfach/MIDAS
Copyright (C) 2015-2016 Stephen Nayfach
Freely distributed under the GNU General Public License (GPLv3)

===========Parameters===========
Command: /home/samoilov/MIDAS/scripts/run_midas.py genes -1 /data5/bio/runs-nprian/pipeline-results/illumina/qc_intermediate/Hp_23_S25_R1.trim.nohum.fastq.gz -2 /data5/bio/runs-nprian/pipeline-results/illumina/qc_intermediate/Hp_23_S25_R2.trim.nohum.fastq.gz -t 20 /data7/bio/runs-samoilov/MIDAS/Hp_23_S25
Script: run_midas.py genes
Database: /data7/bio/BacGenomeSoft/MIDAS_DB/midas_db_v1.2
Output directory: /data7/bio/runs-samoilov/MIDAS/Hp_23_S25
Remove temporary files: False
Pipeline options:
build bowtie2 database of pangenomes
align reads to bowtie2 pangenome database
quantify coverage of pangenomes genes
Database options:
include all species with >=3.0X genome coverage
Read alignment options:
input reads (1st mate): /data5/bio/runs-nprian/pipeline-results/illumina/qc_intermediate/Hp_23_S25_R1.trim.nohum.fastq.gz
input reads (2nd mate): /data5/bio/runs-nprian/pipeline-results/illumina/qc_intermediate/Hp_23_S25_R2.trim.nohum.fastq.gz
alignment speed/sensitivity: very-sensitive
number of reads to use from input: use all
number of threads for database search: 20
Gene coverage options:
minimum alignment percent identity: 94.0
minimum alignment coverage of reads: 0.75
minimum read quality score: 20
minimum mapping quality score: 0
trim 0 base-pairs from 3'/right end of read

Building pangenome database
command: /home/samoilov/MIDAS/bin/Linux/bowtie2-build --threads 20 /data7/bio/runs-samoilov/MIDAS/Hp_23_S25/genes/temp/pangenomes.fa /data7/bio/runs-samoilov/MIDAS/Hp_23_S25/genes/temp/pangenomes

Aligning reads to pangenomes
command: /home/samoilov/MIDAS/bin/Linux/bowtie2 --no-unal -x /data7/bio/runs-samoilov/MIDAS/Hp_23_S25/genes/temp/pangenomes --very-sensitive-local --threads 20 -q -1 /data5/bio/runs-nprian/pipeline-results/illumina/qc_intermediate/Hp_23_S25_R1.trim.nohum.fastq.gz -2 /data5/bio/runs-nprian/pipeline-results/illumina/qc_intermediate/Hp_23_S25_R2.trim.nohum.fastq.gz | /home/samoilov/MIDAS/bin/Linux/samtools view --threads 20 -b - > /data7/bio/runs-samoilov/MIDAS/Hp_23_S25/genes/temp/pangenomes.bam
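
A 0-byte pangenomes.bam with no error in the log often means that the aligner or samtools exited early inside the shell pipe. A quick way to confirm the symptom is to look for missing or zero-byte alignment files before any downstream step reads them. The sketch below is not part of MIDAS: the helper name `find_empty_files` is hypothetical, and the path layout is taken from the log above.

```python
import os


def find_empty_files(paths):
    """Return the paths that are missing or have a size of 0 bytes."""
    problems = []
    for path in paths:
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append(path)
    return problems


# Output directory copied from the log above; adjust it for your own run.
outdir = "/data7/bio/runs-samoilov/MIDAS/Hp_23_S25"
bam = os.path.join(outdir, "genes", "temp", "pangenomes.bam")
for path in find_empty_files([bam]):
    print("missing or empty:", path)
```

If the file is reported, running the bowtie2 and samtools commands from the log by hand usually shows the underlying error message.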

Error when using run_midas for genes

Hi, Thank you very much for your tool. I was using it on a fastq file I have and successfully ran the species function, however, when I ran the genes function I got the following error (after 6 hours of it running successfully)
`
MIDAS: Metagenomic Intra-species Diversity Analysis System
version 1.2.1; github.com/snayfach/MIDAS
Copyright (C) 2015-2016 Stephen Nayfach
Freely distributed under the GNU General Public License (GPLv3)

Parameters
Command: ./MIDAS/scripts/run_midas.py genes ./data_out/poolA_res/ -1 ../FASTQ_Files/poolA_S17_L005_R1_001.fastq
Script: run_midas.py genes
Database: midas_db_v1.2
Output directory: ./data_out/poolA_res/
Remove temporary files: False
Pipeline options:
build bowtie2 database of pangenomes
align reads to bowtie2 pangenome database
quantify coverage of pangenomes genes
Database options:
include all species with >=3.0X genome coverage
Read alignment options:
input reads (1st mate): ../FASTQ_Files/poolA_S17_L005_R1_001.fastq
input reads (2nd mate): None
alignment speed/sensitivity: very-sensitive
number of reads to use from input: use all
number of threads for database search: 1
Gene coverage options:
minimum alignment percent identity: 94.0
minimum alignment coverage of reads: 0.75
minimum read quality score: 20
minimum mapping quality score: 0
trim 0 base-pairs from 3'/right end of read

Reading reference data
0.56 minutes
0.2 Gb maximum memory

Building pangenome database
total species: 11
total genes: 576240
total base-pairs: 497291760
33.44 minutes
1.16 Gb maximum memory

Aligning reads to pangenomes
finished aligning
checking bamfile integrity
293.88 minutes
1.16 Gb maximum memory

Computing coverage of pangenomes
total aligned reads: 29207158
total mapped reads: 22242407
Traceback (most recent call last):
File "./MIDAS/scripts/run_midas.py", line 699, in <module>
run_program(program, args)
File "./MIDAS/scripts/run_midas.py", line 78, in run_program
genes.run_pipeline(args)
File "/Users/williampascucci/Documents/Research/MIDAS/MIDAS/midas/run/genes.py", line 274, in run_pipeline
pangenome_coverage(args, species, genes)
File "/Users/williampascucci/Documents/Research/MIDAS/MIDAS/midas/run/genes.py", line 151, in pangenome_coverage
normalize(args, species, genes)
File "/Users/williampascucci/Documents/Research/MIDAS/MIDAS/midas/run/genes.py", line 201, in normalize
sp.marker_coverage = np.median(sp.markers.values())
File "//anaconda/lib/python3.5/site-packages/numpy/lib/function_base.py", line 3944, in median
overwrite_input=overwrite_input)
File "//anaconda/lib/python3.5/site-packages/numpy/lib/function_base.py", line 3858, in _ureduce
r = func(a, **kwargs)
File "//anaconda/lib/python3.5/site-packages/numpy/lib/function_base.py", line 4002, in _median
return mean(part[indexer], axis=axis, out=out)
File "//anaconda/lib/python3.5/site-packages/numpy/core/fromnumeric.py", line 2889, in mean
out=out, **kwargs)
File "//anaconda/lib/python3.5/site-packages/numpy/core/_methods.py", line 82, in _mean
ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'dict_values' and 'int'`

I am not sure where I am going wrong, thank you for any help you can provide!
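
The traceback points at `np.median(sp.markers.values())`. In Python 3, `dict.values()` returns a view object rather than a list, and NumPy cannot reduce it numerically, which matches the `TypeError` above. A minimal sketch of the usual workaround follows, using invented coverage values; whether the MIDAS authors fixed it this way is an assumption.

```python
import numpy as np

# Invented marker-gene coverages, standing in for sp.markers in the traceback.
markers = {"marker_1": 2.0, "marker_2": 4.0, "marker_3": 9.0}

# dict.values() is a view object in Python 3; materialize it before reducing.
marker_coverage = np.median(list(markers.values()))
print(marker_coverage)
```

Wrapping the view in `list()` also works under Python 2, where `values()` already returns a list.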

--site_prev option: can the site depth be different

Hi Stephen,

I noticed that the --site_prev option recycles the site_depth parameter for calling SNPs:

--site_prev FLOAT Site has at least <site_depth> coverage in at least <site_prev> proportion of samples.

I wondered if it would make sense to have the option for the site_depth parameter in the --site_prev parameter to be different from the site_depth parameter used to decide whether a SNP is called or not. I could imagine a scenario where even though there aren't sufficient reads to say which SNP alleles are present, there are enough reads to say that a SNP is present. Then, one could use a smaller number for the site_depth within --site_prev than for the --site_depth option.

I'm not sure if this option adds much to the SNP calling. Does this make any sense for improving the quality of the output of SNPs?
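
The proposal can be made concrete with a small sketch of the prevalence filter. The function names and the separate `prev_depth` argument are hypothetical (they are what this issue asks for, not an existing MIDAS option), and the depths are invented.

```python
def site_prevalence(depths, min_depth):
    """Fraction of samples whose depth at this site is at least min_depth."""
    if not depths:
        return 0.0
    return sum(d >= min_depth for d in depths) / float(len(depths))


def keep_site(depths, prev_depth, site_prev):
    """Keep the site if enough samples reach prev_depth (hypothetical parameter)."""
    return site_prevalence(depths, prev_depth) >= site_prev


depths = [12, 3, 0, 7, 2]  # invented per-sample depths at one site
print(keep_site(depths, prev_depth=5, site_prev=0.5))  # 2 of 5 samples pass
print(keep_site(depths, prev_depth=2, site_prev=0.5))  # 4 of 5 samples pass
```

With a lower depth threshold for prevalence, the same site goes from filtered out to retained, which is the behaviour the question is about.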

Corrupted genome.features files for certain species

Some genomes downloaded from PATRIC had genes with incorrect genome coordinates. For example, gene id 1313.4609.peg.100 from genome id 1313.4609 from species id Streptococcus_pneumoniae_58285.

This genome was identified and removed on the PATRIC site (now exists as genome id 1313.10646)

The MIDAS database should be updated to get the fixed genome files. Additionally, all CDS coordinates in genome.feature files should be validated before inclusion in the MIDAS db
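
A minimal sketch of the kind of coordinate validation suggested above. It assumes 1-based inclusive coordinates with start <= end on both strands, and that scaffold lengths are available from the matching FASTA file; the function name and the example values are invented for illustration.

```python
def invalid_features(rows, contig_lengths):
    """Yield (gene_id, reason) for rows with implausible coordinates.

    rows: iterable of (gene_id, scaffold_id, start, end, strand)
    contig_lengths: dict mapping scaffold_id to its length in base pairs
    """
    for gene_id, scaffold_id, start, end, strand in rows:
        if scaffold_id not in contig_lengths:
            yield gene_id, "unknown scaffold"
        elif start < 1 or end < 1:
            yield gene_id, "coordinate below 1"
        elif start > end:
            yield gene_id, "start after end"
        elif end > contig_lengths[scaffold_id]:
            yield gene_id, "end beyond scaffold length"
        elif strand not in ("+", "-"):
            yield gene_id, "bad strand"


# Invented example: one plausible gene and one that runs off its scaffold.
lengths = {"scaffold_1": 1000}
rows = [
    ("gene_1", "scaffold_1", 10, 400, "+"),
    ("gene_2", "scaffold_1", 900, 1200, "-"),
]
print(list(invalid_features(rows, lengths)))
```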

Testing Problem

Dear Midas programmers,

In order to set up MIDAS, I created a Python virtual environment on our server, since our server has to use python2.6 for its core programs. I set up the Python environment variables as instructed and ran the testing script. However, I encountered the following error message and have no clue how to solve it. Could you please help me with this?

(ve) [hulin@HSDMongo1 test]$ ./test_midas.py -vf
test_class (__main__._01_CheckEnv) ... ok
test_class (__main__._02_ImportDependencies) ... ok
test_class (__main__._03_CheckVersions) ... ok
test_class (__main__._04_HelpText) ... ok
test_class (__main__._05_RunSpecies) ... ok
test_class (__main__._06_RunGenes) ... FAIL

======================================================================
FAIL: test_class (__main__._06_RunGenes)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./test_midas.py", line 96, in test_class
    self.assertTrue(code==0, msg=err)
AssertionError: Traceback (most recent call last):
  File "/home/hulin/data/tools/midas/MIDAS/scripts/run_midas.py", line 742, in <module>
    run_program(program, args)
  File "/home/hulin/data/tools/midas/MIDAS/scripts/run_midas.py", line 79, in run_program
    genes.run_pipeline(args)
  File "/home/hulin/data/tools/midas/MIDAS/midas/run/genes.py", line 285, in run_pipeline
    pangenome_coverage(args, species, genes)
  File "/home/hulin/data/tools/midas/MIDAS/midas/run/genes.py", line 148, in pangenome_coverage
    count_mapped_bp(args, species, genes)
  File "/home/hulin/data/tools/midas/MIDAS/midas/run/genes.py", line 174, in count_mapped_bp
    bamfile = pysam.AlignmentFile(bam_path, "rb")
  File "pysam/libcalignmentfile.pyx", line 401, in pysam.libcalignmentfile.AlignmentFile.__cinit__ (pysam/libcalignmentfile.c:5835)
  File "pysam/libcalignmentfile.pyx", line 611, in pysam.libcalignmentfile.AlignmentFile._open (pysam/libcalignmentfile.c:8072)
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False


----------------------------------------------------------------------
Ran 6 tests in 44.447s

FAILED (failures=1)

Thank you very much.

Eddi

run_midas.py snps produces no PileUp file

At the current commit, running the tutorial with the examples fails when running:
run_midas.py snps midas_output/sample_1 -1 example/sample_1.fq.gz

The script complains that no pileup file can be opened when trying to format the mpileup output. It seems that the mpileup file is missing because the function call that generates it is commented out in snps.py, line 288.

When I remove the comment, the example runs through without any troubles.

Many more reads in SNP mapping for 1.3.0 vs 1.2.1

I recently switched from version 1.2.1 to version 1.3.0. I realize there are newer versions of the alignment software in this release so I wasn't surprised when I saw some tiny differences in species identifications. Here is an example of abundances from MIDAS/1.2.1

$ head /scratch/users/surh/midas/hmp/SRS051941/species/species_profile.txt 
species_id	count_reads	coverage	relative_abundance
Corynebacterium_matruchotii_57066	20696	188.980369951	0.291044573929
Actinomyces_sp_62446	3293	28.2154305895	0.0434539734274
Selenomonas_noxia_56778	2150	20.6481371969	0.0317997488018
Leptotrichia_buccalis_61702	2092	20.2566223535	0.0311967852728
Capnocytophaga_granulosa_57613	1895	18.7777777778	0.0289192487775
Haemophilus_parainfluenzae_62468	1888	17.649813272	0.027182095077
Capnocytophaga_sp_58386	1697	15.5149051491	0.0238941693247
Rothia_dentocariosa_57938	1641	13.9604698045	0.021500217124

And here are the corresponding results from mapping with MIDAS/1.3.0

$ head species_profile.txt 
species_id	count_reads	coverage	relative_abundance
Corynebacterium_matruchotii_57066	20696	188.980369951	0.291047096058
Actinomyces_sp_62446	3293	28.2154305895	0.0434543499901
Selenomonas_noxia_56778	2150	20.6481371969	0.0318000243715
Leptotrichia_buccalis_61702	2093	20.2645987198	0.0312093399527
Capnocytophaga_granulosa_57613	1895	18.7777777778	0.0289194993854
Haemophilus_parainfluenzae_62468	1886	17.63975869	0.0271668456529
Capnocytophaga_sp_58386	1712	15.6553593122	0.0241106886749
Rothia_dentocariosa_57938	1642	13.9689718643	0.0215134973917
Capnocytophaga_ochracea_58179	1318	11.9352005916	0.018381303169

You can see that there are some tiny differences that probably don't influence the results.

However, I then tried to obtain SNPs from a species, with the following command:

run_midas.py snps /scratch/users/surh/micropopgen/exp/2017/today3/midas121/SRS051941 -1 /scratch/users/surh/hmp_samples//SRS051941_read1.fastq.bz2 -2 /scratch/users/surh/hmp_samples//SRS051941_read2.fastq.bz2 -t 8 --remove_temp --species_cov 3.0 --mapid 94.0 --mapq 20 --baseq 30 --readq 30 --species_id  Haemophilus_parainfluenzae_62356

I used the exact same command (except for output directory) with both MIDAS/1.2.1 and MIDAS/1.3.0

This is what I get from MIDAS/1.2.1:

$ zcat midas121/SRS051941/snps/output/Haemophilus_parainfluenzae_62356.snps.gz | head
ref_id	ref_pos	ref_allele	alt_allele	ref_freq	depth	count_atcg
FQ312002	1	T	NA	0.0	0	0,0,0,0
FQ312002	2	A	NA	1.0	1	1,0,0,0
FQ312002	3	T	NA	1.0	2	0,2,0,0
FQ312002	4	G	NA	1.0	2	0,0,0,2
FQ312002	5	G	NA	1.0	4	0,0,0,4
FQ312002	6	C	NA	1.0	6	0,0,6,0
FQ312002	7	T	A	0.833333333333	6	1,5,0,0
FQ312002	8	A	NA	1.0	7	7,0,0,0
FQ312002	9	T	NA	1.0	7	0,7,0,0

And this is what I get from MIDAS/1.3.0

$ zcat midas130/SRS051941/snps/output/Haemophilus_parainfluenzae_62356.snps.gz | head
ref_id	ref_pos	ref_allele	depth	count_a	count_c	count_g	count_t
FQ312002	1	T	15	0	0	0	15
FQ312002	2	A	16	16	0	0	0
FQ312002	3	T	17	0	0	0	17
FQ312002	4	G	18	0	0	18	0
FQ312002	5	G	17	0	0	17	0
FQ312002	6	C	20	0	20	0	0
FQ312002	7	T	20	0	0	0	20
FQ312002	8	A	23	23	0	0	0
FQ312002	9	T	21	0	0	0	21

You can see I get dramatically more reads in MIDAS/1.3.0 though it seems like the major allele is consistent.
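
Because the 1.3.0 output no longer has a ref_freq column, comparing the two versions site by site requires recomputing it. The sketch below does that for one line of the 1.3.0 format, assuming the column order shown in the header above (count_a, count_c, count_g, count_t) and that depth is the sum of the four counts; the function name is hypothetical.

```python
def ref_freq_from_counts(line):
    """Return (ref_id, ref_pos, ref_freq) for one 1.3.0-style .snps line."""
    fields = line.rstrip("\n").split("\t")
    ref_id, ref_pos, ref_allele = fields[0], int(fields[1]), fields[2]
    depth = int(fields[3])
    # Columns 5-8 are count_a, count_c, count_g, count_t in that order.
    counts = dict(zip("ACGT", (int(x) for x in fields[4:8])))
    if depth == 0:
        return ref_id, ref_pos, None  # frequency undefined without reads
    return ref_id, ref_pos, counts[ref_allele] / float(depth)


# Position 7 from the 1.3.0 output above.
print(ref_freq_from_counts("FQ312002\t7\tT\t20\t0\t0\t0\t20"))
```

For position 7 this gives a reference frequency of 1.0 in 1.3.0, compared with 0.833 in the 1.2.1 output, so the two versions differ in more than depth at that site.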

I just wonder if such big differences are expected. I attach a couple of small read files that seem to reproduce the pattern, which I didn't see with the test sample in the test directory.

Thanks,
Sur
read1.fastq.gz
read2.fastq.gz

changing the representative genome

Hi Stephen,

I want to change the representative genome used for calling SNPs for one of the species. I just wanted to confirm that the only two files that need to be updated using info from PATRIC are:

midas_db_v1.2/rep_genomes/Bacteroides_uniformis_57318/genome.features.gz
midas_db_v1.2/rep_genomes/Bacteroides_uniformis_57318/genome.fna.gz

Does midas_db_v1.2/genome_info.txt need to be updated too?

Thanks,
Nandita

Tests fail at MergeGenes (merge_midas.py genes)

Hello,

similarly to #20 running test_midas.py fails at this step:

./test_midas.py 
....F.....
======================================================================
FAIL: test_help_text (__main__.MergeGenes)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./test_midas.py", line 132, in test_help_text
    self.assertTrue(sum(self.retcodes)==0, msg=error)
AssertionError: 

Failed to execute the command: merge_midas.py genes 

----------------------------------------------------------------------
Ran 10 tests in 252.652s

FAILED (failures=1)

I am running MIDAS on python 2.7.3

I traced back the error a bit and it probably comes from this call merge_midas.py genes ./genes -i ./sample -t list --species_id Bacteroides_vulgatus_57955 --sample_depth 0.0, when I run it separately it shows the following error message:

[...]
Identifying species
  found 1 species with sufficient high-coverage samples

Merging: Bacteroides_vulgatus_57955 for 1 samples
Traceback (most recent call last):
  File "/work/local_tools/MIDAS/scripts/merge_midas.py", line 352, in <module>
    run_program(program, args)
  File "/work/local_tools/MIDAS/scripts/merge_midas.py", line 339, in run_program
    merge_genes.run_pipeline(args)
  File "/work/local_tools/MIDAS/midas/merge/merge_genes.py", line 109, in run_pipeline
    read_cluster_map(sp, args['db'], args['cluster_pid'])
  File "/work/local_tools/MIDAS/midas/merge/merge_genes.py", line 97, in read_cluster_map
    sp.map[r['centroid_99']] =  r['centroid_%s' % pid]
KeyError: 'centroid_95'

Do you know why this happens?
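
The `KeyError` suggests that the cluster map being read has no `centroid_95` column, which could happen if the installed database and the code come from different versions (an assumption, not confirmed here). A sketch of a reader that checks the header first and reports the available columns is shown below; the table contents are invented, and only the `centroid_99` and `centroid_%s` names come from the traceback.

```python
import csv
import io


def read_cluster_map(handle, pid):
    """Map centroid_99 IDs to centroid_<pid> IDs, failing early with a clear message."""
    reader = csv.DictReader(handle, delimiter="\t")
    fieldnames = reader.fieldnames or []
    column = "centroid_%s" % pid
    if column not in fieldnames:
        raise ValueError(
            "column %r not found; available columns: %s"
            % (column, ", ".join(fieldnames))
        )
    return {row["centroid_99"]: row[column] for row in reader}


# Invented map file that lacks a centroid_95 column.
table = "gene_id\tcentroid_99\tcentroid_90\ng1\tc1\tk1\n"
print(read_cluster_map(io.StringIO(table), 90))
```

Printing the header of the map file in the database directory is a quick way to see which clustering thresholds are actually present.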

test.py issue

Hello, I'm trying to test MIDAS and I keep getting the following error when I run the test_midas.py script:
python test_midas.py -vf
test_class (main._01_CheckEnv) ... ok
test_class (main._02_ImportDependencies) ... ok
test_class (main._03_CheckVersions) ... ok
test_class (main._04_HelpText) ... FAIL

======================================================================
FAIL: test_class (main._04_HelpText)

Traceback (most recent call last):
File "test_midas.py", line 84, in test_class
self.assertTrue(code==0, msg=err)
AssertionError: False is not true : b'Traceback (most recent call last):\n File "/home1/02661/traverse/build/MIDAS/scripts/run_midas.py", line 8, in \n from midas import utility\n File "/home1/02661/traverse/build/MIDAS/midas/utility.py", line 7, in \n import io, os, stat, sys, resource, gzip, platform, subprocess, bz2, Bio.SeqIO\nImportError: No module named Bio.SeqIO\n'


Ran 4 tests in 0.518s

FAILED (failures=1)

I have all of the dependencies installed and they work properly when I import them in a python environment. I'm using python3, which might be the issue. Is MIDAS compatible with python3? I tried using 2.7 but I couldn't properly install the dependencies on the cluster I'm using.
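
One thing worth checking is which interpreter actually runs the scripts. The test itself printed a bytes string (Python 3 behaviour), while the wording "No module named Bio.SeqIO" without quotes looks like a Python 2 message, so run_midas.py may be launched by a different Python than the one where the dependencies were installed. The sketch below reports the current interpreter and any modules it cannot find; the helper name is hypothetical, and the module list follows the dependencies named in the issues above.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names that this interpreter cannot locate."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package is absent, e.g. "Bio" for "Bio.SeqIO".
            found = False
        if not found:
            missing.append(name)
    return missing


print("interpreter:", sys.executable)
print("missing:", missing_modules(["numpy", "pandas", "pysam", "Bio.SeqIO"]))
```

Running this with the same interpreter named on the first line of run_midas.py shows whether that interpreter can see the packages.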

Build own database: <genome_id>.genes

Hi,
I would like to use MIDAS with my own genomes as they are not present in the reference database.

I'm able to get all the required input files but I'm not sure how to get the <genome_id>.genes file:

gene_id (CHAR)
scaffold_id (CHAR)
start (INT)
end (INT)
strand (+ or -)
gene_type (CDS or RNA)

Is there any program which can produce that kind of file as output, or do you have a recommendation about how to get that kind of file?

Thank you very much in advance.
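
Many genome annotation tools write GFF3, and the six fields listed above can be derived from it. The sketch below assumes standard nine-column GFF3 input with an `ID=` attribute on each feature; the function name, the set of RNA feature types, and the example line are assumptions, and the exact layout MIDAS expects (header line, column order) should be checked against its documentation.

```python
RNA_TYPES = {"tRNA", "rRNA", "tmRNA", "ncRNA"}  # assumed set of RNA feature types


def gff3_to_genes_rows(lines):
    """Convert GFF3 feature lines to (gene_id, scaffold_id, start, end, strand, gene_type)."""
    rows = []
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
        if ftype == "CDS":
            gene_type = "CDS"
        elif ftype in RNA_TYPES:
            gene_type = "RNA"
        else:
            continue  # skip gene, region, and other feature types
        attr_map = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
        gene_id = attr_map.get("ID")
        if gene_id is None:
            continue
        rows.append((gene_id, seqid, int(start), int(end), strand, gene_type))
    return rows


# Invented GFF3 line for illustration.
example = ["contig_1\ttool\tCDS\t10\t99\t.\t+\t0\tID=gene_0001;product=x\n"]
for row in gff3_to_genes_rows(example):
    print("\t".join(str(value) for value in row))
```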

test_midas.py errors

Hi Stephen,

We ran test_midas.py and got the error output attached. Any thoughts?

We're testing run_midas.py species on some of our data now and it seems to be running just fine so far. Do you think we need to be concerned about this testing error?

Thanks!
Lizzy
test-midas-err.txt

Allow for own/other samtools/bowtie2 executables.

I run a cluster that has its own, well-integrated samtools and bowtie2. It would be convenient if using your supplied tools (thanks for the thought, but..) was optional, and the scripts just looked for the executables on the PATH. Or you supplied static executables.
Your exes are built with Intel's TBB which we don't have, and also some other libs that aren't found. And your GLIBC requirements are too modern for our Devonian kernel.
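
The requested behaviour could look like the sketch below: prefer an executable found on the PATH and fall back to the bundled copy only when none is found. This is an illustration of the request, not how MIDAS currently resolves its tools; the function name is hypothetical, and the bundled directory mirrors the `bin/Linux` layout seen in the logs above.

```python
import os
import shutil


def resolve_tool(name, bundled_dir):
    """Prefer an executable on PATH; fall back to the bundled copy."""
    on_path = shutil.which(name)
    if on_path is not None:
        return on_path
    return os.path.join(bundled_dir, name)


for tool in ("bowtie2", "bowtie2-build", "samtools"):
    print(tool, "->", resolve_tool(tool, "MIDAS/bin/Linux"))
```

The reverse preference (bundled first, PATH second) would keep results reproducible across installations, so the order is a design choice rather than an obvious default.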

Pathway annotation issue

Both the KEGG and EC IDs for the reference genome *features.gz files in the MIDAS database are just the GO term IDs repeated:

435590.9.peg.42 NC_009614 57242 58978 - CDS GO:0009044;KEGG:0009044;FIGFAM:FIG00003086;EC:0009044
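
To gauge how widespread the problem is, the functions field can be parsed and checked for KEGG or EC entries that merely repeat a GO identifier. A minimal sketch follows, using the example line above; the function names are hypothetical.

```python
def parse_functions(field):
    """Split 'DB:ID;DB:ID;...' into a dict mapping database name to a list of IDs."""
    out = {}
    for item in field.split(";"):
        if ":" in item:
            db, ident = item.split(":", 1)
            out.setdefault(db, []).append(ident)
    return out


def repeated_go_ids(field, dbs=("KEGG", "EC")):
    """Return, per database, the IDs that are identical to a GO ID on the same gene."""
    funcs = parse_functions(field)
    go_ids = set(funcs.get("GO", []))
    repeated = {}
    for db in dbs:
        shared = sorted(go_ids & set(funcs.get(db, [])))
        if shared:
            repeated[db] = shared
    return repeated


field = "GO:0009044;KEGG:0009044;FIGFAM:FIG00003086;EC:0009044"
print(repeated_go_ids(field))
```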

compare_genes.py vs. gene_sharing.py

There is an inconsistency in the naming of the script regarding the gene content analysis. In the documentation, the script is named gene_sharing.py, while both the script and the documentation markdown document are named compare_genes.py and compare_genes.md, respectively.

One should either fix the documentation or rename the files accordingly.

Deduplication and allele frequency estimation with strain tracking pipelines

Hi,

I am relatively new to this kind of data, so my question might be naive. I am using the MIDAS pipeline extensively for my project, and I came to wonder how the deduplication step applied to raw sequencing data (gut metagenomes) could affect our ability to accurately estimate allele frequencies.
There are a number of strain-tracking pipelines similar to MIDAS coming out, but I don't think I have seen this issue explored by any of them (although I have not read those other papers in detail yet).

I do not have a strong intuition about how this could affect it, but accurate allele frequency estimation relies on having one read per original DNA molecule, right ('one genome per bacterium')?
How big could the issue be, and should this be checked by simulations or something?

Thanks,
Camille
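
A toy calculation illustrates the concern. It assumes each read can be traced to its source molecule, which real data do not allow (real deduplication works on read sequence or alignment position and can also collapse independent reads at high coverage), so it is only an illustration of the direction of the bias, with invented numbers.

```python
from collections import Counter


def allele_freq(reads, allele):
    """Frequency of an allele among (molecule_id, allele) reads."""
    counts = Counter(a for _molecule, a in reads)
    return counts[allele] / float(len(reads))


# Ten molecules: 0-4 carry allele A, 5-9 carry allele B.
# Molecule 0 is over-amplified and sequenced six times.
reads = (
    [(0, "A")] * 6
    + [(m, "A") for m in range(1, 5)]
    + [(m, "B") for m in range(5, 10)]
)

raw = allele_freq(reads, "A")                  # 10 of 15 reads carry A
dedup = allele_freq(sorted(set(reads)), "A")   # 5 of 10 molecules carry A
print(raw, dedup)
```

In this idealized case duplicates inflate the estimate from 0.5 to about 0.67; how large the effect is in real metagenomes would need proper simulation.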

Result Inquiry

Hi (Thanks again for your software),
In regards to the output of the read mapping: for each species/strain in the .genes file, do adjacent IDs (i.e. 1182695.3.peg.1168, 1182695.3.peg.1169, 1182695.3.peg.1170) correspond to adjacent positions on the genome of the species? If so: when examining the results of the read mapping to pangenomes for a few samples, we saw several cases where one position had thousands of reads mapped to it but an adjacent position had no reads. We thought that was a little odd compared to most other positions, where several adjacent positions would have smoothly distributed reads.
Had you experienced any similar cases/results? If so, did you have any methods for accounting for those cases?
Thanks!

snps.py

Dear Snayfach

Thank you for your very useful tool. I was working with MIDAS trying to build a custom database. After several attempts I got it. Now I am running "run_midas.py" using my custom database; however, I get an error and the output is wrong. This is the command that I am using:
run_midas.py snps MIDAS_outdir -1 R1.fastq -2 R2.fastq -d My_Database
I was wondering if something is wrong with my custom database, because when the midasdb is used the output is correct and there is no error. Another possible cause of the error could be a problem with "mpileup". Do you have any idea about this issue? Thanks.

MIDAS: Metagenomic Intra-species Diversity Analysis System
version 1.2.1; github.com/snayfach/MIDAS
Copyright (C) 2015-2016 Stephen Nayfach
Freely distributed under the GNU General Public License (GPLv3)

===========Parameters===========
Command: MIDAS-1.2.2/scripts/run_midas.py snps MIDAS_outdir -1 R1.fastq -2 R2.fastq -d My_Database
Script: run_midas.py snps
Database: My_Database
Output directory: MIDAS_outdir
Remove temporary files: False
Pipeline options:
build bowtie2 database of genomes
align reads to bowtie2 genome database
use samtools to generate pileups and call SNPs
Database options:
include all species with >=3.0X genome coverage
Read alignment options:
input reads (1st mate): R1.fastq
input reads (2nd mate): R2.fastq
alignment speed/sensitivity: very-sensitive
number of reads to use from input: use all
number of threads for database search: 1
SNP calling options:
minimum alignment percent identity: 94.0
minimum mapping quality score: 20
minimum base quality score: 30
minimum read quality score: 20
trim 0 base-pairs from 3'/right end of read

Reading reference data
0.01 minutes
0.05 Gb maximum memory

Building database of representative genomes
total genomes: 5
total contigs: 860
total base-pairs: 25344819
0.76 minutes
0.22 Gb maximum memory

Mapping reads to representative genomes
finished aligning
checking bamfile integrity
52.42 minutes
0.88 Gb maximum memory

Running mpileup
3.96 minutes
0.88 Gb maximum memory

Formatting output
Traceback (most recent call last):
File "/MIDAS-1.2.2/scripts/run_midas.py", line 699, in <module>
run_program(program, args)
File "/MIDAS-1.2.2/scripts/run_midas.py", line 81, in run_program
snps.run_pipeline(args)
File "/MIDAS3/MIDAS-1.2.2/midas/run/snps.py", line 295, in run_pipeline
format_pileup(args, species, contigs)
File "/MIDAS-1.2.2/midas/run/snps.py", line 126, in format_pileup
contig = contigs[sp.contigs[sp.i]]
IndexError: list index out of range

consistent species misassignment

Hello,

I'm running run_midas.py species on a simulated metagenomics dataset to understand the biases I need to be aware of when I run my real data set, and I've found several consistently incorrect assignments that I'm hoping you can explain. I'm running MIDAS v1.0.0, with the default database, as

run_midas.py species MIDAS/anc100e1 -1 anc100e1.all.fastq &> anc100e1.midasout

I have a handful of species that always appear in my species_profile.txt output that aren't in my simulated sample set. I check which other species share the same species_id by getting the genome_ids from genome_to_species.txt and then the genome_name from genome_taxonomy.txt. In each case, one of the other genomes is a species in my simulated dataset that wasn't included in the species_profile.txt.

For example, I get Phocaeicola abscessus in my species_profile.txt, but it isn't in my simulated dataset. It shares species_id 52822 with Bacteriodetes oral taxon 272, which is in my dataset, but never appears in my species_profile.txt. Another example is getting Candidatus Prevotella in my species_profile.txt instead of Prevotella oral taxon 317, both species_id 58138.

Can you explain why one species instead of another is chosen when several share the same species_id? And is there any way to change the assignment preference?

Thanks,
Irina

Error: no species sastisfied your selection criteria.

Hi, I'm trying to run the program run_midas.py genes and snps and I get the error message "Error: no species sastisfied your selection criteria". I ran the "species" program without any problem and also read the help pages but still don't know how to fix this error. I attached the log.txt file of the genes program. Thank you.
log.txt

Building custom database - input file formats?

Hey Stephen,

I'm working with my students to give your software a whirl and had a question about building a custom database. In your description of input files, we had a question about file formats & content:

<genome_id>.fna Genomic DNA sequence in FASTA format
<genome_id>.faa Genomic DNA sequence in FASTA format
<genome_id>.ffn Protein sequences in FASTA format

Should the .faa file be the coding sequences (in amino acids) and the .ffn be coding sequences in nucleic acids? I wasn't sure if there might be a typo in your instructions above...

Also - if we have only one genome per species, are there any other considerations we should keep in mind, or will it simply skip the pangenome step for a species with a single genome (as referenced in the mapfile)?

Thanks!
Lizzy

query_coverage requires obscure format

Thank you for this wonderful tool. My test runs failed due to the requirement of species.py function query_coverage for the sequence header to end in "_" followed by the length of the sequence. I understand that m8 output format does not give the length of the hit sequence, but what the function requires is non-standard for FASTA or FASTQ and will be an impediment to most users. Please figure out another way to do this. I wrote a stupid python script to do so for FASTA (below); the analogy is easy for FASTQ. I don't think this is a good solution because it produces redundant very large files, and neither is it a good solution to edit user's original files. It may be best to grep out the line from the FASTA (or 3rd from the FASTQ) following the hit sequence header, measure the length.

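#!/usr/bin/env python
# Append "_<sequence length>" to each FASTA header.
# Assumes every record is exactly two lines: a header and one sequence line.
from sys import argv

with open(argv[1]) as f:
    line1 = f.readline().strip()
    line2 = f.readline().strip()
    while line1:
        # Keep only the first word of the header, then add the length suffix.
        print(line1.split()[0] + '_' + str(len(line2)))
        print(line2)
        line1 = f.readline().strip()
        line2 = f.readline().strip()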
Sample depth parameter for CNV merging

Hi Stephen,

I'm confused how --sample-depth works for CNV merging. Suppose I set the sample-depth to be 10, but there is only one gene that doesn't meet this requirement. Does that one gene's lack of coverage mean that we throw out the whole sample? Or is it just that gene that is removed?

Thanks,
Nandita

Returns "permission denied" after installing MIDAS on a server

Hi,
I successfully installed MIDAS on my local PC and do not get any error when I run "run_midas.py". However, after I installed it on a cluster server with the --prefix='/.local' option, I get a 'permission denied' error when I run "run_midas.py". Is anything wrong?

test_midas.py issue with bowtie2 and samtools binaries

I recently attempted to install MIDAS on our Linux cluster. When I tested the installation ('test_midas.py -vf'), I encountered the following issue in which the test script would fail on the RunGenes step:

[04:10:26] scg4-ln02:~/fiona/tools/MIDAS $ python test/test_midas.py -vf
test_class (__main__._01_CheckEnv) ... ok
test_class (__main__._02_ImportDependencies) ... ok
test_class (__main__._03_CheckVersions) ... ok
test_class (__main__._04_HelpText) ... ok
test_class (__main__._05_RunSpecies) ... ok
test_class (__main__._06_RunGenes) ... FAIL

======================================================================
FAIL: test_class (__main__._06_RunGenes)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_midas.py", line 96, in test_class
    self.assertTrue(code==0, msg=err)
AssertionError: 
Error encountered executing:
/home/tamburin/fiona/tools/MIDAS/bin/Linux/bowtie2 --no-unal -x ./sample/genes/temp/pangenomes -u 100 --very-sensitive-local --threads 1 -q -U ./test.fq.gz | /home/tamburin/fiona/tools/MIDAS/bin/Linux/samtools view --threads 1 -b - > ./sample/genes/temp/pangenomes.bam

Error message:
/home/tamburin/fiona/tools/MIDAS/bin/Linux/samtools: error while loading shared libraries: libbz2.so.1.0: cannot open shared object file: No such file or directory
/home/tamburin/fiona/tools/MIDAS/bin/Linux/bowtie2-align-s: symbol lookup error: /home/tamburin/fiona/tools/MIDAS/bin/Linux/bowtie2-align-s: undefined symbol: gzopen64
(ERR): bowtie2-align exited with value 127




----------------------------------------------------------------------
Ran 6 tests in 64.456s

FAILED (failures=1)

I was eventually able to fix this by removing the included samtools and bowtie2 binaries and replacing them with symlinks to the installations on our cluster. If others are experiencing the same error, installing your own versions appears to be a workaround. It would be helpful for the developers to look into why this is happening and create a fix.
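
For anyone who wants to script the workaround described above, here is a minimal sketch of swapping a bundled binary for a symlink to a system installation. The helper name `link_system_binary` and the example paths are assumptions for illustration; nothing here is part of MIDAS itself. The original file is kept with an `.orig` suffix so the change can be undone.

```python
import os
import shutil


def link_system_binary(bundled_path, system_path):
    """Replace bundled_path with a symlink to system_path, keeping a backup.

    Hypothetical helper: both paths are supplied by the caller.
    """
    if not os.path.isfile(system_path):
        raise IOError("system binary not found: %s" % system_path)
    if os.path.lexists(bundled_path):
        # Keep the shipped binary so the change is easy to revert.
        shutil.move(bundled_path, bundled_path + ".orig")
    os.symlink(system_path, bundled_path)
    return bundled_path
```

A call might look like `link_system_binary("MIDAS/bin/Linux/samtools", "/usr/bin/samtools")`, where the second path is whatever `which samtools` reports on your cluster.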

Error initializing species with --call flag

Command:
run_midas.py genes run_midas/$SPID -d $BSCRATCH/projects/dc4/midas_db -n 1000 -1 $FASTQ --species_cov=0.001 --call

Error:

  File "/global/projectb/scratch/snayfach/tools/MIDAS-dev/scripts/run_midas.py", line 710, in <module>
    run_program(program, args)
  File "/global/projectb/scratch/snayfach/tools/MIDAS-dev/scripts/run_midas.py", line 79, in run_program
    genes.run_pipeline(args)
  File "/global/projectb/scratch/snayfach/tools/MIDAS-dev/midas/run/genes.py", line 249, in run_pipeline
    species = initialize_species(args)
  File "/global/projectb/scratch/snayfach/tools/MIDAS-dev/midas/run/genes.py", line 47, in initialize_species
    species[id] = Species(line.rstrip())
UnboundLocalError: local variable 'id' referenced before assignment

different sep in write_abundance and read_abundance

Cluster results are written tab-separated, but read_abundance does not specify sep, which causes parsing problems because species_name contains spaces.

Here is the error when running the example:
Traceback (most recent call last):
  File "../scripts/run_phylo_cnv.py", line 534, in <module>
    run_program(program, args)
  File "../scripts/run_phylo_cnv.py", line 94, in run_program
    genes.run_pipeline(args)
  File "/SOMEWHERE/.local/lib/python2.7/site-packages/phylo_cnv/genes.py", line 220, in run_pipeline
    genome_clusters = species.select_genome_clusters(args)
  File "/SOMEWHERE/.local/lib/python2.7/site-packages/phylo_cnv/species.py", line 226, in select_genome_clusters
    cluster_abundance = read_abundance(args['profile'])
  File "/SOMEWHERE/.local/lib/python2.7/site-packages/phylo_cnv/species.py", line 217, in read_abundance
    dict[values[0]][field[0]] = field1
ValueError: invalid literal for int() with base 10: 'phocae'
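
The mismatch can be illustrated with a small sketch. The two functions below are simplified stand-ins, not the code in species.py, and the column names and the example row are made up for illustration. The point is only that the reader must split on the same tab separator the writer uses, because splitting on any whitespace breaks a species name such as "Streptococcus phocae" into two fields.

```python
def write_abundance(path, rows):
    """Write rows as tab-separated text with a header line."""
    header = ["cluster_id", "species_name", "count_reads"]  # hypothetical columns
    with open(path, "w") as out:
        out.write("\t".join(header) + "\n")
        for row in rows:
            out.write("\t".join(str(value) for value in row) + "\n")


def read_abundance(path):
    """Read the file back, splitting on tabs only."""
    records = []
    with open(path) as handle:
        header = handle.readline().rstrip("\n").split("\t")
        for line in handle:
            values = line.rstrip("\n").split("\t")
            record = dict(zip(header, values))
            record["count_reads"] = int(record["count_reads"])
            records.append(record)
    return records


# Splitting on whitespace yields one field too many for this made-up row.
example = "c1\tStreptococcus phocae\t12"
fields_whitespace = example.split()    # species name is split in two
fields_tab = example.split("\t")       # species name stays intact
```

With whitespace splitting, the third field is 'phocae' rather than the read count, which is consistent with the ValueError in the traceback above.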

utility.py missing bz2

Add import bz2 to stream_seqs.py and utility.py; otherwise bz2 files are not read in.

There should also be an error check here so that no results are generated if the input files are not read successfully (when I ran with .bz2 files, the output had 0% total abundance but no error message).
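
A rough sketch of the kind of check suggested here, assuming input files are recognized by their extension. The function names `open_reads` and `count_fastq_records` are hypothetical and do not come from utility.py or stream_seqs.py. The idea is to fail loudly when nothing could be read instead of reporting 0% abundance.

```python
import bz2
import gzip


def open_reads(path):
    """Open a plain, gzip, or bzip2 file in binary mode based on its extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    if path.endswith(".bz2"):
        return bz2.BZ2File(path, "rb")
    return open(path, "rb")


def count_fastq_records(path):
    """Count FASTQ records (4 lines each) and raise if the file yields nothing."""
    with open_reads(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines == 0:
        raise ValueError("no reads could be read from %s" % path)
    return n_lines // 4
```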

Cannot use MIDAS with python2 conda environment

I use Python 3.6.2 for my main environment and created a new conda environment for Python 2.7.13 in which I installed MIDAS. I am trying to run MIDAS but I am encountering the following errors regarding libbz2.so.1.0, gzopen64 and bowtie2-align.

Do any of these look familiar?

Added this to my bash profile:

# Python2 Alias (I also have an environment called `python2`)
alias python2='/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/bin/python'

# MIDAS
export PYTHONPATH=$PYTHONPATH:/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS
export PATH=$PATH:/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/scripts
export MIDAS_DB=/usr/local/devel/ANNOTATION/jespinoz/db/midas_db_v1.2

I ran this command after activating my python2 environment:

#!/bin/bash
echo S-1504-86.B
python run_midas.py species /usr/local/projdata/0497/projects/CariesBiome/jespinoz/pt_II/metagenome/midas_output/S-1504-86.B_midas -1 /usr/local/projdata/0497/projects/CariesBiome/metagenomics/run1/subsample-0.25_assembly/fastq_files/S-1504-86.B_RD1_kneaddata_paired_1_sub-0.25.fastq.gz -2 /usr/local/projdata/0497/projects/CariesBiome/metagenomics/run1/subsample-0.25_assembly/fastq_files/S-1504-86.B_RD1_kneaddata_paired_2_sub-0.25.fastq.gz -t 2
python run_midas.py genes /usr/local/projdata/0497/projects/CariesBiome/jespinoz/pt_II/metagenome/midas_output/S-1504-86.B_midas -1 /usr/local/projdata/0497/projects/CariesBiome/metagenomics/run1/subsample-0.25_assembly/fastq_files/S-1504-86.B_RD1_kneaddata_paired_1_sub-0.25.fastq.gz -2 /usr/local/projdata/0497/projects/CariesBiome/metagenomics/run1/subsample-0.25_assembly/fastq_files/S-1504-86.B_RD1_kneaddata_paired_2_sub-0.25.fastq.gz -t 2
python run_midas.py snps /usr/local/projdata/0497/projects/CariesBiome/jespinoz/pt_II/metagenome/midas_output/S-1504-86.B_midas -1 /usr/local/projdata/0497/projects/CariesBiome/metagenomics/run1/subsample-0.25_assembly/fastq_files/S-1504-86.B_RD1_kneaddata_paired_1_sub-0.25.fastq.gz -2 /usr/local/projdata/0497/projects/CariesBiome/metagenomics/run1/subsample-0.25_assembly/fastq_files/S-1504-86.B_RD1_kneaddata_paired_2_sub-0.25.fastq.gz -t 2

Encountered this error:

Error encountered executing:
/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/bin/Linux/bowtie2 --no-unal -x /usr/local/projdata/0497/projects/CariesBiome/jespinoz/pt_II/metagenome/midas_output/S-1504-86.B_midas/snps/temp/genomes --very-sensitive --threads 2 -q -1 /usr/local/projdata/0497/projects/CariesBiome/metagenomics/run1/subsample-0.25_assembly/fastq_files/S-1504-86.B_RD1_kneaddata_paired_1_sub-0.25.fastq.gz -2 /usr/local/projdata/0497/projects/CariesBiome/metagenomics/run1/subsample-0.25_assembly/fastq_files/S-1504-86.B_RD1_kneaddata_paired_2_sub-0.25.fastq.gz | /usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/bin/Linux/samtools view -b - --threads 2 | /usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/bin/Linux/samtools sort - --threads 2 -o /usr/local/projdata/0497/projects/CariesBiome/jespinoz/pt_II/metagenome/midas_output/S-1504-86.B_midas/snps/temp/genomes.bam

Error message:
/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/bin/Linux/samtools: error while loading shared libraries: libbz2.so.1.0: cannot open shared object file: No such file or directory
/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/bin/Linux/samtools: error while loading shared libraries: libbz2.so.1.0: cannot open shared object file: No such file or directory
/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/bin/Linux/bowtie2-align-s: symbol lookup error: /usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/bin/Linux/bowtie2-align-s: undefined symbol: gzopen64
(ERR): bowtie2-align exited with value 127

Issue with strain phylogeny

Hi Stephen,

Thank you for your program!

I am trying to use MIDAS to construct a detailed strain phylogeny for a single species, and am running into a bit of trouble in the final steps. I created a custom database using a previously assembled pan-genome reference and ran run_midas.py species and run_midas.py snps individually for three metagenomic samples (each around 2GB). I merged the SNP results for the three samples and then ran call_consensus.py. When I used the consensus.fa output file to construct a phylogenetic tree with FastTree, it produced a tree with only three leaves, treating each metagenomic output as a strain rather than identifying many different strains within each metagenome. Is this the way the program is supposed to work, or could settings be adjusted to produce a more detailed phylogeny?

Thanks for your help,

Kyle Campbell

The database is not found when running the command testing MIDAS

I am trying to install MIDAS and run the test command for the first time. I have installed MIDAS on macOS Sierra version 10.12.6. Every time I run the sample test, it cannot find the database midas_db_v1.2, even though I have downloaded it and unpacked the tarball as the following link suggests: https://github.com/snayfach/MIDAS/blob/dev/docs/ref_db.md

I have tried everything I can think of. I am very new to programming, so I apologize if this is a very silly question.

Thanks for your help in advance!
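
One quick thing to rule out, assuming the database location is passed through the MIDAS_DB environment variable (as another report on this page does in its bash profile), is that the variable is unset or points somewhere that does not exist. The helper below is a hypothetical sketch, not part of MIDAS.

```python
import os


def check_midas_db(env=None):
    """Report whether MIDAS_DB is set and points to an existing directory."""
    env = os.environ if env is None else env
    db_dir = env.get("MIDAS_DB")
    if not db_dir:
        return "MIDAS_DB is not set"
    if not os.path.isdir(db_dir):
        return "MIDAS_DB points to a missing directory: %s" % db_dir
    return "MIDAS_DB looks usable: %s" % db_dir


if __name__ == "__main__":
    print(check_midas_db())
```

Run it from the same shell session in which the test script is run, since an export added to a profile file only takes effect in newly opened shells.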


A suggestion to use #!/usr/bin/env python instead of #!/usr/bin/python

It is better to use #!/usr/bin/env python than #!/usr/bin/python, because it gives the user maximum flexibility in choosing the interpreter. For example, our HPC has several versions of Python, and the default version (/usr/bin/python) is quite old. After loading another version of Python, we cannot simply run run_phylo_cnv.py (after appending it to the PATH), but need to run "python scripts_dir/run_phylo_cnv.py".
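
As a small illustration of the suggestion, the script below uses the env-based shebang and reports which interpreter actually ran it. It is a standalone sketch, not a MIDAS script, and `interpreter_info` is a made-up name.

```python
#!/usr/bin/env python
"""Report which Python interpreter picked up this script."""
import sys


def interpreter_info():
    """Return the interpreter path and its (major, minor, micro) version."""
    return sys.executable, tuple(sys.version_info[:3])


if __name__ == "__main__":
    executable, version = interpreter_info()
    print("running under %s (Python %d.%d.%d)" % ((executable,) + version))
```

With this shebang, the first `python` found on the PATH is used, so loading a different Python module on the cluster changes the interpreter without editing the script.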

Bowtie error in snp calling pipeline

I'm getting the following error when trying to run the run_midas snps pipeline:

Error encountered executing:
apps/MIDAS/bin/Linux/bowtie2-build ERR866577_species/snps/temp/genomes.fa ERR866577_species/snps/temp/genomes

Error message:
Warning: Empty fasta file: 'ERR866577_species/snps/temp/genomes.fa'
Warning: All fasta inputs were empty
Error: Encountered internal Bowtie 2 exception (#1)
Command: bowtie2-build --wrapper basic-0 ERR866577_species/snps/temp/genomes.fa ERR866577_species/snps/temp/genomes

I get this both when running the automated pipeline and when specifying a taxon ID. The species inference appears to work perfectly fine. Any help would be great.

Thanks!

Installing MIDAS on Mac OS X 10.11 and later

This is not a question but rather a comment to help those in the same situation.

In El Capitan and later, due to System Integrity Protection, the default instructions to install dependencies will not work*. Instead, one must first configure Mac OS X to look for Python packages outside of /System/Library. See instructions from mfripp at StackExchange.

After that, install the dependencies with this command:

sudo -H pip install --upgrade numpy biopython pysam pandas

*The default instructions fail because the version of numpy shipped with Mac OS X is 1.8.0rc1, which the version checker does not parse correctly, causing an error. This in itself is potentially a bug, but this conflict between pip and SIP occurs with many other programs as well (e.g., Qiime).
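
To show why a string like 1.8.0rc1 can trip up a version check, here is a hedged sketch. How the MIDAS version checker is actually written is not shown in this thread, so `naive_parse` is only a guess at the kind of code that fails, and `parse_version` is one tolerant alternative that keeps the leading numeric components.

```python
import re


def naive_parse(text):
    """Hypothetical strict parser: fails on suffixes such as 'rc1'."""
    return tuple(int(part) for part in text.split("."))


def parse_version(text):
    """Return only the leading dotted-integer components of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("no numeric version found in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


try:
    naive_parse("1.8.0rc1")
    naive_failed = False
except ValueError:
    # int("0rc1") is not a valid integer literal.
    naive_failed = True

tolerant = parse_version("1.8.0rc1")  # release-candidate suffix is ignored
```

Ignoring the suffix treats a release candidate as equal to the final release, which is usually acceptable for a minimum-version check but is a design choice worth stating.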

RunSpecies fail during test, could not execute bowtie2 binary

Trying to set up MIDAS on a server running CentOS with updated version of bowtie2 (2.3.2) on the server. Can't get past this failure.

[~]$ ./test_midas.py -vf
test_class (__main__._01_CheckEnv) ... ok
test_class (__main__._02_ImportDependencies) ... ok
test_class (__main__._03_CheckVersions) ... ok
test_class (__main__._04_HelpText) ... ok
test_class (__main__._05_RunSpecies) ... FAIL

======================================================================
FAIL: test_class (__main__._05_RunSpecies)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./test_midas.py", line 90, in test_class
    self.assertTrue(code==0, msg=err)
AssertionError:
Error: could not execute bowtie2 binary: /~/MIDAS/bin/Linux/bowtie2
(exited with error code 127)
To solve this issue, follow these steps:

  1. Go to https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.3.2
  2. Download bowtie2-2.3.2-linux-x86_64.zip
  3. Unpack the software on your system
  4. Copy the new bowtie2 binaries to: /usr/local/genome/bin/MIDAS/bin/Linux

Ran 5 tests in 2.078s

FAILED (failures=1)

> 2 alleles for a single sample

Hi Stephen,

If a single sample has >2 alleles at a single site, is there any way to obtain information about all the alleles from MIDAS output? E.g. let's say at a single site in a single sample there are 10 reads with a T, 10 reads with a C, and 5 reads with an A. Can we obtain info about all the nucleotides present or only the major ones (T and C)?

Help with genes error: libchtslib

Hello,
I have been trying to get MIDAS, installed by a former lab member, to run. So far I think that I have run_midas.py species running, but when I then try run_midas.py genes on the samples I just ran through species, I receive the following error:
Traceback (most recent call last):
  File "/Users/mcleanlab/Tools/MIDAS/scripts/run_midas.py", line 757, in <module>
    run_program(program, args)
  File "/Users/mcleanlab/Tools/MIDAS/scripts/run_midas.py", line 79, in run_program
    genes.run_pipeline(args)
  File "/Users/mcleanlab/Tools/MIDAS/midas/run/genes.py", line 286, in run_pipeline
    pangenome_coverage(args, species, genes)
  File "/Users/mcleanlab/Tools/MIDAS/midas/run/genes.py", line 149, in pangenome_coverage
    count_mapped_bp(args, species, genes)
  File "/Users/mcleanlab/Tools/MIDAS/midas/run/genes.py", line 173, in count_mapped_bp
    import pysam
  File "/Users/mcleanlab/Downloads/pysam-master/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *
ImportError: No module named libchtslib
I have looked for libchtslib and found libchtslib.pxd, libchtslib.c and libchtslib.o deep inside the pysam-master folder. However, even adding the corresponding folders to the Python path did not correct the issue.
I am looking for help. Hopefully I'm just confused and the answer is simple.

thank you for your time
Erik Hendrickson

Running in Python 2 conda environment works for `species` and `genes` modules but not `snps`.

I'm trying to run the MIDAS pipeline using a Python 2 environment configured for use with MIDAS. My main environment is a Python 3.6 environment and it looks like the correct Python version is being called in my Python 2 environment but one stage is calling multiprocessing from Python 3.6.

Do you know if there are any patches I can do to bypass this and call the correct multiprocessing module?

(python2) -bash-4.1$ cat snps_18.e8829501
Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/scripts/run_midas.py", line 757, in <module>
    run_program(program, args)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/scripts/run_midas.py", line 82, in run_program
    snps.run_pipeline(args)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/midas/run/snps.py", line 301, in run_pipeline
    pysam_pileup(args, species, contigs)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/midas/run/snps.py", line 228, in pysam_pileup
    aln_stats = utility.parallel(species_pileup, argument_list, args['threads'])
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/midas/utility.py", line 101, in parallel
    return [r.get() for r in results]
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/python2/lib/python2.7/site-packages/MIDAS/midas/utility.py", line 101, in <listcomp>
    return [r.get() for r in results]
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
    put(task)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object
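
The `<listcomp>` frame in the traceback is something only Python 3 produces, and the multiprocessing paths point into python3.6, which suggests the script itself may be running under the Python 3.6 interpreter even though it lives in the python2 environment's site-packages. A minimal sketch for confirming which interpreter and which multiprocessing module a job actually uses is below; `describe_runtime` is a made-up helper, not part of MIDAS.

```python
import multiprocessing
import sys


def describe_runtime():
    """Collect the interpreter path, version, and multiprocessing location."""
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % tuple(sys.version_info[:3]),
        "multiprocessing": multiprocessing.__file__,
    }


if __name__ == "__main__":
    for key, value in sorted(describe_runtime().items()):
        print("%s: %s" % (key, value))
```

Running this with the same `python` command used in the job script, inside the same environment, shows whether the job resolves to the Python 2.7 interpreter or to the main Python 3.6 one.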

genome_info.txt file not created by build_midas_db.py

Causes the following error when running merge_midas.py snps:

Traceback (most recent call last):
  File "/global/projectb/scratch/snayfach/tools/MIDAS-dev/scripts/merge_midas.py", line 441, in <module>
    run_program(program, args)
  File "/global/projectb/scratch/snayfach/tools/MIDAS-dev/scripts/merge_midas.py", line 431, in run_program
    snps.run_pipeline(args)
  File "/global/projectb/scratch/snayfach/tools/MIDAS-dev/midas/merge/snps.py", line 382, in run_pipeline
    species_list = merge.select_species(args, dtype='snps')
  File "/global/projectb/scratch/snayfach/tools/MIDAS-dev/midas/merge/merge.py", line 161, in select_species
    species = init_species(samples, args, dtype)
  File "/global/projectb/scratch/snayfach/tools/MIDAS-dev/midas/merge/merge.py", line 130, in init_species
    genome_info = read_genome_info(args['db'])
  File "/global/projectb/scratch/snayfach/tools/MIDAS-dev/midas/merge/merge.py", line 100, in read_genome_info
    for r in csv.DictReader(open(path), delimiter='\t'):
IOError: [Errno 2] No such file or directory: '/global/projectb/scratch/snayfach/projects/dc4/midas_db/genome_info.txt'
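
Until the file is generated by the build script, a guard around the read would at least make the failure easier to interpret. The sketch below mirrors the `csv.DictReader(open(path), delimiter='\t')` call visible in the traceback, but it is a simplified stand-in for `read_genome_info`, and the error wording is an assumption.

```python
import csv
import os


def read_genome_info(db_dir):
    """Read genome_info.txt from a database directory, with a clearer error."""
    path = os.path.join(db_dir, "genome_info.txt")
    if not os.path.isfile(path):
        raise IOError(
            "genome_info.txt not found in %s; the database may be incomplete "
            "or was built without this file" % db_dir
        )
    with open(path) as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```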

How is genes_depth.txt computed?

Hi Stephen,

I was wondering if you could help answer a question about genes_depth.txt. I'm trying to figure out how exactly depth (or, coverage in the field below) is computed for each gene. I have pasted an example case below for a sample from the HMP and the corresponding MIDAS output for the genes (pre-merge format).

sample ID=700015181
numbp=570 (obtained from the pangenome for B. uniformis)

gene_id count_reads coverage copy_number
1235787.3.peg.10 25 4.33859649123 0.0168362540388

I went through the count_mapped_bp function in genes.py, and I'm guessing that you divide the number of aligned bp by the total number of bp in a gene to get coverage. Is this correct?

genes[gene_id].depth += aln_len/float(gene_len)

Is count_reads the total number of reads aligning to a gene, regardless of where in the gene it aligns (i.e. if only part of a read aligns to the gene and the other part is hanging over covering a neighboring gene, is this read still counted in count_reads)?

Thanks,
Nandita
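
If the guess in this post is right, the numbers quoted above are at least self-consistent: a coverage of 4.33859649123 on a 570 bp gene corresponds to about 2473 aligned bp (4.33859649123 × 570 ≈ 2473), or roughly 99 bp per read across 25 reads. The sketch below accumulates depth the same way as the quoted line from genes.py. The per-read alignment lengths are hypothetical, chosen only so that they sum to 2473 bp; they are not taken from the sample.

```python
def gene_depth(aln_lengths, gene_len):
    """Sum aligned bp over gene length, one alignment at a time."""
    depth = 0.0
    for aln_len in aln_lengths:
        depth += aln_len / float(gene_len)
    return depth


# Hypothetical alignments: 24 reads of 99 bp and one of 97 bp (2473 bp total).
aln_lengths = [99] * 24 + [97]
depth = gene_depth(aln_lengths, 570)
count_reads = len(aln_lengths)
```

Under this reading, coverage is an average depth over the gene length, so a read that only partly overlaps the gene would contribute only its aligned portion to coverage. Whether such a read still counts toward count_reads is the open question in the post.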

Trouble with uniquely mapped reads

I am just trying to learn how to use the tool and am having issues with the first step (species taxonomy) when running samples downloaded from NCBI. I ran the test script and went through the tutorial without issue. I downloaded one of the sequences from the Hadza microbiome study and tried to run MIDAS on it. At the first step I did not get any uniquely mapped reads, and therefore can't seem to get the species taxonomy and abundance for the sample. I've run into this with other sequences I've tried as well. Here is the result I got:

sbrgxps@sbrgxps-XPS-8910:~/MIDAS$ run_midas.py species SRR1927149 -1 ../rampelliseqs/SRR1927149.fq

MIDAS: Metagenomic Intra-species Diversity Analysis System
version 1.2.1; github.com/snayfach/MIDAS
Copyright (C) 2015-2016 Stephen Nayfach
Freely distributed under the GNU General Public License (GPLv3)

===========Parameters===========
Command: /home/sbrgxps/MIDAS/scripts/run_midas.py species SRR1927149 -1 ../rampelliseqs/SRR1927149.fq
Script: run_midas.py species
Database: /home/sbrgxps/MIDAS/midas_db_v1.2
Output directory: SRR1927149
Input reads (1st mate): ../rampelliseqs/SRR1927149.fq
Input reads (2nd mate): None
Remove temporary files: False
Word size for database search: 28
Minimum mapping alignment coverage: 0.75
Number of reads to use from input: use all
Number of threads for database search: 1

Aligning reads to marker-genes database
17.92 minutes
0.73 Gb maximum memory

Classifying reads
total alignments: 25486
uniquely mapped reads: 0
ambiguously mapped reads: 0
0.0 minutes
0.74 Gb maximum memory

Estimating species abundance
total marker-gene coverage: 0.0
0.0 minutes
0.74 Gb maximum memory

Did I install something incorrectly or is there an issue with how I am running the scripts? I am new to using metagenomics analysis tools so I wasn't sure where to start with troubleshooting.

Thank you!
