biocore / emp

Code repository of the Earth Microbiome Project.

Home Page: http://www.earthmicrobiome.org

License: BSD 3-Clause "New" or "Revised" License

Python 0.92% CSS 0.01% OpenEdge ABL 12.49% HTML 2.23% Jupyter Notebook 83.64% Shell 0.11% JavaScript 0.01% TeX 0.02% R 0.55% Ruby 0.01%

emp's Issues

compute observed species for all otu tables

Update on metadata and completed studies in database.

Jesse Stombaugh, Sarah Owens, Giancarlo Galindo and Gail had a conference today (7/23) to discuss the current status of studies in the database. Sarah will update us this week with studies that have been sequenced but for which we don't have the sequence data for processing. Both Sarah and Giancarlo will actively pursue metadata for studies that have been processed but have no useful metadata. Gail suggested that they request the information in whatever form the participants have it, then send it on to Gail for addition and upload to the study site. Greg will be given access to the current-status Google Doc that has a summary of all the studies in the database and a list of those that have been sequenced. We are all agreed that August 1st is the deadline for Greg to start the analysis. Do other study participants (Jack, Janet, Rob, Noah, Antonio) need access to the Google Doc?

Post and improve SOPs

Post the iPython notebooks folks have been using for data processing. We can then work to improve their documentation and verify reproducibility.

Begin writing overview of processing steps in wiki.

Open questions:

How should we divide up the various pipelines? e.g. Should demultiplexing and quality filtering be in the same or different notebooks?
How do we verify we are using the same software environments? Maybe have a software deployment wiki page or software deployment notebook?

driving factors of interest

It would be good to have a list of driving factors we expect to be important in differentiating samples, based on previous work (e.g. Lozupone and Knight's finding about salinity). We can test for these directly. Note: we can also see what falls out from machine learning, etc. A few to start with:

salinity (saline vs. non-saline)
life style (host-associated vs. free-living)
env_matter

compile list of EMP samples that will be included in this study

Adding Golay seqs to 16S reverse primers for dual barcoding

Hello. Our lab is trying to develop 16S sequencing protocol and we'd like to have barcodes on both the forward and reverse primers. We are using the first 25 forward primers (515f) with golay barcodes and I was wondering if we can just copy the next 25 golay barcodes from the forward primers and paste them in the appropriate place within the reverse primer sequence to have dual barcoding.

If someone can advise on how to develop barcoded reverse primers for 16S, that would be great.

Thank you!
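One check that may be worth running before ordering primers is whether the combined forward and reverse barcode set contains duplicates or near-duplicates. The following is a minimal sketch, assuming barcodes are equal-length strings; the four 12-mers are made-up placeholders, not real Golay barcodes from the EMP primer list, and this check says nothing about whether the resulting primer constructs will amplify well.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the list."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Placeholder 12-mers for illustration only; these are NOT real Golay barcodes.
forward_barcodes = ["AAAACCCCGGGG", "ACGTACGTACGT"]
reverse_barcodes = ["TTTTGGGGCCCC", "TGCATGCATGCA"]

all_barcodes = forward_barcodes + reverse_barcodes
assert len(set(all_barcodes)) == len(all_barcodes), "duplicate barcode found"
print(min_pairwise_distance(all_barcodes))
```

A small minimum distance would suggest that a single sequencing error could turn one barcode into another.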

map of the earth containing location of all samples (and sub-maps colored by sample site)

comparing spatial and temporal scales

From Rob:

It would be really good if we could start delivering on the "comparing spatial and temporal scales" idea in a more specific way. Do we have datasets that support this well currently? For example, can we look at the effects of pH or temperature or salinity etc. within and between ecosystem types?

are there any shared OTUs between all ecosystems surveyed

load final reference sequence collection to github

taxonomy assignment on all 'new OTUs'

visualization of the environmental parameters gradients

From Jack:

I would also like to have a visual representation of the environmental gradients we have for each ecosystem, i.e. I can imagine a figure like the attached (sorry, in my hotel room) where we represent, on a scale of 0-100, the coverage of the gradients we have already surveyed. 0 would be the lowest possible (sensible) limit for that variable and 100 the highest. So for temp we would go from -56C to +120C, and for pH from 1 to 14, or something like that. I could have someone start creating this if everyone agrees it is a good idea.
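The 0-100 scaling described here could be sketched as follows. The bounds for temperature and pH are the ones suggested above; the function names and the sample measurements are hypothetical and only illustrate the arithmetic.

```python
# Sensible lower/upper limits per variable, taken from the suggestion above.
BOUNDS = {
    "temperature_c": (-56.0, 120.0),
    "ph": (1.0, 14.0),
}


def scale_to_percent(value, low, high):
    """Map a measurement onto a 0-100 scale between its sensible limits."""
    return 100.0 * (value - low) / (high - low)


def gradient_coverage(values, variable):
    """Return (lowest, highest) surveyed point on the 0-100 scale."""
    low, high = BOUNDS[variable]
    scaled = [scale_to_percent(v, low, high) for v in values]
    return min(scaled), max(scaled)


# Hypothetical measurements for one ecosystem, for illustration only.
soil_ph = [4.2, 5.5, 7.5, 8.1]
print(gradient_coverage(soil_ph, "ph"))
```

Each ecosystem would then contribute one bar per variable, spanning from the lowest to the highest surveyed point.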

16S reference database effects

Choice of 16S database affects results to some degree. Main choices are RDP, SILVA, and Greengenes.

Which is more representative for environmental microbes, for host-associated microbes?
How do the results (downstream analyses) change with different databases?
Silva has better representation -- Greengenes team are working to update accordingly.

otu table 940 has file size 0

need to figure out why...

website: Defining the Tasks

Need to update. Current text from 2012:

There will be four key outputs from the EMP:

Gene Atlas (GA) is a centralized repository and database for all information acquired during this study. This resource will follow the KBASE initiative and provide a searchable format to hold all information regarding annotation, environmental metadata and sequence. This will be a repository for all information, both known and unknown, the latter is also known as Dark Matter.
Earth Microbiome Assembled Genomes (EM-AG) will encompass all genomes assembled from EMP data, which will be annotated using an automated pipeline and provided in public repositories and a KBASE derived analytical platform. This will enable comparative genomic analysis against all known and EMP-derived genomes and metagenomes.
Earth Microbiome Visualization Portal (EM-VIP) will engage with experts in interactive visualization software to synthesize our unique vision into a format accessible to all. Here we will view the earth in microbial space, describing environmental parameter space and genomic functional space, to allow the interrogation of EMP data and the discovery of new ecological theory.
Earth Microbiome Metabolic Reconstruction (EMMR) will be based on metagenomic metabolome description and prediction software such as modelSEED and Relative Metabolic Flux (RMF). We will describe changes in metabolites and metabolite profiles through time and biogeographic space. This will be used to produce descriptions regarding metabolite production in specific biomes, providing another metric against which to refine biome descriptions.

total number of OTUs

This is a contentious question, but it might be worth addressing in the paper. Here is what I wrote to Ken Locey about it (July 15, 2016):

We have picked OTUs using a variety of methods for the first ~27,700 samples in the EMP. It's hard to define a microbial species for both technical and philosophical reasons, but the best approximation we have is probably OTUs. Closed-reference methods give a fairly conservative number (mapped to a reference set clustered at 97% ID along the full-length 16S rRNA; we are sequencing only the V4 region). Open-reference (closed-reference plus de novo clustering of remaining sequences) gives a much larger number. Perhaps the fairest number is the number of unique sequences as determined using an error-correction method called Deblur (https://github.com/biocore/deblur). Here are the OTU counts we get using those different methods:

Closed-reference (Greengenes): 69,901
Closed-reference (Silva): 126,730
Open-reference (Greengenes): 11,014,898
Deblur (unique 100-bp sequences): 351,728

I think the most defensible statement would be: There are >350,000 unique (error-corrected) 100-bp 16S rRNA V4 region sequences identified in the EMP dataset to date.

PCoA plot of all samples

this may be challenging as it requires all samples in a single OTU table, and we're still having issues with tables of that size despite a lot of improvements to the biom-format objects

Inventory of samples pledged, received, processed, and released (including examples of projects)

random forests classifier

Some examples to consider:

EMPO categories -- How well can we predict the EMPO category at multiple levels? (Zech did this already)
ENVO categories
Soil geography -- Can we predict where a soil is from based on its microbiome? (need to control for soil type, pH, other confounding factors?) @gilbertjack ask about this.

Fix metadata for the studies that still don't have it, or delete those studies

It is critical that at launch we can say we are providing a set of studies with good metadata. We therefore need to set a hard deadline for studies with sufficient metadata to actually understand the study in isolation or in combination, then delete the studies that don't meet this criterion (rather than continuing to attempt to extract metadata from people who won't provide it and having the project look bad overall).

Sequences dropped by split libraries

Check drop of sequences at the split library stage and if that has any correlation with env or something else.

The easiest way that comes to mind to do this is to change the params in split libraries to discard sequences based only on barcode and then compare the numbers per sample ...
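Once per-sample counts are available from both runs, the comparison might look like the sketch below. It assumes two read counts per sample (barcode-only filtering vs. default filtering) and one numeric environmental variable; all names and numbers are hypothetical, and Pearson correlation is just one possible choice of statistic.

```python
import math


def fraction_dropped(reads_barcode_only, reads_default):
    """Fraction of barcode-matched reads lost to the remaining quality filters."""
    return (reads_barcode_only - reads_default) / reads_barcode_only


def pearson(xs, ys):
    """Pearson correlation coefficient (raises ZeroDivisionError if a variable is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical per-sample values: (reads with barcode-only filter, reads with defaults, pH).
samples = [(1000, 900, 4.5), (1200, 950, 6.0), (800, 500, 8.2)]

dropped = [fraction_dropped(raw, kept) for raw, kept, _ in samples]
ph = [p for _, _, p in samples]
print(pearson(dropped, ph))
```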

animated PCoA of the specific timeseries we have within the context of full PCoA plot

abundance figure

"Right now we are working only through the 16S - a cool figure would be like the one you did for the L4 6 year paper but for each biome - would be great to show the abundance for each taxon in such a tree - this is one suggestion" cite from Jack

Basically you want an abundance graph for each data set, or one figure with pie charts for each node showing the proportion in each data set?

compile mapping file for all samples

A mapping file should be exported from the QIIME DB for all samples that are being included. The studies that are being included are indicated in the spreadsheet here:

https://docs.google.com/spreadsheet/ccc?key=0AvglGXLayhG7dE5KWlhKN0VhaG90aWZrd090djlfUHc#gid=0

Identify good examples of studies to compare driving factors

We need to identify cases with matched metadata on chemistry so we can ask e.g. if pH and salinity cause similar shifts in community composition across sites in the same ecosystem type, and across different ecosystem types. This depends on having the metadata in the database in a form where it can be investigated easily.

Estimates of total # taxa in each environment

Is the quality filtering good enough to support analyses of how many taxa we expect to find in each environment with infinite sampling, or will we be embarrassed by the usual issues that plague rare biosphere work?

this is a test: you should get email notification of this issue

Qiime meta-analysis file download issues

I have been downloading the zipped file containing all the QIIME output files for all of the EMP studies that have been QIIME processed and noticed the following issues with the following studies.

go to http://microbio.me/qiime/fusebox.psp
Show Type: EMP
Select study if "Processed by QIIME:" == TRUE
Select Metadata Fields: ALL
Move all to Selected Metadata Fields
click continue
leave defaults on next page
click submit

I have tried these steps after deleting previous meta-analyses and creating new ones to see if behavior is reproducible.

For these studies no files are produced; the page only states “submitted meta-analysis” and does not display “job XXXX: ...”:
Caporaso_illumina_time_series
CaporasoIlluminaPNAS2011_5prime
Gasser_MWC_catchment_microbes
Grossart_German_lake_water_sediment
Haig_WaterPurif_temp_spat
Jansson_Alaskan_fire_chrono_Tanana
Metcalf_SanDiegoZoo_folivorus_primate
Moore_Yucatan_cenotes
Rees_VulcanoIsland_seawaterMedSeA
Spirito_Monensin_Cow_Hindgut

For these studies files are inconsistently produced; the page states “submitted meta-analysis” and does display “job XXXX: ...”:

MacRaeCrerar_Mongolia_soil
Thomas_CMB_Australia_kelp_soil

Jurelivicius_Antarctic_cleanup- sometimes produces an error
(not consistently reproducible)
Submitted Meta-Analyses:
Job 103142: COMPLETED_ERROR

Output:
Traceback (most recent call last):
  File "/home/wwwuser//git/qiime_web_app/python_code/scripts/make_mapping_file_and_otu_table.py", line 89, in <module>
    main()
  File "/home/wwwuser//git/qiime_web_app/python_code/scripts/make_mapping_file_and_otu_table.py", line 86, in main
    jobs_to_start,taxonomy,tree_fp)
  File "/home/wwwuser/git/qiime_web_app/python_code/generate_mapping_and_otu_table.py", line 521, in write_mapping_and_otu_table
    otu_table_file_dir_db)
  File "/home/wwwuser/git/qiime_web_app/python_code/generate_mapping_and_otu_table.py", line 561, in write_otu_table
    raise ValueError, 'Duplicate prokmsa ids! - %s ' % sample_name1
ValueError: Duplicate prokmsa ids! - SN.2.C.T3C.414524

Hultman_Geochemical_Landscapes_permafrost- sometimes produces an error
(appears reproducible)
Submitted Meta-Analyses:
Job 103208: COMPLETED_ERROR

when all qiime processed studies are selected produces this error
Submitted Meta-Analyses:
Job 103169: COMPLETED_ERROR

build open-reference OTU tables for commit to github

alignment and OTU tree

Find host-associated studies to include in analysis

I am having the San Diego Zoo folivorous monkey gut samples processed this week. Suggest including Becca Safran's barn swallow studies and Se Jin's fish. Other candidates: contact Liz Costello for dog samples? Val's frogs? Zoo mammal studies?

update of alpha diversity by sample type plots

100 most wanted list

OTUs that are abundant across many environment types and distant from sequences in Greengenes/NCBI. We'll have to develop a sorting scheme for this, but it would be a way to provide a list of the "most wanted" OTUs: the high-abundance cosmopolitan organisms that are not well characterized.
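One possible sorting scheme, purely as a starting point for discussion: score each OTU by how many environment types it occurs in, how abundant it is, and how far it is from its nearest reference sequence. The field names, the multiplicative score, and the example records below are all assumptions made for illustration.

```python
def most_wanted(otus, top_n=100):
    """Rank OTUs by a simple (arbitrary) product of prevalence, abundance, and novelty."""

    def score(otu):
        novelty = 1.0 - otu["ref_identity"]  # 0 = identical to a reference sequence
        return otu["n_env_types"] * otu["mean_rel_abundance"] * novelty

    return sorted(otus, key=score, reverse=True)[:top_n]


# Hypothetical OTU summaries, for illustration only.
otus = [
    {"otu_id": "otu_a", "n_env_types": 10, "mean_rel_abundance": 0.02, "ref_identity": 0.80},
    {"otu_id": "otu_b", "n_env_types": 2, "mean_rel_abundance": 0.05, "ref_identity": 0.99},
    {"otu_id": "otu_c", "n_env_types": 8, "mean_rel_abundance": 0.01, "ref_identity": 0.85},
]

for otu in most_wanted(otus, top_n=2):
    print(otu["otu_id"])
```

A product favors OTUs that do well on all three criteria at once; a weighted sum or rank aggregation would be equally reasonable alternatives.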

discrepancy in 806rB reverse primer sequence

Referring to the "16S rRNA Amplification Protocol" section, I noticed that the sequence of the 806rB reverse primer in the file "515f_806_16S_illumina_amplification_protocol_version_6_15.doc" (see page 2) is different from the 806rB sequence found in the file "515f-806rb_new.xls" (see sheet 1 of this Excel spreadsheet).

The difference is a single nucleotide (highlighted in bold below) within the "Primer Pad" region:

806rb sequence from 16S rRNA Amplification protocol version 6_15 (page 2): CAAGCAGAAGACGGCATACGAGATAGTCAGTCAGCCGGACTACNVGGGTWTCTAAT

806rb sequence from "515f_806rb_new.xls" spreadsheet (sheet 1):
CAAGCAGAAGACGGCATACGAGATAGTCAGCCAGCCGGACTACNVGGGTWTCTAAT

Which is the correct sequence for this 806rB reverse primer? I wish to order this primer, and would like to submit the correct sequence for synthesis.

Please let me know if my question is better posed elsewhere on this forum. Thanks!
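For reference, the position of the mismatch can be located with a short position-by-position comparison. This sketch uses the two sequences exactly as quoted above; the function and variable names are illustrative.

```python
def mismatches(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for each difference."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# Sequences as quoted in this issue.
protocol_seq = "CAAGCAGAAGACGGCATACGAGATAGTCAGTCAGCCGGACTACNVGGGTWTCTAAT"
spreadsheet_seq = "CAAGCAGAAGACGGCATACGAGATAGTCAGCCAGCCGGACTACNVGGGTWTCTAAT"

for pos, a, b in mismatches(protocol_seq, spreadsheet_seq):
    print(f"position {pos}: protocol={a} spreadsheet={b}")
```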

search for additional support for rare taxa across environments

Chart showing # samples and fraction of samples by sample type (we had discussed adding this display to the front page and keeping it updated)

Summary data that we'd want to include:

how many sequences?
how many collaborators?
how many biomes?
how many samples?
how many otus?

website: Affiliations, Publications, Meetings, Logo

These are from c. 2012 and need to be updated.

load 10k data files

BIOM table (will do by posting on S3 and providing a bash script which wgets it)
master mapping file
tree
new reference sequences
taxonomy assignments (loaded into BIOM, ideally)

Currently waiting on data files from @seangibbons.

similarity between biomes or sample types

Which biomes or sample types are most similar to each other? If you take a new sample (or a site sampled regularly over time that is changing), which biome/sample type does it most closely resemble.

@ElDeveloper has some code for this.

@ackermag has noted we need to make sample_type part of the metadata template. It would be great if we could find an existing ontology to use for this. I think the ENVO environmental material covers most of them, but there are some exceptions. We might need to make our own ontology that is applicable to both the EMP and American Gut.

global modeling

@rob-knight asked: can we do any global modeling stuff with the dataset we have?

May be of interest to @gilbertjack and Peter Larsen.

website: EMP Protocols and Standards

These need to be carefully checked and updated. Known issues:

Sequencing primers are incorrect
Fungal ITS protocol missing
Analysis page refers to closed-ref only

Guide to evaluating studies/metadata in the database

Short tutorial on how to tell if a study is interpretable if stored in the db (i.e. if the data and metadata are of sufficient quality to interpret the results).

website: Getting Involved

Need to update Getting Involved page and Global Environmental Sample Database page, both of which are from 2012.

merge per-study OTU tables to single master OTU table and post to GitHub

The OTU table merge operation is still running (in parallel using this code), and I'm not sure when it's going to complete.

To be on the safe side, we should begin performing analysis on OTU tables independently, as was done for the AAAS meeting. A lot of progress has been made on making the sparse OTU table objects more efficient, but there is still work to do, as is being discussed here and here.
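To illustrate what the merge has to do, here is a minimal sketch that uses plain nested dictionaries as a sparse representation (only non-zero counts are stored). It is not the biom-format API and not the parallel code referred to above; the table layout and the example data are assumptions.

```python
from collections import defaultdict


def merge_otu_tables(tables):
    """Merge sparse per-study tables of the form {otu_id: {sample_id: count}}.

    Counts are summed if the same OTU/sample pair appears in more than one table.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for table in tables:
        for otu_id, sample_counts in table.items():
            for sample_id, count in sample_counts.items():
                merged[otu_id][sample_id] += count
    # Convert back to plain dicts so the result is easy to inspect or serialize.
    return {otu_id: dict(samples) for otu_id, samples in merged.items()}


# Hypothetical per-study tables, for illustration only.
study_1 = {"otu_1": {"s1.a": 5, "s1.b": 2}, "otu_2": {"s1.a": 1}}
study_2 = {"otu_1": {"s2.a": 7}, "otu_3": {"s2.a": 4}}

master = merge_otu_tables([study_1, study_2])
print(master["otu_1"])
```

Because only non-zero entries are stored, memory grows with the number of observations rather than with the full OTU-by-sample matrix.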

update data URLs

Merge #36 was done to update data-urls.txt, but the links need to be updated again to reflect the most recent locations on ftp server.

variance within biomes (alpha diversity, beta diversity)

blooming taxa analysis

@rob-knight suggested: we should definitely do some blooming taxa analysis as per the Shade paper and maybe encourage Ashley to do that.

Ashley Shade is at Michigan State.

primer effects before/after introduction of new primers

New primers were introduced gradually in 2015. Here's what's different:

Barcodes are now located on the forward (515) primer to enable the usage of various reverse primer constructs to enable longer amplicons (tested on 806r and 926r).
Degeneracy was added to both the forward and reverse primers (see below), with the intent of removing known biases against Crenarchaeota/Thaumarchaeota (515f modification) and the marine and freshwater Alphaproteobacterial clade SAR11 (806r modification).

The hope is that these new primers improve detection of certain groups. But this complicates comparison of community composition before and after the new primers were introduced. We need to make sure that the conclusions we draw from large meta-analyses are not due to primer differences alone.

greengenes coverage tree: how much of greengenes is covered by samples, and how much does EMP add to greengenes?

update of 'places to look for new diversity' plot

Also, filter new OTUs to a new-only OTU table, summarize taxa, and define how many of these new OTUs are uncharacterized at the phylum level, at the class level, ...
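The per-rank summary could be sketched as below, assuming taxonomy assignments are stored as Greengenes-style strings in which an empty label such as "p__" means no assignment at that rank. The function name and the example strings are hypothetical.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def unclassified_by_rank(taxonomy_strings):
    """Count, for each rank, how many OTUs have no assignment at that rank."""
    counts = {rank: 0 for rank in RANKS}
    for taxonomy in taxonomy_strings:
        levels = [level.strip() for level in taxonomy.split(";")]
        for i, rank in enumerate(RANKS):
            label = levels[i] if i < len(levels) else ""
            # "p__Proteobacteria" -> "Proteobacteria"; "p__" or a missing level -> "".
            name = label.split("__", 1)[-1]
            if not name:
                counts[rank] += 1
    return counts


# Hypothetical assignments for two new OTUs, for illustration only.
new_otu_taxonomy = [
    "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__; f__; g__; s__",
    "k__Bacteria; p__; c__; o__; f__; g__; s__",
]

print(unclassified_by_rank(new_otu_taxonomy))
```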
