biocore / american-gut

American Gut open-access data and IPython notebooks

License: Other

Python 25.22% CSS 0.03% TeX 0.92% Shell 0.01% Jupyter Notebook 73.83%

american-gut's Introduction

American-Gut

American Gut open-access code and IPython notebooks

A note about data

American Gut sequences and metadata are deposited at the European Bioinformatics Institute (EBI) under accession ERP012803.

Bloom sequences found in the data repository are correct and up to date.

The OTU tables and mapping files hosted in this repository reflect the state of the project in May 2015 and earlier. This includes an earlier version of the American Gut survey and dietary questionnaire. The data in GitHub have been scrubbed for PHI. A listing of processed data with the new survey can be found at ftp://ftp.microbio.me/AmericanGut.

The latest OTU tables and precalculated diversity comparisons generated by the primary processing notebook set can be found at ftp://ftp.microbio.me/AmericanGut/latest.


INSTALL

Basics

The American-Gut repository is intended to be used as a project/repo, meaning there is no need to install it (ignore setup.py for the moment).

After cloning the repository, and before using the scripts, users should install the necessary dependencies. Two approaches are currently supported.

Conda based

If your package manager of choice is conda, dependencies can be installed with

$ conda install --file ./conda_requirements.txt
$ pip install -r ./pip_requirements.txt

If you would like to install the dependencies within a conda environment, be sure to switch to the appropriate environment before installing them.

Note: with pip, some libraries will have to be compiled from source, so the appropriate system libraries should be installed before running the pip command. For more details, see the Supported Operating Systems / Distributions section.

Pip based

$ pip install numpy==1.9.2
$ pip install -r ./pip_requirements.txt

If you would like to install the dependencies within a virtualenv environment, be sure to switch to the appropriate environment before installing them.

Note: with pip, some libraries will have to be compiled from source, so the appropriate system libraries should be installed before running the pip command. For more details, see the Supported Operating Systems / Distributions section.

Supported Operating Systems / Distributions

Debian 8

Tested with Debian 8.3.0 (amd64).

To compile the dependencies from source, the required system libraries can be installed (as root or via sudo) with

(root/sudo)$ aptitude install pkg-config libxslt1-dev libxml2 libfreetype6 \
    build-essential python-pip python-dev liblapack-dev liblapack3 \
    libfreetype6-dev libblas-dev libblas3 gfortran libhdf5-serial-dev libsm6

RUN

Basics

Although the American-Gut repo provides standalone scripts (the scripts folder) and a package (the americangut folder), it is primarily intended to be used through the notebooks (the ipynb folder).

There are a few environment variables that can be used to customize a run; a sketch of how the scripts might read them follows this list:

  • AG_TESTING: if set to True, the scripts will not download the American Gut EBI data (ERP012803) but will instead work with test data (a subset of the original EBI data). This is useful for testing.
  • AG_CPU_COUNT: the number of processes to use when parallelizing code (defaults to the number of cores).
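
As a rough sketch, a script might consume these variables like so (the helper is hypothetical; the real scripts may differ):

import multiprocessing
import os

def get_ag_settings():
    # AG_TESTING: use the bundled test data instead of the full EBI download
    testing = os.environ.get('AG_TESTING', '') == 'True'
    # AG_CPU_COUNT: number of worker processes; default to all available cores
    cpus = int(os.environ.get('AG_CPU_COUNT', multiprocessing.cpu_count()))
    return testing, cpus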

To generate the reports (PDFs), a TeX distribution should be installed on the system.

Adjusting environment on POSIX systems

Since the American-Gut repo contains both scripts and a package, PYTHONPATH and PATH need to be adjusted to reflect this. Therefore, prior to working with the notebooks, execute the following from within the American-Gut repo:

$ REPO=`pwd`
$ export PYTHONPATH=$REPO/:$PYTHONPATH
$ export PATH=$REPO/scripts:$PATH

If needed, adjust the AG_* environment variables described in the Basics section.

Run notebooks

Notebooks are written in two formats and therefore require different profiles.

Markdown based notebooks

Markdown based notebooks can be found in the ./ipynb/primary-processing/ folder and have the extension .md. To use these notebooks, we first need to create a profile for ag_ipymd with

$ ipython profile create ag_ipymd

and adjust the newly created /path/to/.ipython/profile_ag_ipymd/ipython_notebook_config.py by adding

#------------------------
# ipymd
#------------------------
c.NotebookApp.contents_manager_class = 'ipymd.IPymdContentsManager'

to the end of the file.

Now, we can start ipython with

$ ipython notebook --profile=ag_ipymd

and visit the newly started notebook server by going to http://localhost:8888

Jupyter/IPython based notebooks

Notebooks in the native notebook format can be found in the ./ipynb/ folder and have the extension .ipynb. To use these notebooks, we first need to create a profile for ag_default with

$ ipython profile create ag_default

Now, we can start ipython with

$ ipython notebook --profile=ag_default

and visit the newly started notebook server by going to http://localhost:8888

american-gut's People

Contributors

adamrp, antgonza, cuttlefishh, eldeveloper, embrietteh, jladau, josenavas, jwdebelius, mortonjt, samfway, squirrelo, teravest, wasade


american-gut's Issues

Processing Notebooks

Updating the processing notebooks

  • Scripts for processing and plotting (diversity_analysis.py; geography_library.py) #125
  • Preprocessing Notebook #126
  • Power Notebook #127
  • Age Notebook
  • Alcohol Notebook
  • Season Notebook
  • Exercise Notebook
  • Sleep Notebook
  • Plants Notebook

Conflicting Python resources and virtualenvs

I tried to run Daniel's notebooks tonight (11/10). The ipymd release I got (ipymd==0.1.1) required IPython 4.0 or greater. The code in mod2_pcoa.py would not run with IPython >= 4.0:

Traceback (most recent call last):
  File "/Users/jwdebelius/.virtualenvs/test_env/bin/mod2_pcoa.py", line 4, in <module>
    __import__('pkg_resources').require('americangut==0.0.1')
  File "/Users/jwdebelius/.virtualenvs/test_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 3018, in <module>
    working_set = WorkingSet._build_master()
  File "/Users/jwdebelius/.virtualenvs/test_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 614, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/Users/jwdebelius/.virtualenvs/test_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 627, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/Users/jwdebelius/.virtualenvs/test_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 805, in resolve
    raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: IPython<4.0.0

I did downgrade my matplotlib to 1.4, and this is a function which calls seaborn, but the error isn't related to those packages.

test_generate_otu_significance.py fails

python test_generate_otu_signifigance.py
.F.......
======================================================================
FAIL: test_calculate_tax_rank_1 (__main__.GenerateOTUSignifiganceTablesTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_generate_otu_signifigance.py", line 245, in test_calculate_tax_rank_1
    self.assertEqual(known_high_10, test_high_10)
AssertionError: Lists differ: [['k__Bacteria; p__Proteobacte... != [['k__Bacteria; p__Proteobacte...

First differing element 0:
['k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterbacteriaceae', 0.002, 7.6e-05, 26.3158, 1.450729834568669e-22]
['k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterbacteriaceae', 0.002, 7.6e-05, 26.0, 1.4507298345686689e-22]

Diff is 671 characters long. Set self.maxDiff to None to see it.

@jwdebelius can you look into this?
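
For reference, the 26.3158 in the expected value is just the ratio of the two abundances; the observed 26.0 presumably comes from a different rounding path in the code under test (a guess, not confirmed). Plain arithmetic:

ratio = 0.002 / 7.6e-05
print(round(ratio, 4))  # 26.3158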

Add Travis CI

We now have some test code, so it would be good to have Travis, too

Move ipynb/cluster_utils.py to a new repo

These utilities look really useful for other projects outside the American Gut scope.

Also, removing the IPython dependency would help make them more general.

Moving this repository to biocore

AFAIK @meganap created an organization called American Gut. Should we move this repository there, or to biocore? And if we move it to biocore, should we delete the American Gut organization, as it currently has nothing relevant to the project and might only confuse people?

Use un-rarefied OTU table

Can we use the un-rarefied OTU table for the results? Or is there a technical reason in the processing pipeline that does not allow us to do this?

Mislabeled metadata

Under TYPES_OF_PLANTS, there is a single survey result that entered 28 rather than 21-30.

Mind if I submit a PR to fix this? I'm pretty sure it is an error.
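
A sketch of the fix with pandas (the filename is hypothetical; the column name is from the issue):

import pandas as pd

# Remap the stray free-text answer onto the expected categorical bin
md = pd.read_csv('ag_mapping.txt', sep='\t', dtype=str)
md['TYPES_OF_PLANTS'] = md['TYPES_OF_PLANTS'].replace({'28': '21-30'})
md.to_csv('ag_mapping.txt', sep='\t', index=False)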

Help running module2_v1.0 notebook

Hi, I'm trying to run the IPython notebook module2_v1.0 using the test data (debug=True). I can now run the first three blocks of code without errors, but I ran into a few issues:

  1. In the code block that sets up the path for processing (chunk 4), the file BLOOM.fasta is not in the expected location, but copying it into the created americangut_results_r1-14 folder seems to remedy this.
  2. The 4th chunk then continues until it reaches the jobs = [] line, where it runs into the error below.
ValueError                                Traceback (most recent call last)
<ipython-input-7-960ae518d4d2> in <module>()
     21 for f in glob(os.path.join(working_dir, "*.biom.gz")):
     22     jobs.append(submit(scripts['gunzip'] % {'input': f}))
---> 23 res = wait_on(jobs)
     24 
     25 

<ipython-input-2-4e88260b8c1f> in wait_on(jobs_to_monitor, additional_prefix)
    122     sys.stdout.flush()
    123 
--> 124     running_jobs = parse_qstat()
    125     while jobs_to_monitor:
    126         sleep(POLL_INTERVAL)

<ipython-input-2-4e88260b8c1f> in parse_qstat()
     41 
     42     jobs = {}
---> 43     for id_, name, state in lines.grep(user).fields(0,3,9).fields():
     44         job_id = id_.split('.')[0]
     45         jobs[job_id] = {}

ValueError: need more than 2 values to unpack

Is there a version issue with something I've installed, maybe? I installed both the pip and conda requirements packages, and I'm a little lost now. I'm new to Python, so any help is appreciated! I'm running this on OS X 10.11.6 with 16 GB of memory and Python 2.7. Thanks!

conda_req and pip_req differences

Hi,

I'm trying to install americangut and I've stumbled upon some (to me) interesting/strange things. Is there a reason why pip_requirements.txt and conda_requirements.txt differ (e.g. one contains qiime while the other doesn't, and similarly for cython, IPython, pandas, ...)?

Generate ovals dynamically from PCoA data

Currently the ellipses (ovals) bounding the data in the PCoA plots are drawn manually in Illustrator and positioned in LaTeX until they pass the eyeball test. It would be great if the ellipses could be generated dynamically from the first and second coordinates (i.e. flattening the plot into two dimensions). Presumably the major axis of the ellipse would coincide with the linear regression of the points, and the minor axis would bisect it perpendicularly. Then it's a matter of capturing, e.g., 99% of the points. It seems plausible, and it would look better, be more accurate, and save us time down the road.
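
A sketch of one way to generate such an ellipse from the first two coordinates, using the covariance eigenvectors for the axes (a standard confidence-ellipse construction rather than a regression fit; the function name and the 3-standard-deviation capture level are illustrative):

import numpy as np
from matplotlib.patches import Ellipse

def bounding_ellipse(xy, n_std=3.0, **kwargs):
    # xy: (n_samples, 2) array of the first two PCoA coordinates
    center = xy.mean(axis=0)
    # Eigenvectors of the covariance matrix give the ellipse axes
    vals, vecs = np.linalg.eigh(np.cov(xy, rowvar=False))
    angle = np.degrees(np.arctan2(vecs[1, -1], vecs[0, -1]))
    # Axis lengths: n_std standard deviations along each principal axis
    width, height = 2 * n_std * np.sqrt(vals[::-1])
    return Ellipse(xy=center, width=width, height=height, angle=angle, **kwargs)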

test_select_gamma.py fails with ImportError

Looks like test_select_gamma.py tries to import from select_gamma.py, which only exists under the scripts folder. This has to be changed so that the library code being tested lives under the americangut module. Additionally, the test code should not import select_gamma the way it currently does; it should instead use:

from americangut.select_gamma import function_name

@amnona do you think you could take care of this?

The way American-Gut repo was intended to be used?

Hi,

I've been struggling with the American-Gut repo and the way I should use it for the past few days. If I understood correctly, the repo is broken into a package (the americangut dir) and auxiliary files. Some of these files are intended to be used by the package itself, while others are for interactive sessions, e.g. with IPython notebooks.

In #199, @jwdebelius recommends installing the package with pip install -e . --no-deps, so the americangut dir was indeed intended to be used as a package. Still, this will not install the latex and tests folders from package_data, since setup.py seems to be a bit misconfigured (package_data should be part of the package's src dir).

Also, running (e.g.) 01-get_sequences_and_metadata.md fails on study_accessions = agenv.get_study_accessions(), since it calls get_repository_dir (from results_utils.py), which strangely takes part of the full path (outside of the package dir) and tries to find 'data' and 'latex' there. Moreover, 'data' isn't even specified in setup.py.

Therefore, I'm not quite sure how I should use the repo. Should I define PYTHONPATH to include the repo and PATH to include scripts, without installing the package, or should I install the package (as recommended by @jwdebelius)? If I need to install it, what else would I need to adjust to make it work (PATHs, PYTHONPATHs, ...)?

We need a contributing.md document

As we start to expand analyses and contributors, we need contributing guidelines. These probably need to cover the following areas:

  • Requirements for testing code (i.e. separate logic and plotting; plotting does not need to be tested)
  • What data can be hosted on GitHub, and where alternative data should be hosted
  • Structure for IPython Notebooks
  • Whether IPython Notebooks represent a tutorial of how data was generated, or whether they need to reflect all data generated.

Analysis Summary Pipeline

I'd like to create something similar to the primary processing block, #161, for analysis, with the idea that the backend could be transferred over to Bokeh or similar later.

I think the easiest way to accomplish this might be to build a data dictionary backend that would let people operate on the metadata, and then a holding object for the classes able to interact with that object. Tests could go on top. (A minimal sketch of the question objects follows the list below.)

This way, we can have lightweight, individual notebooks for each analysis step, and hopefully just switch out the plotting code at some point.

I see the steps as:

  • Data Dictionary Objects
    • Parent Question Object (#188)
    • Categorical Question Objects (Categorical, Clinical, Frequency); (#192)
    • Boolean Question Objects (Bool, Multiple response); (#193)
    • Continuous Questions (#194)
  • Data handling Object (#195)
  • Data Dictionary for Ag Questions
  • Univariate Alpha diversity notebook with effect size
  • Alpha diversity notebook for Scott Kelley's analysis
  • Univariate Beta diversity notebook with effect size
  • Univariate OTU notebook (differential abundance and differential frequency)
  • Univariate PICRUSt Notebook (have discussed this with @mortonjt)
  • Multivariate alpha diversity
  • Multivariate beta diversity

This may be redundant with other repositories, although I think the first step is unique to American Gut.
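
A minimal sketch of what the parent and categorical question objects might look like (illustrative names only, not a final API):

class AgQuestion(object):
    # Describes one metadata question (one mapping-file column)
    def __init__(self, name, description, dtype):
        self.name = name
        self.description = description
        self.dtype = dtype

class AgCategorical(AgQuestion):
    # A question whose answers come from a fixed, possibly ordered set
    def __init__(self, name, description, order):
        super(AgCategorical, self).__init__(name, description, str)
        self.order = order

    def drop_undefined(self, series):
        # Keep only responses that fall in the allowed groups
        # (series is expected to be a pandas Series of answers)
        return series[series.isin(self.order)]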

Remove any commas from participant's name before creating \yourname{} macro

The way the templates are written in LaTeX, having a comma in the user's name, e.g. DALE EARNHARDT, JR., makes the rendered name slightly taller, which pushes some figures onto a second page. Deleting any commas from the name avoids this. Please do this for both the gut and skin/oral pipelines.

So instead of

\def\yourname{Dale Earnhardt, Jr.}

the macro should be defined as

\def\yourname{Dale Earnhardt Jr.}
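
In whatever code emits the template, something along these lines should suffice (a sketch; the actual template-writing function is not shown in this issue):

def format_yourname(name):
    # Strip commas so multi-part names stay on one line in the report
    return '\\def\\yourname{%s}' % name.replace(',', '')

print(format_yourname('Dale Earnhardt, Jr.'))
# \def\yourname{Dale Earnhardt Jr.}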

Primary processing revitalized

  • set up runipys similar to IAB
  • tie into .travis.yml, and set export AG_TESTING=True
  • get sequence and metadata (#162)
  • resolve filtering notebook (#163)
  • resolve OTU picking notebook (#164)
  • prepare for meta analyses (#165)
  • resolve alpha diversity analysis notebook (#166)
  • resolve beta diversity analysis notebook (#167)
  • resolve taxonomy summaries notebook (#168)
  • resolve categories collapse (#169)
  • resolve generating of participant results notebook (#170)
  • resolve project wide summary

Template does not include barcode

Can we include the barcode somewhere in the template? Some people have multiple fecal samples, and they will have no way to know which report goes with which sample (unless they go online and compare the images themselves). Maybe we should include the barcode after or below their name (or somewhere less prominent).

TaxTree tests are failing

yoshikivazquezbaeza:American-Gut@master$ python tests/test_taxtree.py 
.....F.
======================================================================
FAIL: test_sample_rare_unique (__main__.TaxTreeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_taxtree.py", line 28, in test_sample_rare_unique
    self.assertEqual(sorted(obs), exp)
AssertionError: Lists differ: [('a', None, [['k__1', 'p__x',... != [('a', None, [['k__1', 'p__x',...

First differing element 2:
('c', None, [], [])
('c', None, [['k__1', 'p__y', 'c__']], [])

  [('a',
    None,
    [['k__1', 'p__x', 'c__'], ['k__1', 'p__y', 'c__3']],
    [['k__1', 'p__x', 'c__1'], ['k__1', 'p__x', 'c__2']]),
   ('b', None, [['k__1', 'p__x', 'c__'], ['k__1', 'p__y', 'c__3']], []),
-  ('c', None, [], [])]
+  ('c', None, [['k__1', 'p__y', 'c__']], [])]

----------------------------------------------------------------------
Ran 7 tests in 0.002s

FAILED (failures=1)

Cannot import biom file in R

I am new to biom files and tried to download the biom file here for testing. I clicked and saved the raw file to my hard disk and tried to open it in R using the phyloseq and biom packages, but it returns:

"Error in fromJSON(content, handler, default.size, depth, allowComments, :
invalid JSON input"

I have copied and pasted an example URL (https://github.com/biocore/American-Gut/blob/master/data/HMP/HMPv35_100nt.biom) into some JSON validators on the web, but they also return an error.

Could you point me to some materials I can start with for working with the biom files from here? Thank you so much!

Regards,
Carol
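
The usual culprit here is saving GitHub's HTML page rather than the raw file. A quick way to check the download, and to load it in Python instead (assuming the biom-format package is installed):

from biom import load_table

# A real BIOM 1.0 file begins with '{' (JSON); a saved HTML page begins with '<'
with open('HMPv35_100nt.biom', 'rb') as f:
    print(f.read(1))

table = load_table('HMPv35_100nt.biom')  # raises if the file is not valid BIOM
print(table.shape)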

nbviewer error

The notebook isn't running properly in its Binder environment.
I'm receiving this error:

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-r_uybfy_/scikit-bio/

install trouble - conflicting ipython requirements

I'm trying to install americangut, but I run into this error:

$ sudo python setup.py install --prefix=/home/directory
...
Processing dependencies for americangut==0.0.1
error: ipython 3.2.3 is installed but ipython>=4.0.0 is required by set(['ipykernel'])

But when I install ipython 4.1.1 and re-run the same command, the americangut installer replaces it with ipython 3.2.3. I think this might be because IPython<4.0.0 is listed in the conda requirements.

Any ideas on how to get this installed? Thanks!

Add additional stats to taxonomy summary

A participant requested additional stats. We could add N (for the group), the mean, and the standard error. For these numbers to make sense, though, we will likely need to operate on rarefied data, which is not ideal for the taxonomy summaries.
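
A sketch of the per-group computation with pandas (function and argument names are made up for illustration):

import pandas as pd

def taxon_summary(abundance, groups):
    # abundance: per-sample values for a single taxon (pd.Series)
    # groups: group labels aligned to the same samples (pd.Series)
    grouped = abundance.groupby(groups)
    return pd.DataFrame({'N': grouped.size(),
                         'mean': grouped.mean(),
                         'stderr': grouped.sem()})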
