biocore / american-gut
American Gut open-access data and IPython notebooks
License: Other
Hi,
I'm trying to install americangut and I've stumbled upon some things that seem strange to me. Is there a reason why pip_requirements.txt and conda_requirements.txt differ (e.g., one contains qiime while the other doesn't, and similarly for cython, IPython, pandas, ...)?
As we start to expand analyses and contributors, we need contributing guidelines. These probably need to cover the following areas:
AFAIK @meganap created an organization called American Gut; should we move this repository there, or to biocore? And if we are moving this to biocore, should we delete the American Gut organization, as it currently has nothing relevant to the project and might only confuse people?
Looks like test_select_gamma.py tries to import from select_gamma.py, which only exists under the scripts folder. This has to be changed so that the library code being tested lives under the americangut module. Additionally, the test code should not import select_gamma as it currently does; it should import:
from americangut.select_gamma import function_name
@amnona do you think you could take care of this?
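A minimal sketch of what the relocated test could look like, assuming select_gamma.py is moved into the americangut package; the function name select_gamma here is hypothetical, standing in for whatever the module actually exposes:

```python
# tests/test_select_gamma.py -- sketch only; assumes select_gamma.py now
# lives inside the americangut package rather than under scripts/.
import unittest

from americangut.select_gamma import select_gamma  # hypothetical function name


class SelectGammaTests(unittest.TestCase):
    def test_importable(self):
        # Placeholder check; real tests would exercise the module's API.
        self.assertTrue(callable(select_gamma))


if __name__ == '__main__':
    unittest.main()
```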
These utilities look really useful for other projects outside the American Gut scope.
Also, removing the IPython dependency would help make them more general.
The way the templates are written in LaTeX, having a comma in the user's name, e.g. DALE EARNHARDT, JR., makes the name slightly taller, which pushes some figures onto a second page. Deleting any commas in the name will avoid this. Please do this for both the gut and skin/oral pipelines.
So instead of
\def\yourname{Dale Earnhardt, Jr.}
the macro should be defined as
\def\yourname{Dale Earnhardt Jr.}
Can we use the un-rarefied OTU table for the results? Or is there a technical reason in the processing pipeline that does not allow us to do this?
I tried to run Daniel's notebooks tonight (11/10). The ipymd release I got (ipymd==0.1.1) required IPython 4.0 or greater, but the code in mod2_pcoa.py would not run with IPython >= 4.0:
Traceback (most recent call last):
File "/Users/jwdebelius/.virtualenvs/test_env/bin/mod2_pcoa.py", line 4, in <module>
__import__('pkg_resources').require('americangut==0.0.1')
File "/Users/jwdebelius/.virtualenvs/test_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 3018, in <module>
working_set = WorkingSet._build_master()
File "/Users/jwdebelius/.virtualenvs/test_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 614, in _build_master
return cls._build_from_requirements(__requires__)
File "/Users/jwdebelius/.virtualenvs/test_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 627, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/Users/jwdebelius/.virtualenvs/test_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 805, in resolve
raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: IPython<4.0.0
I did downgrade matplotlib to 1.4, and this function calls seaborn, but the error isn't related to those packages.
The per-sample results script currently generates PDFs for the alpha diversity graphs. These cannot easily be shown on the AG webpage; PNG results will be much easier to embed.
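A minimal sketch of the change, assuming the script builds its figures with matplotlib (the figure contents here are placeholders):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])  # placeholder for the alpha diversity plot

# Instead of fig.savefig('alpha_diversity.pdf'), write a web-friendly PNG;
# dpi controls the resolution of the rasterized output.
fig.savefig('alpha_diversity.png', dpi=300)
```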
We will begin to produce 18S sequencing data for participants and need a processing pipeline to produce those results.
@ellexis will work on this
Hi, I'm trying to run the ipython notebook module2_v1.0 using the test data (debug=True). I can now run the first three blocks of code without errors, but I ran into a few issues:
- BLOOM.fasta is not in the expected location, but copying and pasting it into the created americangut_results_r1-14 folder seems to remedy this.
- The notebook fails at the jobs = [] line, where it runs into the below error:
ValueError Traceback (most recent call last)
<ipython-input-7-960ae518d4d2> in <module>()
21 for f in glob(os.path.join(working_dir, "*.biom.gz")):
22 jobs.append(submit(scripts['gunzip'] % {'input': f}))
---> 23 res = wait_on(jobs)
24
25
<ipython-input-2-4e88260b8c1f> in wait_on(jobs_to_monitor, additional_prefix)
122 sys.stdout.flush()
123
--> 124 running_jobs = parse_qstat()
125 while jobs_to_monitor:
126 sleep(POLL_INTERVAL)
<ipython-input-2-4e88260b8c1f> in parse_qstat()
41
42 jobs = {}
---> 43 for id_, name, state in lines.grep(user).fields(0,3,9).fields():
44 job_id = id_.split('.')[0]
45 jobs[job_id] = {}
ValueError: need more than 2 values to unpack
Is there a version issue with something I've installed, maybe? I installed both the pip requirements and conda requirements packages, and I'm a little lost now. I'm new to Python, so any help is appreciated! I'm running this on OSX 10.11.6 with 16 GB memory and Python 2.7. Thanks!
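For context on the traceback above: the failing line unpacks exactly three fields from each row of qstat output, so any row with fewer whitespace-delimited columns than expected (for example, when no queueing system is available on OSX) raises exactly this ValueError. A minimal sketch of a more defensive parse; the field positions mirror the notebook's .fields(0, 3, 9), but the rest is hypothetical:

```python
def parse_qstat_lines(lines, user):
    """Parse qstat-style output defensively, skipping malformed rows."""
    jobs = {}
    for line in lines:
        if user not in line:
            continue
        fields = line.split()
        # Positions 0, 3, and 9 mirror the notebook's .fields(0, 3, 9);
        # skip rows that are too short instead of raising ValueError.
        if len(fields) < 10:
            continue
        id_, name, state = fields[0], fields[3], fields[9]
        jobs[id_.split('.')[0]] = {'name': name, 'state': state}
    return jobs
```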
ERP005651
The barcode prefix is truncated from the AG_full.txt map and the AG.txt file.
A participant requested adding additional stats. We could add N (for the group), mean, and standard error. For these numbers to make any sense, though, we'll likely need to operate on rarefied data, which is not ideal for the taxonomy summaries.
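A minimal sketch of computing those statistics per group with pandas; the 'group' and 'abundance' columns are hypothetical stand-ins for the real metadata fields:

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b'],
                   'abundance': [0.10, 0.30, 0.20, 0.25, 0.15]})

# N, mean, and standard error of the mean for each group.
stats = df.groupby('group')['abundance'].agg(['count', 'mean', 'sem'])
print(stats)
```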
Currently the ellipses (ovals) bounding the data in the PCoA plots are drawn manually in Illustrator and positioned in LaTeX until they pass the eyeball test. It would be great if the ellipses could be generated dynamically from the first and second coordinates (i.e. flattening the plot into 2 dimensions). Presumably the major axis of the ellipse would coincide with the linear regression of the points, and the minor axis would bisect it perpendicularly. Then it's a matter of capturing e.g. 99% of the points. It seems plausible, and would look better, be more accurate, and save us time down the road.
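A minimal sketch of one way to do this, assuming the first two PCoA coordinates are the rows of a NumPy array: the axes come from the eigenvectors of the 2-D covariance matrix (close in spirit to, though not literally, the regression line), and a chi-squared quantile scales the ellipse to cover roughly 99% of points under a Gaussian assumption:

```python
import numpy as np
from scipy.stats import chi2
from matplotlib.patches import Ellipse


def coverage_ellipse(xy, coverage=0.99):
    """Return an Ellipse patch covering ~coverage of the 2-D points in xy."""
    center = xy.mean(axis=0)
    cov = np.cov(xy, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    scale = np.sqrt(chi2.ppf(coverage, df=2))    # chi-squared quantile, 2 dof
    width, height = 2 * scale * np.sqrt(eigvals[::-1])    # major, minor axes
    angle = np.degrees(np.arctan2(*eigvecs[:, 1][::-1]))  # major-axis rotation
    return Ellipse(center, width, height, angle=angle, fill=False)
```

The returned patch can then be added to the PCoA axes with ax.add_patch(coverage_ellipse(xy)).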
I'd like to create something similar to the primary processing block, #161, for analysis, with the idea that the backend could be transferred over to Bokeh or similar later.
I think the easiest way to accomplish this might be to build a data dictionary backend that would let people operate on the metadata, and then a holding object for the classes able to interact with that object. Tests could go on top.
This way, we can have lightweight, individual notebooks for each analysis step, and hopefully just switch out plotting code at some point.
I see the steps as:
This may be redundant with other repositories, although I think the first step is unique to American Gut.
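A minimal sketch of the kind of data dictionary backend described above; the column definitions and method names are purely illustrative:

```python
class DataDictionary(object):
    """Hypothetical metadata backend mapping column names to definitions."""

    def __init__(self, columns):
        # columns: {name: {'dtype': ..., 'description': ...}}
        self.columns = columns

    def validate(self, metadata_row):
        """Check that a metadata row contains only known columns."""
        return set(metadata_row) <= set(self.columns)


# A holding object (e.g. for plotting classes) would interact with this,
# so notebooks only touch the dictionary, not the raw metadata files.
dd = DataDictionary({'AGE': {'dtype': float, 'description': 'age in years'}})
print(dd.validate({'AGE': 30.0}))  # True
```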
This is a bug: when high is of length 1, the list passed in to convert_taxa on the subsequent line is [].
@jwdebelius can you fix this ASAP please?
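For illustration only (the relevant lines aren't reproduced here), this is the general pattern that yields an empty list when high has length 1; the slice shown is an assumption about the code's shape:

```python
# Hypothetical illustration; the actual convert_taxa call is not shown here.
high = ['k__Bacteria; p__Firmicutes']  # `high` of length 1
tail = high[1:]  # [] -- slicing past the only element yields an empty list,
                 # which would then be handed to convert_taxa
```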
Can we include the barcode somewhere in the template? Some people have multiple fecal samples, and they will have no way to know which report goes with which sample (unless they go online and compare the images themselves). Maybe we should include the barcode after or below their name (or somewhere less prominent).
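A minimal sketch in the same macro style as the name, assuming a hypothetical \yourbarcode macro; where and how the template typesets it would still need to be decided:

```latex
\def\yourbarcode{000001234}
% ...then, wherever the name is typeset, something like:
% {\large \yourname}\\ {\small Sample barcode: \yourbarcode}
```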
We now have some test code, so it would be good to have Travis, too.
Quick fix: just need to add runipys, similar to IAB.travis.yml, and set export AG_TESTING=True.
Hi,
I've been struggling with the American-Gut repo and the way I should use it for the past few days. If I understood correctly, the repo is broken into a package (the americangut dir) and auxiliary files. Some of these files are intended to be used by the package itself, while others are for interactive sessions, with IPython notebooks for example.
In #199, @jwdebelius recommends installing the package with pip install -e . --no-deps. Therefore, the americangut dir indeed was intended to be used as a package. Still, this will not install the latex and tests folders from package_data, since setup.py seems to be a bit misconfigured (package_data should be part of the package's source dir).
Also, running (e.g.) 01-get_sequences_and_metadata.md will fail on study_accessions = agenv.get_study_accessions(), since it calls get_repository_dir (from results_utils.py), which will strangely take a part of the full path (outside of the package dir) and try to find 'data' and 'latex' there. Moreover, 'data' isn't even specified in setup.py.
Therefore, I'm not quite sure how I should use the repo. Should I define PYTHONPATH to include the repo and PATH to include scripts, without installing the package, or should I install the package (as recommended by @jwdebelius)? If I need to install it, what else do I need to adjust to make it work (PATHs, PYTHONPATHs, ...)?
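A minimal sketch of the package_data fix being described, assuming the latex, tests, and data folders are moved under the americangut package directory (the glob patterns are illustrative):

```python
from setuptools import setup, find_packages

setup(
    name='americangut',
    version='0.0.1',
    packages=find_packages(),
    # package_data paths are relative to the package directory itself, so
    # latex/, tests/, and data/ must live inside americangut/ to be installed.
    package_data={'americangut': ['latex/*', 'tests/*', 'data/*']},
)
```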
I am new to biom files and tried to download the biom here for testing. I clicked and saved the raw file to my hard disk and tried to open it in R using the phyloseq and biom packages, but it returns:
"Error in fromJSON(content, handler, default.size, depth, allowComments, :
invalid JSON input"
I have copied and pasted an example URL (https://github.com/biocore/American-Gut/blob/master/data/HMP/HMPv35_100nt.biom) into some JSON validators on the web, but they return an error too.
Could you point me to some materials for getting started with the biom files here? Thank you so much!
Regards,
Carol
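One likely cause: saving the file from the /blob/ URL above downloads GitHub's HTML page, not the .biom file itself, which is why JSON validators reject it. A minimal sketch of fetching the raw file and opening it with the biom-format Python package instead (the raw URL follows GitHub's standard blob-to-raw mapping; assumes biom-format >= 2.1):

```python
import urllib  # Python 2, matching the rest of this repo

import biom

# raw.githubusercontent.com serves the file itself, not an HTML page.
url = ('https://raw.githubusercontent.com/biocore/American-Gut/'
       'master/data/HMP/HMPv35_100nt.biom')
urllib.urlretrieve(url, 'HMPv35_100nt.biom')

table = biom.load_table('HMPv35_100nt.biom')
print(table.shape)  # (observations, samples)
```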
See here
@jwdebelius, can you fix please?
The notebook isn't running properly in its binder environment.
I'm receiving this error:
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-r_uybfy_/scikit-bio/
I'm trying to install americangut, but I run into this error:
$ sudo python setup.py install --prefix=/home/directory
...
Processing dependencies for americangut==0.0.1
error: ipython 3.2.3 is installed but ipython>=4.0.0 is required by set(['ipykernel'])
But when I install ipython 4.1.1 and re-run the same command, the americangut installer replaces it with ipython 3.2.3. I think this might be because IPython<4.0.0 is listed in the conda requirements.
Any ideas on how to get this installed? Thanks!
Is it using the new files (e.g., loggedoutheader.psp) that @teravest put together?
Not a pressing issue right now but would be great to have for the time when we need to re-generate these figures.
https://github.com/biocore/American-Gut/blob/master/americangut/generate_otu_signifigance_tables.py#L113 can throw a div by zero error. @jwdebelius, can you check on this?
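For illustration only (the surrounding code isn't reproduced here), a guard of this general shape would avoid the error; the function and variable names are hypothetical:

```python
def safe_ratio(numerator, denominator):
    """Hypothetical guard for the division at the line linked above."""
    if denominator == 0:
        return float('inf')  # or skip/flag the taxon rather than raising
    return numerator / denominator
```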
yoshikivazquezbaeza:American-Gut@master$ python tests/test_taxtree.py
.....F.
======================================================================
FAIL: test_sample_rare_unique (__main__.TaxTreeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/test_taxtree.py", line 28, in test_sample_rare_unique
self.assertEqual(sorted(obs), exp)
AssertionError: Lists differ: [('a', None, [['k__1', 'p__x',... != [('a', None, [['k__1', 'p__x',...
First differing element 2:
('c', None, [], [])
('c', None, [['k__1', 'p__y', 'c__']], [])
[('a',
None,
[['k__1', 'p__x', 'c__'], ['k__1', 'p__y', 'c__3']],
[['k__1', 'p__x', 'c__1'], ['k__1', 'p__x', 'c__2']]),
('b', None, [['k__1', 'p__x', 'c__'], ['k__1', 'p__y', 'c__3']], []),
- ('c', None, [], [])]
+ ('c', None, [['k__1', 'p__y', 'c__']], [])]
----------------------------------------------------------------------
Ran 7 tests in 0.002s
FAILED (failures=1)
The newest version of ipymd (0.1.2) does not work with ipython notebooks, but works fine with jupyter. We should update to the newest versions of ipython and jupyter to take this into account.
QIIME 1.7-dev, GG 13_5, etc
AG_100nt_even10k.biom.gz -> AG_100nt_even10k.biom
The sample 000004972 has:
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__[Paraprevotellaceae]; g__[Prevotella]; s__
showing up as rare, which is misleading.
Under TYPES_OF_PLANTS, there is a single survey result that entered 28 rather than 21-30.
Mind if I submit a PR to fix this? I'm pretty sure it is an error.
python test_generate_otu_signifigance.py
.F.......
======================================================================
FAIL: test_calculate_tax_rank_1 (__main__.GenerateOTUSignifiganceTablesTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_generate_otu_signifigance.py", line 245, in test_calculate_tax_rank_1
self.assertEqual(known_high_10, test_high_10)
AssertionError: Lists differ: [['k__Bacteria; p__Proteobacte... != [['k__Bacteria; p__Proteobacte...
First differing element 0:
['k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterbacteriaceae', 0.002, 7.6e-05, 26.3158, 1.450729834568669e-22]
['k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterbacteriaceae', 0.002, 7.6e-05, 26.0, 1.4507298345686689e-22]
Diff is 671 characters long. Set self.maxDiff to None to see it.
@jwdebelius can you look into this?