biocore / qurro

Visualize differentially ranked features (taxa, metabolites, ...) and their log-ratios across samples

Home Page: https://biocore.github.io/qurro

License: BSD 3-Clause "New" or "Revised" License


qurro's Introduction

Qurro: Quantitative Rank/Ratio Observations

Qurro logo

(Pronounced "churro.")

What does this tool do?

Lots of tools for analyzing 'omic datasets can produce feature rankings. These rankings can be used as a guide for looking at the log-ratios of certain features in a dataset. Qurro is a tool for visualizing and exploring both of these types of data.

What are feature rankings?

The term "feature rankings" includes differentials, which we define as the estimated log-fold changes for features' abundances across different sample types. You can get this sort of output from lots of "differential abundance" tools, including but definitely not limited to ALDEx2, Songbird, Corncob, DESeq2, edgeR, etc.

The term "feature rankings" also includes feature loadings in a biplot (see Aitchison and Greenacre 2002); you can get biplots from running DEICODE, which is a tool that works well with microbiome datasets, or from a variety of other methods.

Differentials and feature loadings alike can be interpreted as rankings -- you can sort them numerically to "rank" features based on their association with some sort of variation in your dataset.
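As a minimal sketch of what "ranking" means here (the column names and values below are made up for illustration), sorting a differentials table numerically is all it takes:

```python
import pandas as pd

# Hypothetical differentials: one estimated log-fold change per feature.
diffs = pd.DataFrame(
    {"Intercept": [0.1, -0.2, 0.5], "healthy_vs_sick": [2.3, -1.7, 0.4]},
    index=["FeatureA", "FeatureB", "FeatureC"],
)

# "Ranking" features just means sorting them by a chosen differential:
# the lowest-ranked features are most associated with one end of the
# covariate, the highest-ranked with the other.
ranked = diffs.sort_values("healthy_vs_sick")
print(ranked.index.tolist())  # ['FeatureB', 'FeatureC', 'FeatureA']
```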

What can we do with feature rankings?

A common use of these rankings is examining the log-ratios of particularly high- or low-ranked features across the samples in your dataset, and seeing how these log-ratios relate to your sample metadata (e.g. "does this log-ratio differ between 'healthy' and 'sick' samples?"). For details as to why this approach is useful, check out this open access paper.
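The core computation is small; here is a sketch with a made-up two-feature table (feature and sample names are illustrative, not from any real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: counts of two features across four samples.
table = pd.DataFrame(
    {"S1": [10, 2], "S2": [50, 5], "S3": [3, 30], "S4": [1, 40]},
    index=["HighRankedFeature", "LowRankedFeature"],
)

# One log-ratio value per sample; this is the quantity that can then be
# compared against sample metadata (e.g. a "healthy"/"sick" column).
log_ratio = np.log(table.loc["HighRankedFeature"] / table.loc["LowRankedFeature"])
print(log_ratio.round(2))  # e.g. S1 is log(10/2) ~= 1.61
```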

How does this tool help?

Qurro is an interactive web application for visualizing feature rankings and log-ratios. It does this using a two-plot interface: on the left side of the screen, a "rank plot" shows how features are ranked for a selected ranking, and on the right side of the screen a "sample plot" shows the log-ratios of selected features' abundances within samples. There are a variety of controls available for selecting features for a log-ratio, and changing the selected log-ratio updates both the rank plot (highlighting selected features) and the sample plot (changing the y-axis value of each sample to match the selected log-ratio).

A paper describing Qurro is now available at NAR Genomics and Bioinformatics here.

How do I use this tool?

Qurro can be used standalone (as a Python 3 script that generates a folder containing an HTML/JS/CSS visualization) or as a QIIME 2 plugin (which generates a QZV file that can be visualized at view.qiime2.org or by using qiime tools view).

Qurro is still being developed, so backwards-incompatible changes might occur. If you have any bug reports, feature requests, questions, or if you just want to yell at me, then feel free to open an issue in this repository!

Demos

See the Qurro website for a list of interactive demos using real datasets.

Screenshot: Visualizing KEGG orthologs in metagenomic data from the Red Sea

Screenshot showing a Qurro visualization of ranked features (which in this dataset correspond to KEGG orthologs) and a scatterplot of the log-ratio of certain features' abundances in samples.

This visualization (which uses data from this study, with differentials generated by Songbird) can be viewed online here.

Installation and Usage

You can install Qurro using pip or conda.

System requirements

If you're using Qurro within QIIME 2, you will need a QIIME 2 version of at least 2020.11.

If you're using Qurro outside of QIIME 2, you will need a Python version of at least 3.6 and less than 3.10.

In either case, Qurro should work with most modern web browsers; Firefox or Chrome are recommended.

Installing with pip

pip install cython "numpy >= 1.12.0"
pip install qurro

Installing with conda

conda install -c conda-forge qurro

Temporary Caveat

Certain characters in column names in the sample metadata, feature metadata (if passed), and feature differentials (if passed) will be replaced with similar characters or just removed entirely:

Old Character(s)    New Character
.                   :
]                   )
[                   (
\                   |
' or "              Nothing

This is due to some downstream issues with handling these sorts of characters in field names. See this issue for context.
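The replacement table above can be sketched in a few lines of Python (this is an illustration of the rules as documented, not Qurro's actual sanitization code; the function name is made up):

```python
# Character replacements as described in the table above.
REPLACEMENTS = {".": ":", "]": ")", "[": "(", "\\": "|", "'": "", '"': ""}

def sanitize_column_name(name: str) -> str:
    """Apply each documented replacement to a metadata column name."""
    for old, new in REPLACEMENTS.items():
        name = name.replace(old, new)
    return name

print(sanitize_column_name("pH [soil].'raw'"))  # -> pH (soil):raw
```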

Tutorials

In-depth tutorials

These tutorials are all good places to start, depending on what sort of data and feature rankings you have.

  • Color Composition tutorial

    • Data Summary: Color composition data from abstract paintings
    • Feature rankings: Feature loadings in an arbitrary compositional biplot
    • Qurro used through QIIME 2 or standalone?: Standalone
  • "Moving Pictures" tutorial

    • Data Summary: Microbiome 16S rRNA marker gene sequencing data from four types of body site samples
    • Feature rankings: Feature loadings in a DEICODE biplot
    • Qurro used through QIIME 2 or standalone?: QIIME 2
  • Transcriptomics tutorial

    • Data Summary: Gene expression ("RNA-Seq") data from TCGA tumor and "solid tissue normal" samples
    • Feature rankings: ALDEx2 differentials
    • Qurro used through QIIME 2 or standalone?: Standalone

Selection tutorial

There are a lot of different ways to select features in Qurro, and the interface can be difficult to get used to. This document describes all of these methods, and provides some examples of where they could be useful in practice.

Basic command-line tutorials

These tutorials show examples of using Qurro in identical ways both inside and outside of QIIME 2.

Qarcoal

Qarcoal (pronounced "charcoal") is a new part of Qurro that lets you compute log-ratios based on taxonomic searching directly from the command-line. This can be useful for a variety of reasons.

Currently, Qarcoal is only available through Qurro's QIIME 2 plugin interface. Please see qarcoal_example.ipynb for a demonstration of using Qarcoal.

Poster

We presented this poster on Qurro at the 2019 CRISP Annual Review. The data shown here is already slightly outdated compared to the actual Qurro paper (e.g. the differentials are slightly different), but feel free to check out the poster anyway!

Acknowledgements

Dependencies

Code files for the following projects are distributed within qurro/support_files/vendor/. See the qurro/dependency_licenses/ directory for copies of these software projects' licenses (each of which includes a respective copyright notice).

The following software projects are required for Qurro's python code to function, although they are not distributed with Qurro (and are instead installed alongside Qurro).

Testing Dependencies

For python testing/style checking, Qurro uses pytest, pytest-cov, flake8, and black. You'll also need to have QIIME 2 installed to run most of the python tests (note that, due to click vs. black vs. QIIME 2 dependency issues, you should use a QIIME 2 environment of at least 2022.8; see CONTRIBUTING.md for details).

For JavaScript testing/style checking, Qurro uses Mocha, Chai, mocha-headless-chrome, nyc, jshint, and prettier.

Qurro also uses GitHub Actions and Codecov.

The Jupyter notebooks in Qurro's example_notebooks/ folder are also automatically rerun using nbconvert.

Data Sources

The test data located in qurro/tests/input/mackerel/ were exported from QIIME 2 artifacts in this repository. These data are from Minich et al. 2020 [1].

The test data located in qurro/tests/input/byrd/ are from this repository. These data, in turn, originate from Byrd et al.'s 2017 study on atopic dermatitis [2].

The test data located in qurro/tests/input/sleep_apnea/ (and in example_notebooks/DEICODE_sleep_apnea/input/) are from this Qiita study, which is associated with Tripathi et al.'s 2018 study on sleep apnea [4].

The test data located in qurro/tests/input/moving_pictures/ (and in example_notebooks/moving_pictures/data/) are from the QIIME 2 moving pictures tutorial. The ordination files in these folders were computed based on the DEICODE moving pictures tutorial. These data (sans the DEICODE ordination) are associated with Caporaso et al. 2011 [5].

Lastly, the data located in qurro/tests/input/red_sea (and in example_notebooks/songbird_red_sea/input/, and shown in the screenshot above) were taken from Songbird's GitHub repository in its data/redsea/ folder, and are associated with Thompson et al. 2017 [3].

Logo

Qurro's logo was created using the Lalezar font. Also, shout out to this gist for showing how to center images in GitHub markdown files (which is more of a hassle than it sounds).

Special Thanks

The design of Qurro was strongly inspired by EMPeror and q2-emperor, along with DEICODE. A big shoutout to Yoshiki Vázquez-Baeza for his help in planning this project, as well as to Cameron Martino for a ton of work on getting the code in a distributable state (and making it work with QIIME 2). Thanks also to Jamie Morton, who wrote the original code for producing rank and sample plots from which this is derived.

And thanks to a bunch of the Knight Lab for helping name the tool :)

Citing Qurro

If you use Qurro in your research, please cite it! The preferred citation for Qurro is this manuscript at NAR Genomics and Bioinformatics. Here's the BibTeX:

@article {fedarko2020,
    author = {Fedarko, Marcus W and Martino, Cameron and Morton, James T and González, Antonio and Rahman, Gibraan and Marotz, Clarisse A and Minich, Jeremiah J and Allen, Eric E and Knight, Rob},
    title = "{Visualizing ’omic feature rankings and log-ratios using Qurro}",
    journal = {NAR Genomics and Bioinformatics},
    volume = {2},
    number = {2},
    year = {2020},
    month = {04},
    issn = {2631-9268},
    doi = {10.1093/nargab/lqaa023},
    url = {https://doi.org/10.1093/nargab/lqaa023},
    note = {lqaa023},
    eprint = {https://academic.oup.com/nargab/article-pdf/2/2/lqaa023/33137933/lqaa023.pdf},
}

References

[1] Minich, J. J., Petrus, S., Michael, J. D., Michael, T. P., Knight, R., & Allen, E. E. (2020). Temporal, environmental, and biological drivers of the mucosal microbiome in a wild marine fish, Scomber japonicus. mSphere, 5(3), e00401-20. Link.

[2] Byrd, A. L., Deming, C., Cassidy, S. K., Harrison, O. J., Ng, W. I., Conlan, S., ... & NISC Comparative Sequencing Program. (2017). Staphylococcus aureus and Staphylococcus epidermidis strain diversity underlying pediatric atopic dermatitis. Science Translational Medicine, 9(397), eaal4651. Link.

[3] Thompson, L. R., Williams, G. J., Haroon, M. F., Shibl, A., Larsen, P., Shorenstein, J., ... & Stingl, U. (2017). Metagenomic covariation along densely sampled environmental gradients in the Red Sea. The ISME Journal, 11(1), 138. Link.

[4] Tripathi, A., Melnik, A. V., Xue, J., Poulsen, O., Meehan, M. J., Humphrey, G., ... & Haddad, G. (2018). Intermittent hypoxia and hypercapnia, a hallmark of obstructive sleep apnea, alters the gut microbiome and metabolome. mSystems, 3(3), e00020-18. Link.

[5] Caporaso, J. G., Lauber, C. L., Costello, E. K., Berg-Lyons, D., Gonzalez, A., Stombaugh, J., ... & Gordon, J. I. (2011). Moving pictures of the human microbiome. Genome Biology, 12(5), R50. Link.

License

This tool is licensed under the BSD 3-clause license. Our particular version of the license is based on scikit-bio's license.

qurro's People

Contributors: antgonza, cameronmartino, fedarko, gibsramen

qurro's Issues

Properly handle empty textual numerator/denominator fields

If either the numerator or denominator's text field (from the multi selections) are empty, handle it accordingly: JS treats "" as being in every string, so we have to figure out a workaround for that. Just checking if the trimmed input is "" -- and if so, giving the user an error -- should be sufficient.
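Python shares this pitfall with JS, which makes the proposed guard easy to sketch (the function and names below are hypothetical, not Qurro's actual code):

```python
# The empty string is "contained" in every string, so an empty search
# field would match all features unless it is rejected first.
assert "" in "Staphylococcus"

def matches(query: str, feature_name: str) -> bool:
    query = query.strip()
    if query == "":
        # In the UI this would surface an error instead of silently
        # matching everything.
        return False
    return query in feature_name

print(matches("  ", "Staphylococcus"))      # False: trimmed-empty input rejected
print(matches("coccus", "Staphylococcus"))  # True
```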

Sample plot legend text cut off?

To replicate: produce a standalone rrv plot with the category set to "age" for the DEICODE pseudomonas dataset.

TODOs for diagnosing the problem:

  • check if generating a Q2 plot also exhibits this error (I would imagine so, but you never know)
    • Yeah, the Q2 plot has this as well.
  • check if other numeric fields also exhibit this error
  • try replicating this with smaller datasets
  • look into if other people have had this problem with vega/vega-lite/vega-embed/altair

Weirdly enough, changing the config width/height fields of the sample plot JSON doesn't seem to fix this: making them really huge just makes the chart larger, and the legend text is still cut off. Resizing the samplePlot div via CSS doesn't seem to change anything, either (the sample plot retains its set size via the JSON config).

Add tests/validation

This should validate that the sample plot JSON generated by gen_plots.py matches the JSON file generated by the old Jupyter Notebook I was doing testing on. (That'll be a bit complicated by the fact that gen_plots.py now doesn't make a default example, but it should be possible to ignore that in parsing.) The python json library should be useful for that.

This should also test:

  • that the sample log ratios make sense
  • every command-line option we add to gen_plots.py (sample exclusion, data sources, etc.) works as intended
  • interactions in JS (linking is enforced at every step)
  • different data sets
    • the ideal thing here would be automating a test process using Songbird, DEICODE, or another tool that produces taxon ranks where we go from the inputs to the input data for RankRatioViz to the output visualizations. (Of course, if that would necessitate a prohibitive amount of time, we can just substitute the output of Songbird/DEICODE/etc. as an input for tests.) Update: the Jupyter Notebook takes the place of the automated ranks-to-viz approach, and our test process currently includes both DEICODE and songbird data.
    • also the Red Sea dataset (according to Jamie) would be good here
  • malformed input: e.g.
    • negative or non-numeric abundances
    • no abundances at all
    • no samples given
    • empty taxa (or no ranks, or periodic empty lines, etc.)
      • especially combinations: e.g. a properly formatted numerator and an empty text denominator, vice versa, empty both fields, properly formatted both fields, etc.
    • attempts at code injection somehow
      • The JS runs in the client side, and I'm pretty sure my code isn't vulnerable to this, but it's worth double-checking. Also if we decide to make this into an actual server-side application where the user can upload their files and we run gen_plots on the backend, then it becomes a more significant issue.
    • A lot of this should be detected in gen_plots, but I suppose it's worth adding some basic checks to the client-side JS in case the JSONs get damaged/corrupted/etc. in between their generation and their loading.
    • "non-matching" inputs (e.g. samples in the metadata file aren't in the BIOM table; taxa in the rank file aren't in the BIOM table).
      • some of this should be acceptable. for example, there are taxa in the BIOM table that aren't described in the rank plot for the given Byrd sample data -- the bigger problem is stuff not being in the BIOM table I guess. But take some time to think this over.
  • boxplot functionality! OK to just check that the sample plot JSONs being generated are reasonable.
  • new ranking functionality (check that the window transform works?)
  • bar width controls?
    • check that the warning pops up when widths of < 1 pixel are used?
  • probably more things that I'm forgetting and will add here later
  • That filters are being updated properly?
  • That multi-feature selection updates balances and feature classifications properly

Make bars in rank plot perfectly adjacent

Ideally the rank plot wouldn't have any horizontal space between bars. However, through zooming in, it seems like it's always going to eventually devolve into each bar being a disconnected line. I've tried a few different configuration options in Altair to try to fix this and haven't been successful yet. Here's a recreation of some unsuccessful attempts for reference:

.mark_bar(binSpacing=0).encode(...)

.configure_scale(
    bandPaddingInner=0,
    bandPaddingOuter=0,
    continuousPadding=0,
    pointPadding=0
)

Add option to take log geometric mean of abundances when using multiple taxa

Jamie's suggestion.

Edit: for reference, this would let Qurro use the same definition of "balance" as is shown in the Selbal paper.

This should be doable in Qurro's JS by replacing the calls in updateBalanceMulti() to sumAbundancesForSampleFeatures with separate calls to a new function (maybe we could call it geometricMeanOfSampleFeatures?) that computes the geometric mean of the selected feature counts for a given sample.

For the UI, I expect this would be easiest to do in such a way that changes to this setting take effect upon the next time a log-ratio is computed? Or we could store state info about the "previous" log-ratio and, on a change, re-update the sample plot.

It would also be good to make note of this in the sample plot, either in the y-axis or below the plot somehow -- this way people don't accidentally use results from "sum" log-ratios when they think they're actually looking at "geom. mean" log-ratios, or vice versa. I think this might necessitate destroying then recreating the sample plot in order to update the y-axis :| but that's doable, I guess.
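A sketch of the difference between the two definitions, using made-up counts for a single sample (the helper name mirrors the geometricMeanOfSampleFeatures suggestion above, but nothing here is Qurro's actual code):

```python
import numpy as np

# Hypothetical counts of the selected features in one sample.
numerator_counts = np.array([4.0, 9.0])
denominator_counts = np.array([1.0, 16.0])

# Current behavior (sketch): log of the ratio of summed abundances.
log_ratio_sum = np.log(numerator_counts.sum() / denominator_counts.sum())

def log_geometric_mean(x):
    # log(geometric mean) == mean of the logs
    return np.mean(np.log(x))

# Proposed behavior: a Selbal-style balance, i.e. the difference of the
# log geometric means of the two feature groups.
balance = log_geometric_mean(numerator_counts) - log_geometric_mean(denominator_counts)
print(balance)  # log(6/4), since the geometric means are 6 and 4
```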

Make text searches case insensitive

Don't think there's any good reason to preserve case sensitivity.

Should be easy to implement -- just convert inputs to lowercase, and convert everything you search over at any given moment to lowercase. (I guess we should preserve the case of the taxa listed in the rank file/in the BIOM table, but that is going to make us consistently re-lowercase stuff which will lead to a slight performance decrease.)

idea c/o Lisa
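A minimal sketch of the approach described above (feature names are illustrative): lowercase the feature names once up front, search against the lowercased copies, and keep the originals for display.

```python
# Original-case names are preserved for display in the rank plot.
features = ["Bacteroides fragilis", "Staphylococcus aureus"]
# Lowercased once, so searches don't re-lowercase on every keystroke.
features_lower = [f.lower() for f in features]

def search(query: str):
    q = query.lower()
    return [features[i] for i, f in enumerate(features_lower) if q in f]

print(search("STAPH"))  # ['Staphylococcus aureus']
```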

Create server-side interface?

Where users could upload their files (biom table, ranks, metadata, etc.) and the backend runs gen_plots or something to produce visualizations without the user having to do any extra command-line stuff.

This'd necessitate a thorough look at security, efficiency, etc., as well as the resources needed to host this, so keeping the current distributed model for this (or just hosting the HTML stuff but making users run gen_plots themselves so that the JS is all client-side) is probably much more realistic in the short term.

Use clearer labels for field headers in tooltips

Especially in the sample plot -- column mapping means that it just says "0" for the sample ID field, but something clearer like "sample ID" should be displayed. Maybe we can access the tooltip in advance or something? Look into Vega-Embed and Vega/Vega-Lite's options.

Accept metabolite (GNPS) feature metadata, and validate this in tests

At least with the Red Sea metabolite data used in the QIIME 2 tutorial on Songbird, this seems to just be a TSV file where each line contains a metabolite (feature) ID, a "cluster," and a description. I don't know if the "cluster" is important (maybe?) but, in any case, making this work should be as simple as adjusting the code in rankratioviz.generate.process_input() that reads taxam (feature metadata) to create labels for each taxon/metabolite (and using the correct column names for the metabolites). It might be easiest to just have the user pass in a --metabolite-data flag or something, so we know how to read the feature metadata file properly [1]. Update: yeah, these features are KEGG orthologs, not metabolites. My bad for not reading that paper in detail. But it's still cool that this supports those types of features!

So what we actually want to do for legit metabolomics feature metadata, from what I can tell, involves handling the feature ID used in the BIOM table/ranks input as a combo of the mass-to-charge-ratio and discharge time, and extracting the Library ID from a feature metadata file in which both parts of the feature ID are their own columns.

To validate this, we should at least

  1. include metabolite data in the test suite (#2) and check that labels are being properly created, and
  2. check this against lots of metabolite data to ensure that we're parsing things according to whatever standards for metabolite data file formats/metabolite feature IDs/etc. exist, and that this tool is robust enough to be useful in all of these contexts.

Support altering which ranks are used in the browser

i.e. for the Byrd dataset, this would be supporting something besides log(PostFlare/Flare) + K.

Ideally this would entail us storing all of the ranks for every taxon/metabolite in the rank plot JSON, and then just altering around which rank(s)/combination of ranks is used in the rank plot. This should also alter the y-axis title of the rank plot accordingly.

In terms of actual code changes: right now RRV just gets the first rank column (since the rank_col parameter of gen_rank_plot() is set to 0 when both the standalone and Q2 scripts call it). I guess in order to fit an arbitrary number of ranks into the rank plot JSON, we'd just need to throw in multiple coefs Series (we'd have to name these in a consistent way -- maybe ranks0, ranks1, ranks2, ... or coefs0, coefs1, coefs2, ... -- it doesn't really matter as long as it's consistent).

In terms of the JS side of things, the user would be presented with a list of all available ranks in the browser (ideally with a nice description, but I don't think we're guaranteed to have that in the OrdinationResults so this might end up just being an Emperor-esque thing where you have a list of ranks and their "proportion explained" -- this would be a cool place to use a Scree plot as Emperor does). The user could then select a rank column to use, and the rank plot would thus change accordingly.

Empty plots produced

I think this is due to a mismatch of input data somehow -- e.g. samples in the biom table don't relate to the metadata file, or features in the biom table don't relate to the taxonomy file, or ranks in the ordination file don't relate to features.

I'm experiencing this with the Byrd skin data when thrown into Q2, and in other cases. I'll try to figure out what's behind this.

"Assert" various properties of the input

  • No feature/sample/feature metadata/sample metadata ID is used twice
  • Same amount of ranks per feature
    • might be easiest to just ensure each feature has a ranking.
    • Looks like skbio.OrdinationResults.read() handles the feature loading formatting stuff -- so we'd just need to validate the differentials-reading, i.e. in _rank_utils.differentials_to_df().
  • Each feature given in the input ranks matches up to an entry in the feature metadata/biom table
  • At least one sample given in the input sample metadata matches up to an entry in the biom table
  • At least two features
  • At least one sample
  • At least one sample metadata field

There are definitely more of these that I'm not thinking of. Each of these assertions should be accompanied by a test (#2) to ensure that they trigger properly in invalid cases.

I imagine a lot of the QIIME 2 niceness will reduce the amount of times we run into these sort of problems, but the more of these we can anticipate the stronger this codebase will be. (Plus we can't rely on Q2 for the code that's shared between the standalone and Q2 versions of rankratioviz.)
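A few of the assertions above can be sketched with toy inputs (everything below is illustrative -- the real checks would live in the shared input-processing code):

```python
import pandas as pd

# Toy inputs standing in for the real ranks / metadata / BIOM table.
ranks = pd.DataFrame({"rank0": [1.0, -2.0]}, index=["F1", "F2"])
sample_metadata = pd.DataFrame({"BodySite": ["gut"]}, index=["S1"])
table_features = {"F1", "F2", "F3"}
table_samples = {"S1", "S2"}

assert ranks.index.is_unique, "duplicate feature IDs in ranks"
assert len(ranks.index) >= 2, "need at least two ranked features"
assert set(ranks.index) <= table_features, "ranked feature missing from table"
assert len(set(sample_metadata.index) & table_samples) >= 1, \
    "no metadata samples present in the table"
assert sample_metadata.shape[1] >= 1, "need at least one metadata field"
print("all input checks passed")
```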

Try to store metadata and abundances in different datasets

There's technically a chance a metadata column name and taxon name will clash. At the bare minimum, we should be able to detect if this is the case in generate.py and raise an error. Ideally, though, this shouldn't be a problem, and metadata and taxon info should be distinct.
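The bare-minimum detection is a set intersection (names below are made up; in generate.py this would raise an error rather than print):

```python
# Hypothetical metadata column names and feature (taxon) IDs.
metadata_columns = {"BodySite", "Age", "Taxon1"}
feature_ids = {"Taxon1", "Taxon2", "Taxon3"}

clashes = metadata_columns & feature_ids
if clashes:
    # The real code would raise an error here instead of printing.
    print(f"Name clash detected: {sorted(clashes)}")
```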

Ensure that interactions work when done in piecemeal

Ensure interactions work when the user cancels an action to do something else. e.g. they click to select a numerator in the rank plot, then do stuff with the multi selection stuff, then click to select a denominator in the rank plot. I think this works ok now, but double check. Probably good to add test case(s) for this especially.

To make this clear to the user, it'd be a good idea to add some helper UI elements indicating what is going on (e.g. "new numerator of [xyz] selected. select another rank to add a new denominator for the log ratio and update the sample scatterplot." or something).

Iron out which minimum dependency versions, exactly, are needed

Current dependencies (for which this is a concern):

  • altair
  • biom-format
  • click
  • pandas
  • pytest (not necessarily required, but I think this is a good idea to require anyway so the user can verify their installation is working if desired)
    • update: this, pytest-cov, flake8, and black can be saved as dev requirements in setup.py
  • scikit-bio
  • numpy

To get context about how each of these libraries is used within rankratioviz, you can navigate to the root directory of the repo and run grep -ri "alt\." * -A 2, where alt is just the name of the module as imported (e.g. skbio for scikit-bio, or pd for pandas). The -A option can be configured to show context after matching occurrences, which helps us figure out not only which module functions are called but also which arguments of those functions are used. This isn't a perfect overview -- e.g. if a "from [module] import [thing]" statement is used this won't catch it, or if you declare an object of type [module.thing] this won't show all the ways that object is used -- but it's a solid start.

(also, note that that search command might fail if you somehow invoke a module on one line and then invoke its function on the next line -- is that even doable in python? -- but we're following flake8 on this project so i don't think that'll happen here)

Make updates to textareas propagate to sample plot and rank plot

This would be really cool (and would let the user do stuff like drop out individual taxa to fine tune an analysis).

Probably would be easiest to add an "update" button or something, or make pressing enter apply this change -- doing this as the user types would be slow and imprecise I think

Consider moving plot JSON generation to JavaScript

This would remove the need for running the gen_plots.py script, which would improve convenience. But I'm not sure some of the more intensive operations in that script (reading some large files, performing DataFrame merges, etc) would be suitable for JS. (Also I don't think biom files can be read in JS now -- see biocore/biom-format#699 -- so that might disqualify this.)

Avoid hardcoding microbe column start point: support arbitrary amounts of sample metadata

We use a hardcoded value of 24 to determine the minimum column position to search through when doing textual queries on taxa (so that we don't consider metadata/etc. columns as taxa to search).

However, this tool should support arbitrary amounts of metadata, so this value should not be hardcoded in this way. At the very least, it should be a parameter the user can define -- however, I think we can programmatically determine this and then store it somewhere in the sample plot JSON in gen_plots.py (before the second pd.merge() operation, right?).

The rank plot doesn't include the taxon at column ID 3

There should be 609 taxa in the deicode pseudomonas dataset (608 - 0 + 1 = 609), and the ordination file confirms this (it says there are 609 species).

Diagnosed problem via:

var taxa = Object.keys(ssmv.samplePlotJSON["datasets"]["col_names"]);
for (var i = 0; i < taxa.length; i++) {
    if (!ssmv.topTaxa.includes(taxa[i])) {
        console.log(i);
        console.log(taxa[i]);
    }
}

in browser console, for a standalone rrv plot, while trying to fix one of the later problems in #40.

Allow selection of taxa by explicit taxonomic ranks

  • "Contains the rank," irrespective of what level it is
  • Domain
  • Kingdom
  • Phylum (or Division -- apparently the two terms are kinda synonymous)
  • Class
  • Order
  • Family
  • Genus
  • Species

Alternately, just allow users to specify ranks to be searched for (delineated by spaces or commas or semicolons or whatever), and just search for those ranks without making any guarantees as to their actual level. That's probably the more convenient solution due to discrepancies in classifications.
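The token-based alternative can be sketched like this (illustrative only; it assumes semicolon-delimited taxonomy strings, and the function name is hypothetical):

```python
# Match whole rank tokens rather than substrings, making no guarantees
# about what taxonomic level each token corresponds to.
def has_rank(taxonomy: str, rank: str) -> bool:
    tokens = [t.strip() for t in taxonomy.split(";")]
    return rank in tokens

lineage = "k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales"
print(has_rank(lineage, "p__Firmicutes"))  # True
print(has_rank(lineage, "Firmicutes"))     # False: not an exact token
```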

Resolve temporary solution of using altair=2.2.2

Current options (if vega/altair#1262 gets fixed then we can just continue using the most recent version of altair on conda-forge, as we have been):

  • Add jinja2 to environment.yml
  • Use pdvega (this would probably make installing more lightweight, at the cost of the code getting a bit less clean + changing things around getting a bit harder due to the need for more direct JSON dict manipulations)
  • Keep altair version at 2.2.2 (not a good long-term option IMO, since this makes us lose out on future fixes/updates to altair)

Add copyright heading to all code files

It's in a decent number of files, but should be in all (or at least all nontrivial) files. Ideally this would include the HTML/CSS/JS files that comprise a visualization.

Abstract out shared Q2 and standalone code

Similar to how q2-Emperor handles certain things by just delegating them to Emperor. This doesn't have to mean making a new package or whatever, though -- just maybe a shared module or something.

Make both songbird beta.csv and DEICODE ordination files fully supported

Eventually songbird might output some sort of Q2 artifact (see mortonjt/songbird#32), but for the time being just accept beta.csv files.

I've already written some basic code to convert ordination.txt and beta.csv files into the same format (just a dict mapping feature IDs to a tuple of all ranks given for these features). It's only used in the test code right now, though. Need to integrate this into the main part of rankratioviz, and then we should be good. (Still might need to add in some extra logic to smartly distinguish between songbird and DEICODE output, or we can just make the user add a --deicode or --songbird flag or something.)

Full support for metabolite data

Just setting this issue up to hold all the various miscellaneous tasks necessary to support metabolite data. (Large tasks should still get their own issues.)

A few I can think of now:

  • Changing x-axis of rank plot to not say "taxa"
  • Changing search functionality re: metabolite naming conventions
  • Changing search labels ("Filter numerator to taxa that...") to not say "taxa"
  • Not saying "Microbe Selections", etc

Maybe saying "taxa/metabolites" or "microbes/metabolites" is a good compromise. Or we could have the user specify in the input command whether metabolites or taxa are involved, and adjust all these UI elements accordingly... but there very well may be some datasets where both metabolites and taxa are ranked, so that might be an ineffective solution.

Rank search is broken?

When searching by exact rank, if you give it an entire rank name (including taxonomy, confidence, and sequence) it'll highlight everything (or almost everything).

Text search seems to be working though.

Figure out how to fix this and fix it. Add a test case for it also.

More abstraction of fields used in python script

Includes:

  • Rank Plot Y-axis title
  • Sample Plot X-axis title
  • Sample Plot X-axis field used from metadata table
  • Sample Plot point colorations
    • Autodetect and use all available options to create a color scheme? ideally, this would optionally accept a user-provided mapping of values to colors, but that might be approaching overkill.
  • Rank Plot column to use for ranks
    • There's code stashed in my folder for this repo that IIRC mostly has this done -- need to get that out and push it here

Support sample exclusion in the browser

Instead of in the python script.

This should be doable -- just set the excluded sample values in the log ratio to NaN, similarly to how we do that when either part of the log ratio's abundance is 0.
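A sketch of the idea with made-up per-sample values (this mirrors the existing zero-abundance handling described above, but none of this is Qurro's actual code):

```python
import numpy as np

# Hypothetical per-sample log-ratios, one entry per sample.
log_ratios = np.array([1.2, -0.5, 0.8, 2.0])
# Samples the user chose to exclude in the browser.
excluded = np.array([False, True, False, False])

# Setting excluded samples to NaN drops them from the sample plot, the
# same way samples with a zero numerator or denominator are handled.
log_ratios[excluded] = np.nan
print(log_ratios)  # excluded sample is now NaN
```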

Add Vega, Vega-Lite, Vega-Embed in a "vendor" directory

Analogous to Emperor's support_files/vendor directory (see here). This lets rankratioviz be used without needing an internet connection, since it removes the need to access the CDN for these libraries. (And Python 3's http.server module works without an internet connection, as far as I can tell.)

This also makes rankratioviz a bit more polite, because it won't have to make any more requests to the various Vega CDN URLs. The downsides we should be aware of here are:

  • Licensing (we should probably distribute the Vega/Vega-Lite/Vega-Embed licenses with their JS code, or at least link to the licenses from the README).
  • Version stagnation (since we'd need to manually update the Vega/Vega-Lite/Vega-Embed versions, rather than just auto-fetching the latest). There is probably an automatic way to handle this, but I would imagine that as long as we periodically check back and update these versions it shouldn't be a major problem (but worth consulting with other people re: how they've handled this problem in the past).
