Analyse MalariaGEN data from Python

Home Page: https://malariagen.github.io/malariagen-data-python/latest/

License: MIT License


malariagen-data-python's Introduction

malariagen_data - analyse MalariaGEN data from Python

This Python package provides methods for accessing and analysing data from MalariaGEN.

Installation

The malariagen_data Python package is available from the Python package index (PyPI) and can be installed via pip, e.g.:

pip install malariagen-data

Documentation

Documentation of the classes and methods in the public API is available from the following locations:

Release notes (change log)

See GitHub releases for release notes.

Developer setup

To get set up for development, see this video and the instructions below.

Fork and clone this repo:

git clone git@github.com:[username]/malariagen-data-python.git

Install Python, e.g.:

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.9 python3.9-venv

Install pipx, e.g.:

python3.9 -m pip install --user pipx
python3.9 -m pipx ensurepath

Install poetry, e.g.:

pipx install poetry==1.8.2 --python=/usr/bin/python3.9

Create development environment:

cd malariagen-data-python
poetry env use 3.9
poetry install

Activate development environment:

poetry shell

Install pre-commit and pre-commit hooks:

pipx install pre-commit --python=/usr/bin/python3.9
pre-commit install

Run pre-commit checks (isort, black, blackdoc, flake8, ...) manually:

pre-commit run --all-files

Run fast unit tests using simulated data:

poetry run pytest -v tests/anoph

To run legacy tests which read data from GCS, you'll need to install the Google Cloud CLI. E.g., if on Linux:

./install_gcloud.sh

You'll then need to obtain application-default credentials, e.g.:

./google-cloud-sdk/bin/gcloud auth application-default login

Once this is done, you can run legacy tests:

poetry run pytest --ignore=tests/anoph -v tests

Tests will run slowly the first time, as data required for testing will be read from GCS. Subsequent runs will be faster as data will be cached locally in the "gcs_cache" folder.

Release process

Create a new GitHub release. That's it. This will automatically trigger publishing of a new release to PyPI and a new version of the documentation via GitHub Actions.

The version switcher for the documentation can then be updated by modifying the docs/source/_static/switcher.json file accordingly.

malariagen-data-python's People

Contributors

ahernank, alimanfoo, cclarkson, harbi811, jonbrenas, kathryn1995, kellylbennett, leehart, nkran, nw20, sanjaynagi


malariagen-data-python's Issues

Add data variables to Ag3.snp_dataset()

Proposed to add some additional variables to the dataset returned by the Ag3.snp_dataset() method, including:

  • sample_... variables with sample metadata
  • call_GQ, call_AD, call_MQ
  • variant_filter_pass_... with site filters arrays

Reverse expected/actual in tests

The convention for pytest assertions seems to be to put the actual value first and the expected value second, the opposite of what I previously thought. PyCharm follows this convention in how it reports errors. So, to make the pytest and PyCharm outputs easier to interpret, we should swap the compared variables in all test assertions.

`gene_cnv_frequencies` PerformanceWarning: DataFrame is highly fragmented

We get four warnings about fragmented pandas dataframes when we use the cohort sets (due to how many columns we are building I think) - would be good to fix the code to avoid these.

    /home/conda/store/4b0d6587ea35727c87000368f18c95bf1e775a25ab9791007f1cca148f9a452c-binder-v3.2.0/lib/python3.8/site-packages/malariagen_data/ag3.py:1755: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
      df[f"{coh}_amp"] = amp_freq_coh

The same warning is raised at ag3.py lines 1746, 1747 and 1756, for the `df[f"{coh}_amp"] = np.nan`, `df[f"{coh}_del"] = np.nan` and `df[f"{coh}_del"] = del_freq_coh` assignments.
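The fix the warning suggests can be sketched as follows. This is an illustrative example, not the actual ag3.py code: the cohort names and frequency values are placeholders, and the point is only to show collecting all new columns first and joining them in a single pd.concat call rather than inserting one column at a time.

```python
import numpy as np
import pandas as pd

# Illustrative cohort names and gene table (not real data).
cohorts = ["ML-2_gamb_2014", "BF-09_colu_2014"]
base = pd.DataFrame({"gene_id": ["AGAP004070", "AGAP006028"]})

# Collect all new columns in a dict instead of repeated df[...] = ...
freq_cols = {}
for coh in cohorts:
    # Placeholder frequencies; real code would compute amp/del freqs.
    freq_cols[f"{coh}_amp"] = np.zeros(len(base))
    freq_cols[f"{coh}_del"] = np.zeros(len(base))

# A single concat avoids the repeated-insert fragmentation warning.
df = pd.concat([base, pd.DataFrame(freq_cols, index=base.index)], axis=1)
```

Building the columns once and concatenating keeps the underlying block layout contiguous, which is what the PerformanceWarning is asking for.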

Migrate to using "3.0" instead of "v3" etc. for releases

Try to standardise on common naming for releases, in parameters and dataframes, common to what we say in documentation etc. This would mean, e.g., using "3.0" instead of "v3", "3.1" instead of "v3.1", etc. Would require some translation behind the scenes.

Improve docstrings

Docstrings could generally be improved, via:

  • Description of return values
  • Examples

Simplify handling of empty datasets

When accessing haplotype data, there are cases where there are no data for a given sample set and analysis, e.g., for the "arab" analysis where a sample set has no arabiensis samples. Currently the approach is to return None in cases like this.

Similarly, when accessing CNV discordant read calls data, there are no data for some contigs. Again the current approach is to return None.

However, it would probably be simpler, internally at least, if in these cases a dataset was returned, but with a 0 length dimension. E.g., if there are no samples for a given haplotypes dataset, return a dataset with 0 length samples dimension. E.g., if there are no variants for a CNV dataset for a given contig, return a dataset with a 0 length variants dimension.
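The same idea, illustrated with pandas for simplicity (the issue itself concerns xarray datasets, and the column names here are hypothetical): return an empty object with the expected schema instead of None, so callers can concatenate or iterate without special-casing.

```python
import pandas as pd

# Hypothetical schema for illustration only.
EXPECTED_COLUMNS = ["sample_id", "taxon"]

def load_metadata(found: bool) -> pd.DataFrame:
    # Instead of returning None when there are no data, return a
    # zero-length frame with the expected columns, so downstream code
    # (e.g., pd.concat over sample sets) works without None checks.
    if not found:
        return pd.DataFrame(columns=EXPECTED_COLUMNS)
    return pd.DataFrame({"sample_id": ["AB0001-C"], "taxon": ["arabiensis"]})

empty = load_metadata(found=False)
combined = pd.concat([load_metadata(True), empty])
```

With a zero-length result the concatenation above just works; with None it would raise.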

move to mask="gamb_colu" parameter default?

Many methods have a mask parameter. Some require this to be given, others will accept None which defaults to "gamb_colu_arab".

Firstly, this should be the same for all methods; secondly, if we go with the option of accepting None, what should it default to?

Add cohort access to Ag3.snp_allele_frequencies


Split from issue #51

Modify method Ag3.snp_allele_frequencies(transcript, cohorts, cohorts_analysis, ...):

  • Allow the cohorts parameter to be a dict (current behaviour continues to be supported) or a string. If a string, it must be the name of one of the columns in the cohorts dataframe.
  • Add a cohorts_analysis parameter which is a string with default value set to latest cohorts analysis. This will be used if the cohorts parameter is given as a string.
  • Other parameters left as-is.

Open questions:

  • Should this enforce a minimum sample size? Or give that as an option?

Example:

df_snp_af = ag3.snp_allele_frequencies(transcript="AGAP006028-RA", cohorts="admin1_month")

  • Use parameter name "cohort_analysis" throughout for consistency
  • Add "min_cohort_size" parameter to the snp_allele_frequencies() and gene_cnv_frequencies() methods, default value 10
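A sketch of how the string form of cohorts could be resolved internally. The metadata values below are invented for illustration; the idea is simply that a string names a column in the cohorts-augmented sample metadata, and each distinct value in that column defines one cohort of samples.

```python
import pandas as pd

# Invented sample metadata for illustration.
df_samples = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3", "s4"],
    "admin1_month": ["BF-09_2014_07", "BF-09_2014_07",
                     "ML-2_2014_10", "ML-2_2014_10"],
})

cohorts = "admin1_month"  # the proposed string form

# Resolve the string to the dict form already supported:
# cohort label -> list of sample identifiers.
coh_dict = {
    coh: df_samples.loc[df_samples[cohorts] == coh, "sample_id"].tolist()
    for coh in df_samples[cohorts].unique()
}
```

After this translation, the existing dict-based code path can be reused unchanged, which is why supporting both forms is cheap.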

Add filter columns to SNP effects dataframe

Rather than adding a site_mask parameter and actually removing filtered SNPs from the resulting dataframe, it might be more convenient to return a dataframe with all SNPs, including the three filters as dataframe columns. Then the user can see all SNPs but also have access to information on which filters are passed/failed.

Ag3 method for computing amino acid substitution frequencies

Currently it's possible to compute SNP allele frequencies. But often it's useful to compute frequencies of amino acid substitutions, because (1) this is often what the user is directly interested in, and (2) it simplifies the case where multiple SNPs cause the same amino acid substitution.

Placeholder issue to discuss proposals for design and implementation of such a function.

Add access to predefined cohorts

For analysing data from Ag3, it would be very useful to have predefined sample cohorts for popgen analyses, and for these to be accessible via the API.

Simplify handling of SNP effects in Ag3

Currently if a user wants SNP allele frequencies together with SNP effects, they have to call 2 methods and join/merge the resulting dataframes themselves.

It would be simpler IMO if we just added in the effects to the SNP allele frequencies.

E.g., we could change the signature of the Ag3.snp_allele_frequencies() to add an effects=True parameter. If True (default) then the effects would be automatically included in the resulting dataframe.
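The join itself is straightforward; a hedged sketch with invented column names and values, showing how an effects=True flag could fold the annotations into the frequencies result before returning:

```python
import pandas as pd

# Invented frequency and effects tables for illustration.
df_freq = pd.DataFrame({
    "position": [100, 101],
    "alt_allele": ["T", "G"],
    "BF-09_frq": [0.25, 0.0],
})
df_effects = pd.DataFrame({
    "position": [100, 101],
    "alt_allele": ["T", "G"],
    "effect": ["NON_SYNONYMOUS_CODING", "SYNONYMOUS_CODING"],
})

def snp_allele_frequencies(effects: bool = True) -> pd.DataFrame:
    # If requested (the proposed default), merge effect annotations
    # onto the frequencies by variant coordinates and alleles.
    if effects:
        return df_freq.merge(df_effects, on=["position", "alt_allele"])
    return df_freq

df = snp_allele_frequencies()
```

Users who want the old behaviour could pass effects=False, so the change stays backwards-compatible in spirit.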

Move default values for analysis parameters into constants

Several "analysis" parameters have a default value, which may be repeated several times if the parameter appears in several function signatures. It would be better to have these default values declared once via a constant at the top of the module, to make maintenance easier.
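The pattern is simple; a sketch with invented constant values (the real analysis version strings would come from the data releases):

```python
# Declared once at module level; every signature references the constant.
DEFAULT_COHORTS_ANALYSIS = "20211101"      # illustrative value
DEFAULT_SPECIES_ANALYSIS = "aim_20200422"  # illustrative value

def sample_metadata(cohorts_analysis: str = DEFAULT_COHORTS_ANALYSIS):
    return cohorts_analysis

def snp_allele_frequencies(cohorts_analysis: str = DEFAULT_COHORTS_ANALYSIS):
    return cohorts_analysis
```

Bumping to a new analysis version then means editing one line rather than every signature.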

Consistify seqid/contig

Currently we use the variable name "contig" throughout to mean a sequence identifier from a reference genome. However, the GFF loading returns a dataframe with the column name "seqid". We could consistify here by renaming that column in the GFF dataframes to "contig".
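The rename could happen once at the point where the annotations dataframe is loaded; a sketch with invented GFF rows:

```python
import pandas as pd

# Invented GFF-style dataframe for illustration.
df_gff = pd.DataFrame({
    "seqid": ["2R", "3L"],
    "start": [1, 100],
    "end": [50, 200],
})

# Rename the GFF "seqid" column to the name used everywhere else.
df_gff = df_gff.rename(columns={"seqid": "contig"})
```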

Support Python 3.9

Add 3.9 to the CI matrix.

(Probably can also now just use ubuntu-latest for CI.)

Cache results from SNP effects?

The snp_effects() method is pretty quick, but it would also be possible to cache the resulting dataframe for even quicker repeated calls.

Support multiple contigs in CNV datasets

Support providing multiple contigs to the contig parameter to the cnv_... methods in the Ag3 class, returning concatenated datasets, similar to what is currently supported for SNP and haplotype data.

Include cohorts in Ag3 sample metadata

The cohort metadata are very useful and it would make life easier if they were included with the sample metadata by default, rather than having to separately load and merge the sample metadata and cohort metadata dataframes. To implement this, the following changes could be made:

  1. Modify the signature of the Ag3.sample_metadata() method to include a cohorts_analysis parameter with a default value to the latest cohorts analysis.
  2. Modify the implementation of the Ag3.sample_metadata() method to load the cohorts metadata and join it to the general sample metadata, if the cohorts_analysis parameter is not None.

Note that a potential problem with doing this is that a future release may be made available before a new cohorts analysis covering it has been performed. In this case the release would have sample metadata but no cohorts metadata. This could be mitigated by either (a) ensuring we always run a cohorts analysis immediately after each release and before partners are notified, and/or (b) building in some logic to handle missing cohort files.

Add sample_query parameter to Ag3 methods that return frequencies tables

To simplify building tables of frequencies of either SNPs or CNVs, where often a user might only want to consider a subset of samples, proposed to add a sample_query parameter to the Ag3.snp_allele_frequencies() and Ag3.gene_cnv_frequencies() methods, which applies a pandas query and then only returns cohorts which contain samples from this selection.

Also might want to reconsider what happens when cohorts are too small to calculate frequencies, e.g., drop columns rather than return nans.
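The sample selection step could lean directly on pandas query syntax; a sketch with invented metadata, where sample_query is the proposed parameter:

```python
import pandas as pd

# Invented sample metadata for illustration.
df_samples = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3"],
    "taxon": ["gambiae", "coluzzii", "gambiae"],
    "year": [2014, 2014, 2012],
})

# The proposed sample_query parameter: a pandas query expression
# applied to the sample metadata before cohorts are built.
sample_query = "taxon == 'gambiae' and year == 2014"
df_selected = df_samples.query(sample_query)
```

Cohorts would then be derived only from df_selected, so any cohort containing none of the selected samples simply never appears in the output.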

Support genome regions when accessing data for Ag3

Several methods in the Ag3 class, including snp_sites(), site_filters(), snp_genotypes() and snp_dataset() take a contig argument and return values for a whole contig.

Proposed to deprecate the contig argument and replace with a more generalised region argument which could be any of the following:

  • A contig (e.g., "3L")
  • A contig region (e.g., "3L:1000000-2000000")
  • A gene (e.g., "AGAP004070")
  • A list/tuple of any of the above, in which case regions get concatenated

Internally this would require support for locating the indices bounding a region.
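Parsing the first two region forms can be sketched with the standard library; this is a hypothetical helper, and resolving a gene ID (the third form) would additionally need a lookup in the genome annotations, elided here:

```python
import re
from typing import Optional, Tuple

def parse_region(region: str) -> Tuple[str, Optional[int], Optional[int]]:
    # Match "contig:start-end", allowing commas in the coordinates.
    m = re.fullmatch(r"([^:]+):([\d,]+)-([\d,]+)", region)
    if m:
        contig = m.group(1)
        start = int(m.group(2).replace(",", ""))
        end = int(m.group(3).replace(",", ""))
        return contig, start, end
    # No coordinates: treat as a whole contig (or, in real code,
    # try to resolve the string as a gene identifier).
    return region, None, None

contig, start, end = parse_region("3L:1000000-2000000")
```

Once a region is normalised to (contig, start, end), locating the bounding indices in the sorted positions array is a pair of searchsorted calls.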

Modify naming of species metadata columns in Ag3

This issue proposes some modifications to how the species calls are handled and presented through sample metadata.

The main problem coming up is that there are three possible sources of species/taxon assignment, available through different means, and this can be confusing:

  • AIM species calls, included by default when calling Ag3.sample_metadata(), providing the species column (and several other columns)
  • PCA species calls, which can be included when calling Ag3.sample_metadata() if the species_calls parameter is given as ("20200422", "pca"), providing the species column (and several other columns).
  • Cohorts metadata, accessible via the Ag3.sample_cohorts() method, which provides the taxon column.

This is all potentially confusing for the user. Below are some proposed changes to improve this.

AIM species columns

Propose that we change the naming of the AIM species columns, from:

    aim_cols = (
        "aim_fraction_colu",
        "aim_fraction_arab",
        "species_gambcolu_arabiensis",
        "species_gambiae_coluzzii",
        "species",
    )

...to:

    aim_cols = (
        "aim_fraction_colu",
        "aim_fraction_arab",
        "aim_species_gambcolu_arabiensis",
        "aim_species_gambiae_coluzzii",
        "aim_species",
    )

This would make it easier to always talk about the "AIM species" assignment, and have that always visible in dataframes and pivot tables. I.e., it's always clear where the species assignment has come from.
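The renaming could be applied where the AIM calls are attached to the sample metadata; a sketch with an invented single-row table (only the rename mapping comes from the proposal above):

```python
import pandas as pd

# Proposed renaming of the AIM species columns.
AIM_RENAME = {
    "species_gambcolu_arabiensis": "aim_species_gambcolu_arabiensis",
    "species_gambiae_coluzzii": "aim_species_gambiae_coluzzii",
    "species": "aim_species",
}

# Invented AIM calls table for illustration.
df_aim = pd.DataFrame({
    "aim_fraction_colu": [0.9],
    "aim_fraction_arab": [0.1],
    "species_gambcolu_arabiensis": ["gamb_colu"],
    "species_gambiae_coluzzii": ["coluzzii"],
    "species": ["coluzzii"],
})
df_aim = df_aim.rename(columns=AIM_RENAME)
```

The analogous mapping for the PCA columns (proposed below) would work the same way.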

PCA species columns

Propose that we change the naming of the PCA species columns, from:

    pca_cols = (
        "PC1",
        "PC2",
        "species_gambcolu_arabiensis",
        "species_gambiae_coluzzii",
        "species",
    )

...to:

    pca_cols = (
        "pca_species_pc1",
        "pca_species_pc2",
        "pca_species_gambcolu_arabiensis",
        "pca_species_gambiae_coluzzii",
        "pca_species",
    )

In general we don't recommend use of the PCA species assignment, but in case anyone ever does use it, these column names make it clear what is being used.

Notes

Still open for discussion is whether the cohort metadata should also be included when calling Ag3.sample_metadata(). That would provide the taxon column, which is the best column to use as it gives the most refined view of taxa within the dataset. However, I'll raise a separate issue to discuss that.

After this change is implemented, there will be some downstream consequences, as the vector data user guide will likely need to be updated (at least rerun), and possibly also partner user guides.

Add cohort access to Ag3.gene_cnv_frequencies()

Split from #51

Modify method Ag3.gene_cnv_frequencies(contig, cohorts, cohorts_analysis, ...):

  • Allow the cohorts parameter to be a dict (current behaviour continues to be supported) or a string. If a string, it must be the name of one of the columns in the cohorts dataframe.
  • Add a cohorts_analysis parameter which is a string with default value set to latest cohorts analysis. This will be used if the cohorts parameter is given as a string.
  • Other parameters left as-is.

Open questions:

  • Should this enforce a minimum sample size? Or give that as an option?

Example:

df_cnv_gene = ag3.gene_cnv_frequencies(contig="2R", cohorts="admin1_month")

Say "cohort" instead of "population"

For functions like snp_allele_frequencies, the "populations" parameter specifies groups of mosquitoes. However, "population" isn't really the right term. Also, sgkit is using "cohort" to mean a group of samples. We should probably consistify now, easier than later.

Veff maintenance

There are a couple of further pieces of maintenance/tidying we could do on the veff module:

  • Currently if a variant falls within an intron, we're calling _get_intron_effect(), but we could shortcut and call directly to _get_within_intron_effect(). This should provide a little speedup, and we could probably also delete _get_intron_effect() completely.
  • We currently define get_effects() as a function, then attach it as a method on the Annotator class manually. This is a bit unconventional and makes IDEs like pycharm complain unnecessarily. We could be a bit more conventional here and just move the functions to be methods.
