Analyse MalariaGEN data from Python

Home Page: https://malariagen.github.io/malariagen-data-python/latest/

License: MIT License


malariagen-data-python's Introduction

malariagen_data - analyse MalariaGEN data from Python

This Python package provides methods for accessing and analysing data from MalariaGEN.

Installation

The malariagen_data Python package is available from the Python package index (PyPI) and can be installed via pip, e.g.:

pip install malariagen-data

Documentation

Documentation of the classes and methods in the public API is available from the following locations:

Release notes (change log)

See GitHub releases for release notes.

Developer setup

To get set up for development, see this video and the instructions below.

Fork and clone this repo:

git clone git@github.com:[username]/malariagen-data-python.git

Install Python, e.g.:

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.9 python3.9-venv

Install pipx, e.g.:

python3.9 -m pip install --user pipx
python3.9 -m pipx ensurepath

Install poetry, e.g.:

pipx install poetry==1.8.2 --python=/usr/bin/python3.9

Create development environment:

cd malariagen-data-python
poetry env use 3.9
poetry install

Activate development environment:

poetry shell

Install pre-commit and pre-commit hooks:

pipx install pre-commit --python=/usr/bin/python3.9
pre-commit install

Run pre-commit checks (isort, black, blackdoc, flake8, ...) manually:

pre-commit run --all-files

Run fast unit tests using simulated data:

poetry run pytest -v tests/anoph

To run legacy tests which read data from GCS, you'll need to install the Google Cloud CLI. E.g., if on Linux:

./install_gcloud.sh

You'll then need to obtain application-default credentials, e.g.:

./google-cloud-sdk/bin/gcloud auth application-default login

Once this is done, you can run legacy tests:

poetry run pytest --ignore=tests/anoph -v tests

Tests will run slowly the first time, as data required for testing will be read from GCS. Subsequent runs will be faster as data will be cached locally in the "gcs_cache" folder.

Release process

Create a new GitHub release. That's it. This will automatically trigger publishing of a new release to PyPI and a new version of the documentation via GitHub Actions.

The version switcher for the documentation can then be updated by modifying the docs/source/_static/switcher.json file accordingly.

malariagen-data-python's People

Contributors

ahernank, alimanfoo, cclarkson, harbi811, jonbrenas, kathryn1995, kellylbennett, leehart, nkran, nw20, sanjaynagi


malariagen-data-python's Issues

Add data variables to Ag3.snp_dataset()

Proposed to add some additional variables to the dataset returned by the Ag3.snp_dataset() method, including:

  • sample_... variables with sample metadata
  • call_GQ, call_AD, call_MQ
  • variant_filter_pass_... with site filters arrays

Reverse expected/actual in tests

The convention for pytest assertions seems to be to put the actual value first and the expected value second, the opposite of what I previously thought. PyCharm follows this convention in how it reports errors. So, to make the pytest and PyCharm outputs easier to interpret, we should swap the compared variables in all test assertions.

`gene_cnv_frequencies` PerformanceWarning: DataFrame is highly fragmented

We get four warnings about fragmented pandas dataframes when we use the cohort sets (due to how many columns we are building I think) - would be good to fix the code to avoid these.

    /home/conda/store/4b0d6587ea35727c87000368f18c95bf1e775a25ab9791007f1cca148f9a452c-binder-v3.2.0/lib/python3.8/site-packages/malariagen_data/ag3.py:1755: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
      df[f"{coh}_amp"] = amp_freq_coh

The same warning is raised at ag3.py lines 1746, 1747 and 1756, for the `df[f"{coh}_amp"] = np.nan`, `df[f"{coh}_del"] = np.nan` and `df[f"{coh}_del"] = del_freq_coh` assignments.
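The fix the warning suggests can be sketched as follows. This is an illustrative example, not the actual ag3.py code: the cohort names and frequency values are placeholders, and the point is only to show collecting all new columns first and joining them in a single pd.concat call rather than inserting one column at a time.

```python
import numpy as np
import pandas as pd

# Illustrative cohort names and gene table (not real data).
cohorts = ["ML-2_gamb_2014", "BF-09_colu_2014"]
base = pd.DataFrame({"gene_id": ["AGAP004070", "AGAP006028"]})

# Collect all new columns in a dict instead of repeated df[...] = ...
freq_cols = {}
for coh in cohorts:
    # Placeholder frequencies; real code would compute amp/del freqs.
    freq_cols[f"{coh}_amp"] = np.zeros(len(base))
    freq_cols[f"{coh}_del"] = np.zeros(len(base))

# A single concat avoids the repeated-insert fragmentation warning.
df = pd.concat([base, pd.DataFrame(freq_cols, index=base.index)], axis=1)
```

Building the columns once and concatenating keeps the underlying block layout contiguous, which is what the PerformanceWarning is asking for.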

Migrate to using "3.0" instead of "v3" etc. for releases

Try to standardise on common naming for releases, in parameters and dataframes, common to what we say in documentation etc. This would mean, e.g., using "3.0" instead of "v3", "3.1" instead of "v3.1", etc. Would require some translation behind the scenes.

Improve docstrings

Docstrings could generally be improved, via:

  • Description of return values
  • Examples

Simplify handling of empty datasets

When accessing haplotype data, there are cases where there are no data for a given sample set and analysis, e.g., for the "arab" analysis where a sample set has no arabiensis samples. Currently the approach is to return None in cases like this.

Similarly, when accessing CNV discordant read calls data, there are no data for some contigs. Again the current approach is to return None.

However, it would probably be simpler, internally at least, if in these cases a dataset was returned, but with a 0 length dimension. E.g., if there are no samples for a given haplotypes dataset, return a dataset with 0 length samples dimension. E.g., if there are no variants for a CNV dataset for a given contig, return a dataset with a 0 length variants dimension.
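The same idea, illustrated with pandas for simplicity (the issue itself concerns xarray datasets, and the column names here are hypothetical): return an empty object with the expected schema instead of None, so callers can concatenate or iterate without special-casing.

```python
import pandas as pd

# Hypothetical schema for illustration only.
EXPECTED_COLUMNS = ["sample_id", "taxon"]

def load_metadata(found: bool) -> pd.DataFrame:
    # Instead of returning None when there are no data, return a
    # zero-length frame with the expected columns, so downstream code
    # (e.g., pd.concat over sample sets) works without None checks.
    if not found:
        return pd.DataFrame(columns=EXPECTED_COLUMNS)
    return pd.DataFrame({"sample_id": ["AB0001-C"], "taxon": ["arabiensis"]})

empty = load_metadata(found=False)
combined = pd.concat([load_metadata(True), empty])
```

With a zero-length result the concatenation above just works; with None it would raise.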

move to mask="gamb_colu" parameter default?

Many methods have a mask parameter. Some require this to be given, others will accept None which defaults to "gamb_colu_arab".

Firstly, this should be the same for all methods; secondly, if we go with the option of accepting None, what should it default to?

Add cohort access to Ag3.snp_allele_frequencies


Split from issue #51

Modify method Ag3.snp_allele_frequencies(transcript, cohorts, cohorts_analysis, ...):

  • Allow the cohorts parameter to be a dict (current behaviour continues to be supported) or a string. If a string, it must be the name of one of the columns in the cohorts dataframe.
  • Add a cohorts_analysis parameter which is a string with default value set to latest cohorts analysis. This will be used if the cohorts parameter is given as a string.
  • Other parameters left as-is.

Open questions:

  • Should this enforce a minimum sample size? Or give that as an option?

Example:

df_snp_af = ag3.snp_allele_frequencies(transcript="AGAP006028-RA", cohorts="admin1_month")

  • Use parameter name "cohort_analysis" throughout for consistency
  • Add "min_cohort_size" parameter to the snp_allele_frequencies() and gene_cnv_frequencies() methods, default value 10
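A sketch of how the string form of cohorts could be resolved internally. The metadata values below are invented for illustration; the idea is simply that a string names a column in the cohorts-augmented sample metadata, and each distinct value in that column defines one cohort of samples.

```python
import pandas as pd

# Invented sample metadata for illustration.
df_samples = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3", "s4"],
    "admin1_month": ["BF-09_2014_07", "BF-09_2014_07",
                     "ML-2_2014_10", "ML-2_2014_10"],
})

cohorts = "admin1_month"  # the proposed string form

# Resolve the string to the dict form already supported:
# cohort label -> list of sample identifiers.
coh_dict = {
    coh: df_samples.loc[df_samples[cohorts] == coh, "sample_id"].tolist()
    for coh in df_samples[cohorts].unique()
}
```

After this translation, the existing dict-based code path can be reused unchanged, which is why supporting both forms is cheap.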

Add filter columns to SNP effects dataframe

Rather than adding a site_mask parameter and actually removing filtered SNPs from the resulting dataframe, it might be more convenient to return a dataframe with all SNPs, including the three filters as dataframe columns. Then the user can see all SNPs but also have access to information on which filters are passed/failed.

Ag3 method for computing amino acid substitution frequencies

Currently it's possible to compute SNP allele frequencies. But often it's useful to compute frequencies of amino acid substitutions, because (1) this is often what the user is directly interested in, and (2) it simplifies the case where multiple SNPs cause the same amino acid substitution.

Placeholder issue to discuss proposals for design and implementation of such a function.

Add access to predefined cohorts

For analysing data from Ag3, it would be very useful to have predefined sample cohorts for popgen analyses, and for these to be accessible via the API.

Simplify handling of SNP effects in Ag3

Currently if a user wants SNP allele frequencies together with SNP effects, they have to call 2 methods and join/merge the resulting dataframes themselves.

It would be simpler IMO if we just added in the effects to the SNP allele frequencies.

E.g., we could change the signature of the Ag3.snp_allele_frequencies() to add an effects=True parameter. If True (default) then the effects would be automatically included in the resulting dataframe.
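The join itself is straightforward; a hedged sketch with invented column names and values, showing how an effects=True flag could fold the annotations into the frequencies result before returning:

```python
import pandas as pd

# Invented frequency and effects tables for illustration.
df_freq = pd.DataFrame({
    "position": [100, 101],
    "alt_allele": ["T", "G"],
    "BF-09_frq": [0.25, 0.0],
})
df_effects = pd.DataFrame({
    "position": [100, 101],
    "alt_allele": ["T", "G"],
    "effect": ["NON_SYNONYMOUS_CODING", "SYNONYMOUS_CODING"],
})

def snp_allele_frequencies(effects: bool = True) -> pd.DataFrame:
    # If requested (the proposed default), merge effect annotations
    # onto the frequencies by variant coordinates and alleles.
    if effects:
        return df_freq.merge(df_effects, on=["position", "alt_allele"])
    return df_freq

df = snp_allele_frequencies()
```

Users who want the old behaviour could pass effects=False, so the change stays backwards-compatible in spirit.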

Move default values for analysis parameters into constants

Several "analysis" parameters have a default value, which may be repeated several times if the parameter appears in several function signatures. It would be better to have these default values declared once via a constant at the top of the module, to make maintenance easier.
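The pattern is simple; a sketch with invented constant values (the real analysis version strings would come from the data releases):

```python
# Declared once at module level; every signature references the constant.
DEFAULT_COHORTS_ANALYSIS = "20211101"      # illustrative value
DEFAULT_SPECIES_ANALYSIS = "aim_20200422"  # illustrative value

def sample_metadata(cohorts_analysis: str = DEFAULT_COHORTS_ANALYSIS):
    return cohorts_analysis

def snp_allele_frequencies(cohorts_analysis: str = DEFAULT_COHORTS_ANALYSIS):
    return cohorts_analysis
```

Bumping to a new analysis version then means editing one line rather than every signature.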

Consistify seqid/contig

Currently we use the variable name "contig" throughout to mean a sequence identifier from a reference genome. However, the GFF loading returns a dataframe with the column name "seqid". We could consistify here by renaming that column in the GFF dataframes to "contig".
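The rename could happen once at the point where the annotations dataframe is loaded; a sketch with invented GFF rows:

```python
import pandas as pd

# Invented GFF-style dataframe for illustration.
df_gff = pd.DataFrame({
    "seqid": ["2R", "3L"],
    "start": [1, 100],
    "end": [50, 200],
})

# Rename the GFF "seqid" column to the name used everywhere else.
df_gff = df_gff.rename(columns={"seqid": "contig"})
```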

Support Python 3.9

Add 3.9 to the CI matrix.

(Probably can also now just use ubuntu-latest for CI.)

Cache results from SNP effects?

The snp_effects() method is pretty quick, but it would also be possible to cache the resulting dataframe for even quicker repeated calls.

Support multiple contigs in CNV datasets

Support providing multiple contigs to the contig parameter to the cnv_... methods in the Ag3 class, returning concatenated datasets, similar to what is currently supported for SNP and haplotype data.

Include cohorts in Ag3 sample metadata

The cohort metadata are very useful and it would make life easier if they were included with the sample metadata by default, rather than having to separately load and merge the sample metadata and cohort metadata dataframes. To implement this, the following changes could be made:

  1. Modify the signature of the Ag3.sample_metadata() method to include a cohorts_analysis parameter with a default value to the latest cohorts analysis.
  2. Modify the implementation of the Ag3.sample_metadata() method to load the cohorts metadata and join it to the general sample metadata, if the cohorts_analysis parameter is not None.

Note that a potential problem with doing this is that a future release may be made available before a new cohorts analysis covering it has been performed. In this case the release would have sample metadata but no cohorts metadata. This could be mitigated by either (a) ensuring we always run a cohorts analysis immediately after each release and before partners are notified, and/or (b) building in some logic to handle missing cohort files.

Add sample_query parameter to Ag3 methods that return frequencies tables

To simplify building tables of frequencies of either SNPs or CNVs, where often a user might only want to consider a subset of samples, proposed to add a sample_query parameter to the Ag3.snp_allele_frequencies() and Ag3.gene_cnv_frequencies() methods, which applies a pandas query and then only returns cohorts which contain samples from this selection.

Also might want to reconsider what happens when cohorts are too small to calculate frequencies, e.g., drop columns rather than return nans.
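The sample selection step could lean directly on pandas query syntax; a sketch with invented metadata, where sample_query is the proposed parameter:

```python
import pandas as pd

# Invented sample metadata for illustration.
df_samples = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3"],
    "taxon": ["gambiae", "coluzzii", "gambiae"],
    "year": [2014, 2014, 2012],
})

# The proposed sample_query parameter: a pandas query expression
# applied to the sample metadata before cohorts are built.
sample_query = "taxon == 'gambiae' and year == 2014"
df_selected = df_samples.query(sample_query)
```

Cohorts would then be derived only from df_selected, so any cohort containing none of the selected samples simply never appears in the output.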

Support genome regions when accessing data for Ag3

Several methods in the Ag3 class, including snp_sites(), site_filters(), snp_genotypes() and snp_dataset() take a contig argument and return values for a whole contig.

Proposed to deprecate the contig argument and replace with a more generalised region argument which could be any of the following:

  • A contig (e.g., "3L")
  • A contig region (e.g., "3L:1000000-2000000")
  • A gene (e.g., "AGAP004070")
  • A list/tuple of any of the above, in which case regions get concatenated

Internally this would require support for locating the indices bounding a region.
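Parsing the first two region forms can be sketched with the standard library; this is a hypothetical helper, and resolving a gene ID (the third form) would additionally need a lookup in the genome annotations, elided here:

```python
import re
from typing import Optional, Tuple

def parse_region(region: str) -> Tuple[str, Optional[int], Optional[int]]:
    # Match "contig:start-end", allowing commas in the coordinates.
    m = re.fullmatch(r"([^:]+):([\d,]+)-([\d,]+)", region)
    if m:
        contig = m.group(1)
        start = int(m.group(2).replace(",", ""))
        end = int(m.group(3).replace(",", ""))
        return contig, start, end
    # No coordinates: treat as a whole contig (or, in real code,
    # try to resolve the string as a gene identifier).
    return region, None, None

contig, start, end = parse_region("3L:1000000-2000000")
```

Once a region is normalised to (contig, start, end), locating the bounding indices in the sorted positions array is a pair of searchsorted calls.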

Modify naming of species metadata columns in Ag3

This issue proposes some modifications to how the species calls are handled and presented through sample metadata.

The main problem coming up is that there are three possible sources of species/taxon assignment, available through different means, and this can be confusing:

  • AIM species calls, included by default when calling Ag3.sample_metadata(), providing the species column (and several other columns)
  • PCA species calls, which can be included when calling Ag3.sample_metadata() if the species_calls parameter is given as ("20200422", "pca"), providing the species column (and several other columns).
  • Cohorts metadata, accessible via the Ag3.sample_cohorts() method, which provides the taxon column.

This is all potentially confusing for the user. Below are some proposed changes to improve this.

AIM species columns

Propose that we change the naming of the AIM species columns, from:

    aim_cols = (
        "aim_fraction_colu",
        "aim_fraction_arab",
        "species_gambcolu_arabiensis",
        "species_gambiae_coluzzii",
        "species",
    )

...to:

    aim_cols = (
        "aim_fraction_colu",
        "aim_fraction_arab",
        "aim_species_gambcolu_arabiensis",
        "aim_species_gambiae_coluzzii",
        "aim_species",
    )

This would make it easier to always talk about the "AIM species" assignment, and have that always visible in dataframes and pivot tables. I.e., it's always clear where the species assignment has come from.
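The renaming could be applied where the AIM calls are attached to the sample metadata; a sketch with an invented single-row table (only the rename mapping comes from the proposal above):

```python
import pandas as pd

# Proposed renaming of the AIM species columns.
AIM_RENAME = {
    "species_gambcolu_arabiensis": "aim_species_gambcolu_arabiensis",
    "species_gambiae_coluzzii": "aim_species_gambiae_coluzzii",
    "species": "aim_species",
}

# Invented AIM calls table for illustration.
df_aim = pd.DataFrame({
    "aim_fraction_colu": [0.9],
    "aim_fraction_arab": [0.1],
    "species_gambcolu_arabiensis": ["gamb_colu"],
    "species_gambiae_coluzzii": ["coluzzii"],
    "species": ["coluzzii"],
})
df_aim = df_aim.rename(columns=AIM_RENAME)
```

The analogous mapping for the PCA columns (proposed below) would work the same way.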

PCA species columns

Propose that we change the naming of the PCA species columns, from:

    pca_cols = (
        "PC1",
        "PC2",
        "species_gambcolu_arabiensis",
        "species_gambiae_coluzzii",
        "species",
    )

...to:

    pca_cols = (
        "pca_species_pc1",
        "pca_species_pc2",
        "pca_species_gambcolu_arabiensis",
        "pca_species_gambiae_coluzzii",
        "pca_species",
    )

In general we don't recommend use of the PCA species assignment, but in case anyone ever does use it, these column names make it clear what is being used.

Notes

Still open for discussion is whether the cohort metadata should also be included when calling Ag3.sample_metadata(). That would provide the taxon column, which is the best column to use as it gives the most refined view of taxa within the dataset. However, I'll raise a separate issue to discuss that.

After this change is implemented, there will be some downstream consequences, as the vector data user guide will likely need to be updated (at least rerun), and possibly also partner user guides.

Add cohort access to Ag3.gene_cnv_frequencies()

Split from #51

Modify method Ag3.gene_cnv_frequencies(contig, cohorts, cohorts_analysis, ...):

  • Allow the cohorts parameter to be a dict (current behaviour continues to be supported) or a string. If a string, it must be the name of one of the columns in the cohorts dataframe.
  • Add a cohorts_analysis parameter which is a string with default value set to latest cohorts analysis. This will be used if the cohorts parameter is given as a string.
  • Other parameters left as-is.

Open questions:

  • Should this enforce a minimum sample size? Or give that as an option?

Example:

df_cnv_gene = ag3.gene_cnv_frequencies(contig="2R", cohorts="admin1_month")

Say "cohort" instead of "population"

For functions like snp_allele_frequencies, the "populations" parameter specifies groups of mosquitoes. However, "population" isn't really the right term. Also, sgkit is using "cohort" to mean a group of samples. We should probably consistify now, easier than later.

Veff maintenance

There are a couple of further pieces of maintenance/tidying we could do on the veff module:

  • Currently if a variant falls within an intron, we're calling _get_intron_effect(), but we could shortcut and call directly to _get_within_intron_effect(). This should provide a little speedup, and we could probably also delete _get_intron_effect() completely.
  • We currently define get_effects() as a function, then attach it as a method on the Annotator class manually. This is a bit unconventional and makes IDEs like pycharm complain unnecessarily. We could be a bit more conventional here and just move the functions to be methods.
