hagax8 / ancestry_viz

GTM and t-SNE classification and clustering of 1000 Genomes Project populations

Home Page: https://lovingscience.com/ancestries/human

License: MIT License

Python 100.00%
gtm t-sne generative-topographic-mapping ancestry 1000genomes clustering

ancestry_viz's Introduction

Ancestry clustering

This tutorial explains how to cluster and classify genomes using 1000 Genomes data with GTM (our ugtm implementation) and t-SNE (sklearn implementation). Data files used in the following tutorial can be downloaded from https://lovingscience.com/ancestries.

Requirements

The following Python packages are required:

  • ugtm
  • scikit-learn (imported as sklearn)
  • altair
  • matplotlib
  • numpy
  • pandas

Files in directory

  • worldmap_1000G.py: Python script, creates interactive visualization gathering GTM, t-SNE and PCA
  • runGTM.py: runs GTM (using ugtm package) or t-SNE (using sklearn)
  • data: directory, contains csv data files
  • data/dataframe_1000G_noadmixed.csv: csv file, 1000 Genomes project dataframe with corresponding t-SNE and GTM coordinates

Download files

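You can download files for ancestry classification based on 1000 Genomes Phase 3 data, already formatted for this software, from https://lovingscience.com/ancestries. In this tutorial, we will use the following files (as referenced by the commands below):

  • recoded_1000G.noadmixed.mat: genotype data (passed to --data)
  • recoded_1000G.raw.noadmixed.lbls3 and recoded_1000G.raw.noadmixed.lbls3_3: class labels (passed to --labels)
  • recoded_1000G.raw.noadmixed.ids: individual identifiers (passed to --ids)
  • recoded_1000G_MXL.mat: test set to project onto the map (passed to --test)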

You can find out how these files were created by clicking here.

Build GTM and t-SNE maps

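To build a GTM with parameters [k, m, l, s] = [16, 4, 0.1, 0.3] (l is the regularization coefficient set with --regularization, s is the RBF width factor set with --rbf_width_factor) on 10 principal components, run the following command: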

python runGTM.py --model GTM --data recoded_1000G.noadmixed.mat \
--labels recoded_1000G.raw.noadmixed.lbls3 --labeltype discrete \
--out outputname --pca --n_components 10 --regularization 0.1 \
--rbf_width_factor 0.3 --missing --missing_strategy median \
--random_state 8 --ids recoded_1000G.raw.noadmixed.ids

Note that our genotype file has missing values, which we handle with the --missing and --missing_strategy options. You should obtain a pdf file and an html file. The html file looks like this: 1000G_GTM_20populations.html
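The preprocessing requested by --missing --missing_strategy median --pca --n_components 10 can be illustrated with a minimal NumPy sketch. This is not the code used by runGTM.py: the helper names (impute_median, pca_reduce), the toy genotype matrix, and the order of the two steps are assumptions made for illustration only.

```python
import numpy as np


def impute_median(X):
    """Replace each NaN with the median of its column (hypothetical helper)."""
    X = np.array(X, dtype=float)
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X


def pca_reduce(X, n_components=10):
    """Project X onto its first principal components using an SVD."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    scores = Xc @ Vt[:n_components].T
    return scores, explained[:n_components].sum()


# Toy stand-in for a genotype matrix (individuals x variants, coded 0/1/2),
# with about 5% of the entries set to missing.
rng = np.random.default_rng(8)
genotypes = rng.integers(0, 3, size=(50, 200)).astype(float)
genotypes[rng.random(genotypes.shape) < 0.05] = np.nan

filled = impute_median(genotypes)
scores, variance_share = pca_reduce(filled, n_components=10)
print(scores.shape, f"{100 * variance_share:.1f}% of the variance")
```

The share of variance printed at the end plays the same role as the "Used 10 components explaining ...% of the variance" line that runGTM.py writes to its log.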

To build a t-SNE map, run:

python runGTM.py --model t-SNE --data recoded_1000G.noadmixed.mat \
--labels recoded_1000G.raw.noadmixed.lbls3 --labeltype discrete \
--out outputname --pca --n_components 10 \
--missing --missing_strategy median \
--random_state 8 --ids recoded_1000G.raw.noadmixed.ids

The resulting t-SNE map can be viewed here: 1000G_t-SNE_20populations.html
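For readers who prefer to call scikit-learn directly, the t-SNE step could look roughly like the sketch below. It assumes the input has already been reduced to 10 principal components; the toy data and the perplexity value are illustrative assumptions, not the settings used by runGTM.py.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(8)

# Toy stand-in for 10 principal components of two well-separated groups.
pcs = np.vstack(
    [rng.normal(loc=center, scale=0.5, size=(30, 10)) for center in (0.0, 4.0)]
)

# perplexity must be smaller than the number of samples (60 here).
embedding = TSNE(n_components=2, perplexity=10, random_state=8).fit_transform(pcs)
print(embedding.shape)
```

Unlike GTM, t-SNE as implemented in scikit-learn has no transform method for new samples, so a test set cannot be projected onto an existing t-SNE map in the same way.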

To evaluate classification performance in a cross-validation experiment and compare GTM with a linear SVM, run:

python runGTM.py --model GTM --data recoded_1000G.noadmixed.mat \
--labels recoded_1000G.raw.noadmixed.lbls3_3 --labeltype discrete \
--out outputname --pca --n_components 10 \
--missing --missing_strategy median \
--random_state 8 --crossvalidate

This produces per-class reports. By default, class priors are equiprobable (cf. the --prior option), which is generally appropriate only when classes are balanced. For imbalanced classes, use the "--prior estimated" option.
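The effect of the prior can be seen on a toy example. The sketch below is an illustration of Bayes' rule, not code from ugtm: the function names, the label counts, and the likelihood values are all made up.

```python
import numpy as np


def class_priors(labels, mode="equiprobable"):
    """Return the sorted class names and their priors (hypothetical helper)."""
    classes, counts = np.unique(labels, return_counts=True)
    if mode == "equiprobable":
        priors = np.full(len(classes), 1.0 / len(classes))
    elif mode == "estimated":
        priors = counts / counts.sum()  # class frequencies in the training set
    else:
        raise ValueError(f"unknown prior mode: {mode}")
    return classes, priors


def posterior(likelihoods, priors):
    """Bayes' rule: posterior is proportional to likelihood times prior."""
    unnormalized = np.asarray(likelihoods, dtype=float) * priors
    return unnormalized / unnormalized.sum()


# Imbalanced toy training set: 90 AFR individuals and 10 EUR individuals.
labels = ["AFR"] * 90 + ["EUR"] * 10
# Made-up likelihoods p(x | AFR) and p(x | EUR) for one test individual.
likelihoods = np.array([0.4, 0.6])

for mode in ("equiprobable", "estimated"):
    classes, priors = class_priors(labels, mode)
    print(mode, dict(zip(classes, posterior(likelihoods, priors).round(3))))
```

With equiprobable priors the individual is assigned to the rarer class because its likelihood is slightly higher; with estimated priors the class frequencies outweigh that small difference.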

Train on provided data and project a test set onto the map:

An advantage of generative topographic mapping (GTM) is that external test sets can be projected onto the map without retraining it. The ugtm package also includes functions for classification models: it generates posterior probabilities that test set individuals (--test) belong to a specific class, based on the class labels (--labels) of the training set (--data).

python runGTM.py --model GTM --data recoded_1000G.noadmixed.mat \
--test recoded_1000G_MXL.mat --labels recoded_1000G.raw.noadmixed.lbls3_3 \
--labeltype discrete --out outputname --pca --n_components 10 \
--missing --missing_strategy median \
--random_state 8 

This will give us:

  • predictions for individuals (output_indiv_predictions.csv)
  • posterior probabilities for each ancestry (output_indiv_probabilities.csv)
  • posterior probabilities for the whole test set (output_group_probabilities.csv)
  • a map with projected test set colored in black.

The projection for MXL population (Mexicans) can be visualized here: 1000G_GTM_projection_MXL.html
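How class posterior probabilities can be derived for projected individuals is sketched below. This is one common way to build a classifier on top of GTM responsibilities (the probability that an individual belongs to each map node); it is an assumption for illustration and is not taken from the ugtm source. The function names, the population names pop_A and pop_B, and the toy responsibilities are hypothetical.

```python
import numpy as np


def node_class_probabilities(resp_train, labels, priors=None):
    """Estimate P(class | node) from training responsibilities.

    resp_train has shape (n_train, n_nodes); each row sums to 1.
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # p(node | class): mean responsibility vector of each class's members.
    node_given_class = np.vstack(
        [resp_train[labels == c].mean(axis=0) for c in classes]
    )
    if priors is None:
        priors = np.full(len(classes), 1.0 / len(classes))  # equiprobable
    joint = node_given_class * np.asarray(priors)[:, None]  # classes x nodes
    return classes, joint / joint.sum(axis=0, keepdims=True)


def predict_proba(resp_test, class_given_node):
    """P(class | individual) = sum over nodes of R[i, k] * P(class | node k)."""
    return resp_test @ class_given_node.T


# Toy map with 4 nodes: pop_A sits on nodes 0-1, pop_B on nodes 2-3.
resp_train = np.eye(4)
labels = ["pop_A", "pop_A", "pop_B", "pop_B"]
classes, class_given_node = node_class_probabilities(resp_train, labels)

# One projected test individual, mostly on the pop_A side of the map.
resp_test = np.array([[0.7, 0.1, 0.2, 0.0]])
proba = predict_proba(resp_test, class_given_node)
print(dict(zip(classes, proba[0].round(3))))
```

Because the test individual is only projected, the map and the node class probabilities stay fixed; only the test responsibilities have to be computed.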

Addendum 1: map based on AFR superpopulation only

To construct t-SNE and GTM maps based on AFR populations:

  • Download:

  • Build GTM and t-SNE:

    python runGTM.py --model GTM --data recoded_1000G.noadmixed.AFR.mat \
    --labels recoded_1000G.raw.noadmixed.AFR.lbls3 --labeltype discrete \
    --out 1000G_GTM_AFR --pca --n_components 10 \
    --regularization 0.1 --rbf_width_factor 0.3 \
    --missing --missing_strategy median \
    --random_state 8 --ids recoded_1000G.raw.noadmixed.AFR.ids
    
    python runGTM.py --model t-SNE --data recoded_1000G.noadmixed.AFR.mat \
    --labels recoded_1000G.raw.noadmixed.AFR.lbls3 --labeltype discrete \
    --out 1000G_t-SNE_AFR --pca --n_components 10 \
    --missing --missing_strategy median \
    --random_state 8 --ids recoded_1000G.raw.noadmixed.AFR.ids
    
  • African subpopulations classification performance:

    python runGTM.py --model GTM --data recoded_1000G.noadmixed.AFR.mat \
    --labels recoded_1000G.raw.noadmixed.AFR.lbls3 \
    --labeltype discrete --out outputname --pca --n_components 10 \
    --missing --missing_strategy median \
    --random_state 8 --crossvalidate

Addendum 2: Arabidopsis thaliana geographic visualization

Cf. github repository https://github.com/hagax8/arabidopsis_viz

ancestry_viz's People

Contributors

hagax8

ancestry_viz's Issues

Legends and Tooltips

How do I add legends to the plot? Also, when I hover over the projected test data in the output html file, no tooltip appears, unlike in the GTM and t-SNE plots.

Error with matplotlib

`python runGTM.py --model GTM --data data/recoded_1000G.mat --labels data/recoded_1000G.raw.noadmixed.lbls3 --labeltype discrete --out data/1000G_GTMbuildV01 --pca --n_components 10 --regularization 0.1 --rbf_width_factor 0.3 --missing --missing_strategy median --random_state 8 --ids vj.1000g.ids

Namespace(alpha=0.5, cname='Spectral_r', crossvalidate=False, filenamedat='data/recoded_1000G.mat', filenameids='vj.1000g.ids', filenamelbls='data/recoded_1000G.raw.noadmixed.lbls3', filenametestids=None, grid_size=0, interpolate=False, kernel='euclidean', labeltype='discrete', missing=True, missing_strategy='median', model='GTM', n_components=10, n_neighbors=1, output='data/1000G_GTMbuildV01', pca=True, pointsize=1.0, predict_mode='bayes', prior='equiprobable', random_state=8, rbf_grid_size=0, rbf_width_factor=0.3, regularization=0.1, representation='modes', svm_epsilon=1.0, svm_gamma=1.0, svm_margin=1.0, test=None, usetest=None, verbose=False)

User provided model, data file and label names.

Used 10 components explaining 7.46560685665743% of the variance

k:17, m:4, regul:0.1, s:0.3
time taken for GTM: 22.735700845718384
Traceback (most recent call last):
File "/home/user/anaconda3/lib/python3.7/site-packages/matplotlib/colors.py", line 174, in to_rgba
rgba = _colors_full_map.cache[c, alpha]
KeyError: (6, None)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/user/anaconda3/lib/python3.7/site-packages/matplotlib/axes/_axes.py", line 4232, in scatter
colors = mcolors.to_rgba_array(c)
File "/home/user/anaconda3/lib/python3.7/site-packages/matplotlib/colors.py", line 275, in to_rgba_array
result[i] = to_rgba(cc, alpha)
File "/home/user/anaconda3/lib/python3.7/site-packages/matplotlib/colors.py", line 176, in to_rgba
rgba = _to_rgba_no_colorcycle(c, alpha)
File "/home/user/anaconda3/lib/python3.7/site-packages/matplotlib/colors.py", line 227, in _to_rgba_no_colorcycle
raise ValueError("Invalid RGBA argument: {!r}".format(orig_c))
ValueError: Invalid RGBA argument: 6

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "runGTM.py", line 509, in <module>
prior=args.prior, do_interpolate=args.interpolate)
File "/home/user/anaconda3/lib/python3.7/site-packages/ugtm/ugtm_classes.py", line 459, in plot_multipanel
prior=prior)
File "/home/user/anaconda3/lib/python3.7/site-packages/ugtm/ugtm_plot.py", line 153, in plotMultiPanelGTM
edgecolor="black")
File "/home/user/anaconda3/lib/python3.7/site-packages/matplotlib/__init__.py", line 1810, in inner
return func(ax, *args, **kwargs)
File "/home/user/anaconda3/lib/python3.7/site-packages/matplotlib/axes/_axes.py", line 4245, in scatter
.format(nc=n_elem, xs=x.size, ys=y.size)
ValueError: 'c' argument has 1976 elements, which is not acceptable for use with 'x' with size 2504, 'y' with size 2504.
`
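The final ValueError reports 1976 colour values for 2504 points, which suggests that the label file (the noadmixed labels) and the data matrix (data/recoded_1000G.mat) do not describe the same set of individuals. A quick sanity check before running runGTM.py could look like the sketch below. It assumes that each file holds one individual per non-empty line and has no header row, which may not match the real file format; the helper names and the toy files are hypothetical.

```python
import tempfile
from pathlib import Path


def count_rows(path):
    """Count the non-empty lines of a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_row_counts(**files):
    """Raise ValueError if the given files do not have the same row count."""
    counts = {name: count_rows(path) for name, path in files.items()}
    if len(set(counts.values())) > 1:
        raise ValueError(f"Row counts differ: {counts}")
    return counts


# Toy demonstration: 3 individuals in the data file, only 2 labels.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp, "toy.mat")
    data_file.write_text("0,1\n1,2\n2,0\n")
    label_file = Path(tmp, "toy.lbls")
    label_file.write_text("A\nB\n")
    try:
        check_row_counts(data=data_file, labels=label_file)
    except ValueError as err:
        message = str(err)
        print(message)
```

If the counts differ, pairing the data matrix with the labels and ids derived from the same subset of individuals (for example, the noadmixed matrix with the noadmixed labels, as in the tutorial commands) should avoid the mismatch.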
