statbiophys / sonnia

License: GNU General Public License v3.0

Python 100.00%

sonnia's Issues

NaNs obtained when training the model with very different pre and post dataset sizes

Description of the issue:

When training the model with n_pre "pre" sequences and n_post "post" sequences, if n_pre >> n_post or n_post >> n_pre, training completes without any error, but the loss and likelihood values (for both the training and validation sets) are all NaN.

Minimal example of the issue:

In this example I generate the pre and post sequences using soNNia itself, but the same happened with my actual dataset. I only show the case n_pre >> n_post, but the case n_pre << n_post also gives NaNs.

from sonnia.sonnia import SoNNia

# instantiate model
qm = SoNNia()

# prepare some example sequences
n_tot = 100000
qm.add_generated_seqs(num_gen_seqs = n_tot, reset_gen_seqs = True)

# prepare pre and post data
n_pre = 100000
n_post = 1000
seqs_post = qm.gen_seqs[:n_post]
seqs_pre = qm.gen_seqs[-n_pre:]

# model training (here I get nans)
qm.infer_selection(epochs = 20, batch_size = 1000, validation_split=0.05, verbose=1)

Error message:

No specific error message is given. The NaNs simply appear in the printed output of qm.infer_selection (with verbose=1).
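Because training finishes silently, a small check on the reported loss values can flag the problem early. This is only a sketch: `loss_history` is a hypothetical list of per-epoch loss values collected by the user, not an attribute that soNNia is known to expose.

```python
import math

def first_nan_epoch(loss_history):
    """Return the index of the first epoch whose loss is NaN, or None."""
    for epoch, loss in enumerate(loss_history):
        if math.isnan(loss):
            return epoch
    return None

# Hypothetical per-epoch losses, for illustration only.
example_losses = [0.69, 0.65, float('nan'), float('nan')]
nan_epoch = first_nan_epoch(example_losses)
```

Here `nan_epoch` is the first epoch at which the loss stopped being a number, which helps tell apart a failure from the first batch and one that develops during training.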

Missing TRA V/J combinations in generated repertoires

Hi, I recently used soNNia to generate large numbers of post-selection TRA and TRB sequences (10 million each) with your default models for human_T_alpha and human_T_beta, respectively. For TRA, we noticed that hundreds of V/J gene combinations are never generated, even though the V and J genes involved are each generated at high rates individually. One example is TRAV3 with TRAJ41: the V frequency is 0.3% and the J frequency is 1.6%, so if V and J were chosen independently the expected pair frequency would be around 0.0048%. Out of 10 million generated TRA sequences you would therefore expect roughly 480 with this combination, but we did not see any.
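The arithmetic behind that expectation can be written out as follows. It assumes V and J usage are independent, which is an assumption of this back-of-the-envelope estimate and not necessarily a property of the model; the Poisson step is likewise only an approximation for how surprising a count of zero would be.

```python
import math

v_freq = 0.003            # TRAV3 usage: 0.3% (value from the report above)
j_freq = 0.016            # TRAJ41 usage: 1.6% (value from the report above)
n_generated = 10_000_000  # number of generated TRA sequences

# Expected pair frequency if V and J were drawn independently.
pair_freq = v_freq * j_freq

# Expected number of sequences carrying this V/J pair.
expected_count = n_generated * pair_freq

# Poisson approximation: probability of observing zero such sequences.
p_zero = math.exp(-expected_count)

print(f'pair frequency: {pair_freq:.6%}')
print(f'expected count: {expected_count:.0f}')
print(f'P(count = 0) under independence: {p_zero:.3g}')
```

Under these assumptions a count of zero is essentially impossible by chance, so the absence points to the model assigning this pair zero (or near-zero) probability.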

For this example (TRAV3 and TRAJ41), we checked VDJdb and found 10 published human TRA chains with this combination. We also found VDJdb entries for several other missing V/J combinations, suggesting they are not biologically forbidden.

Do you have an explanation for this? Is it an intentional feature of your model, or is it a technical artifact, for example because the TRA sequences used to train soNNia did not contain that particular combination?

Thanks so much for any help!

Feature request: consolidating chain_type and custom_pgen_model into one argument: pgen_model

It has never fully made sense to me why there are two arguments for essentially the same thing. chain_type selects one of the default OLGA models, whereas custom_pgen_model points to a custom OLGA model. In the past, loading a SONIA/SoNNia model didn't guarantee that the OLGA model in the SONIA/SoNNia folder would be loaded, so custom_pgen_model also had to be set to the SONIA folder. That appears to be remedied in this iteration of the software. However, I think it would be cleaner if chain_type and custom_pgen_model were merged into a single keyword: pgen_model.

The class initialization definition would appear as:

from typing import Iterable, List, Optional, Tuple
class Sonia(object):
    def __init__(self,
                 features: List[Iterable[str]] = [],
                 data_seqs: List[Iterable[str]] = [],
                 gen_seqs: List[Iterable[str]] = [],
                 load_dir: Optional[str] = None,
                 pgen_model: Optional[str] = None, # pgen_model keyword parameter here
                 max_depth: int = 25,
                 max_L: int = 30,
                 load_seqs: bool = True,
                 l2_reg: float = 0.,
                 l1_reg: float = 0.,
                 min_energy_clip: int = -5,
                 max_energy_clip: int = 10,
                 seed: Optional[int] = None,
                 vj: bool = False,
                 processes: Optional[int] = None,
                 objective: str = 'BCE',
                 gamma: int = 1,
                 include_joint_genes: bool = False,
                 joint_vjl: bool = False,
                 include_aminoacids: bool = True
                ) -> None:
        # Other initializations unchanged.
        if load_dir is None and pgen_model is None:
            raise ValueError('load_dir and pgen_model cannot both be None.')
        # Use pgen_model in the SONIA/SoNNia model folder.
        if pgen_model is None:
            self.pgen_model = load_dir
        else:
            self.pgen_model = pgen_model

utils.define_pgen_model is used to load the pgen model and contains the necessary checks:

import os

import olga.load_model as olga_load_model
import olga.generation_probability as pgen
import olga.sequence_generation as seq_gen

DEFAULT_CHAIN_TYPES = {'humanTRA': 'human_T_alpha', 'human_T_alpha': 'human_T_alpha',
                       'humanTRB': 'human_T_beta', 'human_T_beta': 'human_T_beta',
                       'humanIGH': 'human_B_heavy', 'human_B_heavy': 'human_B_heavy',
                       'humanIGK': 'human_B_kappa', 'human_B_kappa': 'human_B_kappa',
                       'humanIGL': 'human_B_lambda', 'human_B_lambda': 'human_B_lambda',
                       'mouseTRB': 'mouse_T_beta', 'mouse_T_beta': 'mouse_T_beta',
                       'mouseTRA': 'mouse_T_alpha', 'mouse_T_alpha': 'mouse_T_alpha'}

def define_pgen_model(pgen_model: str,
                      vj: bool = False,
                      return_files: bool = False
                     ):
    if pgen_model in DEFAULT_CHAIN_TYPES:
        pgen_model = DEFAULT_CHAIN_TYPES[pgen_model]
        filedir = os.path.dirname(os.path.abspath(__file__))
        main_folder = os.path.join(filedir, 'default_models', pgen_model)
    elif os.path.isdir(pgen_model):
        main_folder = pgen_model
    else:
        options = f'{list(DEFAULT_CHAIN_TYPES.keys())[::2]}'[1:-1]
        raise ValueError('pgen_model is neither a directory that exists '
                         f'nor one of the default options ({options}). '
                         'Try using one of the default options or an existing '
                         'directory containing the pgen model.')

    pgen_files = ('model_params.txt', 'model_marginals.txt',
                  'V_gene_CDR3_anchors.csv', 'J_gene_CDR3_anchors.csv')
    # List the resolved folder: pgen_model may be a default name, not a path.
    files_in_dir = set(os.listdir(main_folder))
    missing_files = set(pgen_files) - files_in_dir

    if len(missing_files) > 0:
        missing_files = f'{missing_files}'[1:-1]
        raise RuntimeError('The pgen model cannot be loaded. The following files '
                           f'are missing: {missing_files}.')

    params_file_name = os.path.join(main_folder, pgen_files[0])
    marginals_file_name = os.path.join(main_folder, pgen_files[1])
    V_anchor_pos_file = os.path.join(main_folder, pgen_files[2])
    J_anchor_pos_file = os.path.join(main_folder, pgen_files[3])

    if vj:
        model_str = 'VJ'
    else:
        model_str = 'VDJ'

    genomic_data = getattr(olga_load_model, f'GenomicData{model_str}')()
    genomic_data.load_igor_genomic_data(params_file_name,
                                        V_anchor_pos_file,
                                        J_anchor_pos_file)
    generative_model = getattr(olga_load_model, f'GenerativeModel{model_str}')()
    generative_model.load_and_process_igor_model(marginals_file_name)
    pgen_model = getattr(pgen, f'GenerationProbability{model_str}')(generative_model, genomic_data)
    seqgen_model = getattr(seq_gen, f'SequenceGeneration{model_str}')(generative_model, genomic_data)

    to_return = (genomic_data, generative_model, pgen_model, seqgen_model)

    if return_files:
        return to_return + (params_file_name, marginals_file_name, V_anchor_pos_file, J_anchor_pos_file)
    return to_return

I show these two examples to give an idea of what this would look like. Corrections elsewhere in the software would be made accordingly if accepted.

I think this would improve the software and remove a source of confusion that has, at the very least, affected me.
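The lookup-and-check step of the proposal can be tried without OLGA installed. The sketch below is a stripped-down stand-in, not the library's code: `resolve_pgen_folder`, `DEFAULT_NAMES` and `defaults_root` are hypothetical names introduced for illustration, and the demo uses empty placeholder files in a temporary directory.

```python
import os
import tempfile

# Hypothetical subset of the default names, for illustration only.
DEFAULT_NAMES = {'humanTRB': 'human_T_beta', 'human_T_beta': 'human_T_beta'}
REQUIRED_FILES = ('model_params.txt', 'model_marginals.txt',
                  'V_gene_CDR3_anchors.csv', 'J_gene_CDR3_anchors.csv')

def resolve_pgen_folder(pgen_model, defaults_root):
    """Map a default name or an existing directory to a model folder."""
    if pgen_model in DEFAULT_NAMES:
        folder = os.path.join(defaults_root, DEFAULT_NAMES[pgen_model])
    elif os.path.isdir(pgen_model):
        folder = pgen_model
    else:
        raise ValueError(f'{pgen_model!r} is neither a default name '
                         'nor an existing directory.')
    # Check the resolved folder, not the raw argument.
    missing = set(REQUIRED_FILES) - set(os.listdir(folder))
    if missing:
        raise RuntimeError(f'Missing pgen files: {sorted(missing)}')
    return folder

# Demo: build a fake default model folder with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, 'human_T_beta')
    os.makedirs(folder)
    for name in REQUIRED_FILES:
        open(os.path.join(folder, name), 'w').close()
    resolved_by_name = resolve_pgen_folder('humanTRB', root)
    resolved_by_path = resolve_pgen_folder(folder, root)
    same = resolved_by_name == resolved_by_path
```

With a single argument, a short name and an explicit path both resolve to the same folder, which is the behaviour the merged pgen_model keyword is meant to provide.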

how to compare repertoires

Hello,
Is it currently possible to compare repertoires using the JSD function implemented in your pipeline?
I see that you have a method to compute it; however, I can't seem to find any explanation of how to use it.
Thanks!
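For reference, here is what a Jensen-Shannon divergence between two repertoires looks like when computed on a simple summary such as gene usage. This is a generic, standard-library sketch and not soNNia's own JSD method, whose interface is not described here; the function names and the toy gene lists are hypothetical.

```python
import math
from collections import Counter

def usage_distribution(items):
    """Turn a list of labels (e.g. V gene names) into frequencies."""
    counts = Counter(items)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}

def jensen_shannon_divergence(p, q):
    """JSD in bits between two discrete distributions given as dicts."""
    jsd = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            jsd += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            jsd += 0.5 * qk * math.log2(qk / mk)
    return jsd

# Toy V gene usage for two hypothetical repertoires.
rep_a = usage_distribution(['TRBV5', 'TRBV5', 'TRBV7', 'TRBV9'])
rep_b = usage_distribution(['TRBV5', 'TRBV7', 'TRBV7', 'TRBV9'])
divergence = jensen_shannon_divergence(rep_a, rep_b)
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no overlap), so it is easy to compare across pairs of repertoires.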
