kipoi / kipoi-veff

Variant effect prediction plugin for Kipoi

Home Page: https://kipoi.org/veff-docs

License: MIT License

Python 47.23% Jupyter Notebook 52.25% Makefile 0.19% Shell 0.33%

kipoi-veff's Issues

indel support: dataloader-utility: VariantSeqExtractor

Definition of a class that generates mutated DNA sequences given an interval, a variant, and an anchor point.

class VariantSeqExtractor:
    def __init__(self, fasta_file):
        ...

    def extract(self, interval, variant, anchor, keep_length=True):
        """
        interval: define a new class Interval()
        variant: define a new class Variant() in analogy to cyvcf2
        keep_length: returned sequence will have the same length as the interval

            ...[AGAC|AGATG]C...
               [ interval ]
                    |  anchor point
                      GA -> G variant

            If not keep_length:
                return ("AGAC|AGATG", "AGACAGTG")
            else:
                return ("AGAC|AGATG", "AGACAGTGC")

        Returns
            A tuple of sequences with ref and alt
        """
        pass
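
The docstring example can be reproduced with plain strings. The following is a minimal sketch, not the proposed implementation: `mutate_interval`, its 0-based half-open coordinates, and the toy `genome` string are assumptions made for illustration, and it only handles a variant that lies inside the interval and downstream of the anchor, so any padding is taken from the right flank.

```python
def mutate_interval(genome, start, end, var_pos, ref, alt, keep_length=True):
    """Return (ref_seq, alt_seq) for the interval genome[start:end].

    Hypothetical helper: coordinates are 0-based, half-open, and the variant
    is assumed to lie inside the interval, to the right of the anchor.
    """
    if genome[var_pos:var_pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the genome")
    ref_seq = genome[start:end]
    # Splice the alternative allele into the interval sequence.
    alt_seq = genome[start:var_pos] + alt + genome[var_pos + len(ref):end]
    if keep_length:
        missing = len(ref_seq) - len(alt_seq)
        if missing > 0:
            # Deletion: pad with bases from the right flank of the interval.
            alt_seq += genome[end:end + missing]
        else:
            # Insertion (or SNV): trim back to the interval length.
            alt_seq = alt_seq[:len(ref_seq)]
    return ref_seq, alt_seq


# Toy genome containing the docstring example ...[AGACAGATG]C...
genome = "TTAGACAGATGCTT"
print(mutate_interval(genome, 2, 11, 7, "GA", "G", keep_length=False))
print(mutate_interval(genome, 2, 11, 7, "GA", "G", keep_length=True))
```

A full implementation would also need to handle variants upstream of the anchor (padding from the left flank) and variants that overlap the interval boundary.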

indel support: dataloader-utility: MutationDatasetMixin

A mixin for kipoi.data.*Dataset classes that defines a method returning model input for both alleles (and, optionally, their reverse complements).

The difference for the MutationDatasetMixin methods is that the key inputs is replaced with the keys inputs_ref and inputs_alt (optionally also inputs_ref_rc and inputs_alt_rc). These all have the identical structure, and their data corresponds to the reference and alternative allele (optionally also in reverse complement) of the model input data.

So in addition to the current dataloader output schema:

{ 
   "inputs": <some_obj>, 
   "targets": <some_obj>, 
   "metadata": {...}
}

there will be a method returning dictionaries of:

{ 
    "inputs_ref": <some_obj>, 
    "inputs_alt": <some_obj>, 
    "inputs_ref_rc": <some_obj>, 
    "inputs_alt_rc": <some_obj>, 
    "targets": <some_obj>, 
    "metadata": {...}
}

All relationships between inputs and metadata etc. have to hold identically for the newly defined inputs_* keys.
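
A rough sketch of how such a mixin could be laid out is given below. Everything in it is hypothetical: the method names `get_mutation_item` and `mutate_inputs`, the `ToyDataset` class, and the use of plain strings as inputs are assumptions for illustration and are not part of kipoi.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


class MutationDatasetMixin:
    """Hypothetical mixin: derives the inputs_* schema from a regular item."""

    def mutate_inputs(self, inputs, idx):
        # Subclasses decide how the alternative-allele input is built.
        raise NotImplementedError

    def get_mutation_item(self, idx, with_rc=False):
        item = self[idx]  # regular {"inputs", "targets", "metadata"} dict
        out = {
            "inputs_ref": item["inputs"],
            "inputs_alt": self.mutate_inputs(item["inputs"], idx),
            "targets": item["targets"],
            "metadata": item["metadata"],
        }
        if with_rc:
            out["inputs_ref_rc"] = reverse_complement(out["inputs_ref"])
            out["inputs_alt_rc"] = reverse_complement(out["inputs_alt"])
        return out


class ToyDataset(MutationDatasetMixin):
    """Toy dataset whose inputs are plain sequence strings."""

    def __init__(self, ref_seqs, alt_seqs):
        self.ref_seqs = ref_seqs
        self.alt_seqs = alt_seqs

    def __getitem__(self, idx):
        return {"inputs": self.ref_seqs[idx], "targets": None, "metadata": {"idx": idx}}

    def mutate_inputs(self, inputs, idx):
        return self.alt_seqs[idx]


item = ToyDataset(["ACGT"], ["ACTT"]).get_mutation_item(0, with_rc=True)
print(sorted(item))
```

Because targets and metadata are copied unchanged from the regular item, their relationship to each inputs_* entry stays the same as for inputs.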

Implementation timeline

Core functionality

  • Copy core variant effect prediction functionality from Kipoi
  • Resolve variant-effect related issues here (models with multidimensional outputs, output selection, etc.)
  • Implement different solutions of indel variant effect prediction

Interaction with Kipoi

  • CLI needs to be loadable from kipoi_veff package
  • yaml parsing classes need to be loadable from kipoi_veff package

dataloader_args

Hi,

I am trying to test the kipoi-veff CLI using the command from the Kipoi paper as an example, but I get an error:

Traceback (most recent call last):
  File "/Users/gisela/anaconda3/envs/kipoi-veff/bin/kipoi", line 10, in <module>
    sys.exit(main())
  File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi/__main__.py", line 104, in main
    command_fn(args.command, sys.argv[2:])
  File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi_veff/__main__.py", line 11, in cli_main
    kipoi_veff.cli.cli_main(command, raw_args)
  File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi_veff/cli.py", line 458, in cli_main
    command_fn(args.command, raw_args[1:])
  File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi_veff/cli.py", line 180, in cli_score_variants
    dataloader_arguments = parse_json_file_str_or_arglist(args.dataloader_args)
  File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi_utils/utils.py", line 282, in parse_json_file_str_or_arglist
    raise RuntimeError("wrong usage, dataloader_args must be a list")
RuntimeError: wrong usage, dataloader_args must be a list

My command:

kipoi veff score_variants DeepSEA --dataloader_args '{"fasta_file": "~/.vep/homo_sapiens/97_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa"}' -i 'input.vcf' -o 'output.vcf'

I've also tested:

kipoi veff score_variants DeepSEA --dataloader_args='{"fasta_file": "~/.vep/homo_sapiens/97_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa"}' -i 'input.vcf' -o 'output.vcf'

to no avail.

Speedup writing to file

- [x] buffer writes - #21 (e.g. don't write predictions to disk on every batch but only every now and then)

  • this should be implemented in #21

- [ ] use asynchronous writes

Here is the main loop performing:

  1. data-loading
  2. data preparation for the model
  3. model prediction
  4. writing predictions to file

https://github.com/kipoi/kipoi-veff/blob/master/kipoi_veff/snv_predict.py#L620-L658

    for i, batch in enumerate(tqdm(it)):
        ...
        # Step 1. load the data
        eval_kwargs = _generate_seq_sets(dataloader.output_schema, batch, vcf_fh, vcf_id_generator_fn,
                                         seq_to_mut=seq_to_mut, seq_to_meta=seq_to_meta,
                                         sample_counter=sample_counter, vcf_search_regions=vcf_search_regions,
                                         generate_rc=model_info_extractor.use_seq_only_rc,
                                         bed_id_conv_fh=bed_id_conv_fh)

        # Step 2.  data preparation for the model
        if generated_seq_writer is not None:
            for writer in generated_seq_writer:
                writer(eval_kwargs)
            # Assume that we don't actually want the predictions to be calculated...
            continue

        if evaluation_function_kwargs is not None:
            assert isinstance(evaluation_function_kwargs, dict)
            for k in evaluation_function_kwargs:
                eval_kwargs[k] = evaluation_function_kwargs[k]

        eval_kwargs["out_annotation_all_outputs"] = model_out_annotation


        # Step 3. Make model prediction
        res_here = evaluation_function(model, output_reshaper=out_reshaper, **eval_kwargs)
     
        ....

        # Step 4. write the predictions
        if sync_pred_writer is not None:
            for writer in sync_pred_writer:
                writer(res_here, eval_kwargs["vcf_records"], eval_kwargs["line_id"])
  • this is the main loop performing model prediction

- [ ] setup some standardized benchmarks to test the overhead
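
For the asynchronous-writes item, one possible shape is a background thread that drains a queue, so that step 4 of the loop only enqueues results. This is a sketch under assumptions, not code from kipoi-veff: `AsyncWriter` and its `write_fn` argument are hypothetical names, and real writers take several arguments rather than a single item.

```python
import queue
import threading


class AsyncWriter:
    """Hypothetical wrapper: hands items to a writer on a background thread."""

    _STOP = object()  # sentinel that tells the worker to finish

    def __init__(self, write_fn, maxsize=8):
        self._write_fn = write_fn
        # A bounded queue applies back-pressure if writing is slower than prediction.
        self._queue = queue.Queue(maxsize=maxsize)
        self._error = None
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            try:
                self._write_fn(item)
            except Exception as exc:  # remember the error, re-raise on close()
                self._error = exc

    def __call__(self, item):
        self._queue.put(item)

    def close(self):
        self._queue.put(self._STOP)
        self._thread.join()
        if self._error is not None:
            raise self._error


written = []
writer = AsyncWriter(written.append)
for batch_result in range(5):
    writer(batch_result)  # returns immediately unless the queue is full
writer.close()
print(written)
```

A single consumer thread keeps the output in the order the batches were produced, which matters when predictions are written back next to VCF records.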

Tasks

Follow this notebook: https://github.com/kipoi/kipoi-veff/blob/write_buffer/notebooks/code-profiling.ipynb

Finish the code in the write buffer PR so that writing takes as little time as possible.
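
The buffering idea itself can be sketched independently of the PR. The class below is an illustration only: `BufferedWriter`, `write_fn`, and `flush_every` are hypothetical names, and the code in the write buffer PR may be organized differently.

```python
class BufferedWriter:
    """Hypothetical wrapper: collects records and writes them in chunks."""

    def __init__(self, write_fn, flush_every=10):
        self.write_fn = write_fn        # called with a list of buffered records
        self.flush_every = flush_every  # number of records per write
        self._buffer = []

    def __call__(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if self._buffer:
            self.write_fn(self._buffer)
            self._buffer = []  # start a fresh list; the old one was handed off

    def close(self):
        # Write whatever is left at the end of the run.
        self.flush()


chunks = []
writer = BufferedWriter(chunks.append, flush_every=3)
for record in range(7):
    writer(record)
writer.close()
print(chunks)
```

Calling close() at the end of the prediction loop is essential, otherwise the last partial chunk is never written.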

Error when chromosome names don't start with `chr`

  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi_veff/snv_predict.py", line 795, in score_variants
    return_predictions=return_predictions)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi_veff/snv_predict.py", line 620, in predict_snvs
    for i, batch in enumerate(tqdm(it)):
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/tqdm/_tqdm.py", line 979, in __iter__
    for obj in iterable:
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 175, in __next__
    return self._process_next_batch(batch)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 195, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
pyfaidx.FetchError: Traceback (most recent call last):
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 639, in from_file
    i = self.index[rname]
KeyError: 'chr1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 58, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 58, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/dataloaders/sequence.py", line 350, in __getitem__
    ret = self.seq_dl[idx]
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/dataloaders/sequence.py", line 238, in __getitem__
    seq = self.fasta_extractors.extract(interval)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/extractors.py", line 50, in extract
    seq = str(self.fasta.get_seq(interval.chrom, interval.start + 1, interval.stop, rc=rc).seq)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 1032, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 624, in fetch
    seq = self.from_file(name, start, end)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 642, in from_file
    "Please check your FASTA file.".format(rname))
pyfaidx.FetchError: Requested rname chr1 does not exist! Please check your FASTA file.

minimal.vcf

##fileformat=VCFv4.0
##fileDate=20181110
##source=UKBB/variants.tsv.bgz_V3
##reference=GRCh37
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1    15791   1:15791_C_T     C       T       .       .       .
1    69487   1:69487_G_A     G       A       .       .       .
1    69569   1:69569_T_C     T       C       .       .       .
1    139853  1:139853_C_T    C       T       .       .       .
1    693731  1:693731_A_G    A       G       .       .       .

The FASTA file contained the correct chromosome names, e.g. >1...
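
The traceback shows chr1 being requested from a FASTA index that only knows 1. One possible workaround, shown here as a sketch and not as what kipoi-veff does internally, is to translate contig names before the lookup. The helper `match_chrom` is hypothetical; it only handles the chr prefix and not other naming differences such as chrM versus MT.

```python
def match_chrom(name, fasta_names):
    """Return the FASTA contig name that corresponds to `name`.

    Hypothetical helper: tries the name as given, then with the "chr"
    prefix removed or added.
    """
    if name in fasta_names:
        return name
    alternative = name[3:] if name.startswith("chr") else "chr" + name
    if alternative in fasta_names:
        return alternative
    raise KeyError("contig {!r} not found in FASTA index".format(name))


fasta_names = {"1", "2", "X"}            # e.g. the keys of a FASTA index
print(match_chrom("chr1", fasta_names))  # lookup that failed in the traceback
```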

Compatibility issues with installation of DeepSEA

I am trying to install kipoi-veff in conjunction with DeepSEA and seem to be having problems. I am using an AWS instance with the Deep Learning AMI. I simply activate the python3 + CUDA 10.0 environment and then install kipoi-veff using conda. This seems to install kipoi and kipoi-veff, but when I then try to install DeepSEA into the same environment

kipoi env install DeepSEA/predict --gpu

I am getting compatibility issues which eventually break things.

If I am trying to install everything from scratch, how would you recommend I install Kipoi, kipoi-veff, Basenji, and DeepSEA?

update ModelInfoExtractor

The refactoring of model dataloaders means that the input sequence length and shape can no longer be derived from the Dataloader class and must instead be derived from the model object.

The refactoring may also lead to a mismatch in which the dataloader returns only a numpy array in ['inputs'] while the model defines its input as a list or dictionary of length 1.
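
The structure mismatch could be bridged with a small adapter. The function below is a hypothetical sketch: `match_model_input_structure` is not an existing kipoi function, and the model input schema is represented by a plain list or dict. A tuple stands in for the numpy array so that the example has no dependencies.

```python
def match_model_input_structure(inputs, model_input_schema):
    """Wrap a bare array so it matches a length-1 list or dict schema.

    Hypothetical helper: inputs that are already a list or dict are
    returned unchanged.
    """
    if isinstance(inputs, (list, dict)):
        return inputs
    if isinstance(model_input_schema, list) and len(model_input_schema) == 1:
        return [inputs]
    if isinstance(model_input_schema, dict) and len(model_input_schema) == 1:
        (key,) = model_input_schema  # the single input name
        return {key: inputs}
    return inputs


array = (0.1, 0.2, 0.3)  # stand-in for a numpy array from the dataloader
print(match_model_input_structure(array, ["seq"]))
print(match_model_input_structure(array, {"seq": None}))
```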

indel support: predict_mutations_on_batch()

In analogy to predict_on_batch(), kipoi_veff should provide a predict_mutations_on_batch() function that uses the inputs_ref and inputs_alt (and optionally inputs_ref_rc and inputs_alt_rc) keys returned from the MutationDatasetMixin object to produce 2 (or 4) model prediction outputs.

This may be designed as a mixin for kipoi.model.Model classes:

class PredictMutationsMixin(object):
    def predict_mutations_on_batch(self, x):
        return [self.predict_on_batch(x[k]) for k in x if k.startswith("inputs_")]
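
A small variation on the mixin above, shown as a sketch: returning a dictionary keyed by the inputs_* names instead of a list keeps track of which prediction belongs to which allele. `KeyedPredictMutationsMixin` and `ToyModel` are hypothetical names, and the toy model simply scores GC content.

```python
class KeyedPredictMutationsMixin(object):
    """Hypothetical variant: one prediction per inputs_* key, kept by name."""

    def predict_mutations_on_batch(self, x):
        return {k: self.predict_on_batch(v)
                for k, v in x.items() if k.startswith("inputs_")}


class ToyModel(KeyedPredictMutationsMixin):
    """Stand-in for a model: predicts the GC fraction of each sequence."""

    def predict_on_batch(self, seqs):
        return [sum(base in "GC" for base in s) / len(s) for s in seqs]


batch = {
    "inputs_ref": ["GGCC", "ATAT"],
    "inputs_alt": ["GGCA", "ATAT"],
    "targets": None,
    "metadata": {},
}
preds = ToyModel().predict_mutations_on_batch(batch)
print(preds["inputs_ref"], preds["inputs_alt"])
```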

indel support overview

We envision two different ways to implement variant effect prediction for indels:

  1. A dataloader-based approach that relies on a dataloader method returning full sets of model inputs for the reference and the alternative (+ r.c.) allele of a variant.
  2. A wrapper-based approach similar to how SNVs are handled. Indels will be handled by generating short reference sequences for the individual alleles and modifying all dataloader input files accordingly. This is envisioned to be performed using g2gtools (there have been recent efforts to make it work with Python 3).

Results differ unexpectedly between `pipeline.predict` and `query_bed` with `ref` score

I'm using the DeepSEA/variantEffects model with MutationMap to try and find the most impactful mutations for a set of sequences. However, I've noticed a discrepancy between the predictions I get for the wild-type sequences when using pipeline.predict vs. query_bed with the ref score.

The pipeline.predict scores are generally high probabilities for CTCF in cell type A549, which is what I'd expect given that my bed file consists of ChIP-seq peaks for that TF/cell line pair. The ref scored predictions, on the other hand, tend to be really small. A few examples of the difference are:

Predictions from pipeline.predict: [0.96582216 0.03907571 0.09172686 0.19638198 0.13311383 0.64790106
 0.67463433 0.94372396 0.64945625 0.98188872]

		vs.

Predictions from `ref`-scored MutationMap.query_bed: [-9.167706593871117e-09, -3.85800376534462e-07, 2.0929292077198625e-08, 3.2794196158647537e-07, 3.2247044146060944e-08, 2.3366883397102356e-06, 0.0, 1.3096723705530167e-09, 1.1496013030409813e-09, 0.0]

I'm trying to understand whether this is expected, a result of something I'm doing wrong, or a bug. For context, my code is essentially the following (pared down to make it easier to see the essentials):

import kipoi
from kipoi_veff import MutationMap

random_seqs_fpath = "../dat/ChIPseq.A549.CTCF.100.random.narrowPeak.gz"
dl_kwargs = {'fasta_file': '../dat/hg19.fa'}
predict_dl_kwargs = {'fasta_file': '../dat/hg19.fa', 'intervals_file': random_seqs_fpath}
deepsea = kipoi.get_model("DeepSEA/variantEffects", source="kipoi")

# Predictions from MutationMap
mm = MutationMap(deepsea, deepsea.default_dataloader, dataloader_args=dl_kwargs)
mmp = mm.query_bed(random_seqs_fpath, scores=['ref', 'diff'])
mutation_map_predictions = [
    mmp.mutation_map[i]['seq']['ref']['A549_CTCF_None_720']['mutation_map']
    for i in range(len(mmp.mutation_map))
]
# I also do some post-processing to just get one of the non-zero values from the 4 x 1000 mutation map.

# Predictions from pipeline.predict
# (deepsea_predict is not defined in this pared-down snippet)
pipeline_predictions = deepsea_predict.pipeline.predict(predict_dl_kwargs, batch_size=100)

Note that both of these use the exact same bed file and therefore should be looking at the same sequences.

Am I missing some key reason why I should expect these two prediction arrays to differ dramatically? I am using the default rc merge settings for both (and that wouldn't account for the order-of-magnitude differences anyway). The two best ideas I have for why I might be getting such different results are:

  1. Contrary to my understanding of the docs, MutationMap.query_bed is re-centering on each currently-being-tested variant.
  2. ref doesn't mean what I think it does.

Docs improvement

  • don't include long tqdm printing to ipynb's
03:30,  1.44it/s]
304it [03:31,  1.44it/s]
305it [03:31,  1.44it/s]
306it [03:32,  1.44it/s]
307it [03:33,  1.44it/s]
...
  • split the individual long documents into multiple logical sections / units
    • split overview.md to the part relevant for contributing the model and part relevant for using the models
  • rename the .md documents (e.g. overview etc)

indel support: dataloader-utility:VcfReader

The VCF reader class is designed to return variant objects compatible with VariantSeqExtractor.extract. It can be used either as an iterator starting from the beginning of the VCF file or as an iterator over the variants that overlap a given genomic region. This functionality is all built into cyvcf2, so VcfReader is essentially a wrapper that converts VCF records to instances of our own variant class.
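
The two access patterns can be illustrated without cyvcf2. The sketch below parses VCF text lines directly, which is an assumption made to keep the example self-contained; the `Variant` tuple and the functions `parse_vcf_lines` and `fetch` are hypothetical, and a real VcfReader would delegate parsing and indexed region queries to cyvcf2.

```python
from collections import namedtuple

# Hypothetical minimal variant class (one record per alternative allele).
Variant = namedtuple("Variant", ["chrom", "pos", "id", "ref", "alt"])


def parse_vcf_lines(lines):
    """Yield Variant objects from VCF text lines, skipping header lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alts = line.split()[:5]
        for alt in alts.split(","):
            yield Variant(chrom, int(pos), vid, ref, alt)


def fetch(variants, chrom, start, end):
    """Yield variants overlapping the 1-based inclusive region chrom:start-end."""
    for v in variants:
        v_end = v.pos + len(v.ref) - 1  # last reference base covered
        if v.chrom == chrom and v.pos <= end and v_end >= start:
            yield v


vcf_lines = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t15791\t1:15791_C_T\tC\tT\t.\t.\t.",
    "1\t69487\t1:69487_G_A\tG\tA\t.\t.\t.",
    "1\t69569\t1:69569_T_C\tT\tC\t.\t.\t.",
]
variants = list(parse_vcf_lines(vcf_lines))  # iterate from the start of the file
print([v.id for v in fetch(variants, "1", 69000, 70000)])  # region query
```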

Getting a NotImplementedError: "sortBed" does not appear to be installed or on the path... error when running score_variants

Hey there,

I'm pretty new to Kipoi and was exploring this particular plugin. Unfortunately, whenever I try to run the score_variants function, I get the following error. I'm not sure if this is something others have faced in the past, but I would be grateful if you could suggest a possible solution! Thank you!

Error:

NotImplementedError: "sortBed" does not appear to be installed or on the path, so this method is disabled. Please install a more recent version of BEDTools and re-import to use this method.
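
The message indicates that the sortBed executable from BEDTools could not be found on the PATH of the Python process. A quick diagnostic sketch using only the standard library is shown below; `find_missing_tools` is a hypothetical helper, and the list of tool names is an assumption based on the error message.

```python
import shutil


def find_missing_tools(tools=("bedtools", "sortBed")):
    """Return the names of executables that are not on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("BEDTools executables found")
```

If the tools are reported missing inside the environment used for kipoi-veff, BEDTools probably needs to be installed into, or made visible to, that same environment.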
