
jmschrei / avocado


Avocado is a multi-scale deep tensor factorization model that learns a latent representation of the human epigenome and enables imputation of epigenomic experiments that have not yet been performed.

License: Other

Jupyter Notebook 88.85% Python 11.15%

avocado's People

Contributors

jmschrei

avocado's Issues

Cannot import name 'Adam' from 'keras.optimizers'

I tried to install and use avocado on Google Colab, but I got this error when running from avocado import Avocado:

/usr/local/lib/python3.7/dist-packages/avocado/model.py in <module>()
     21 from keras.layers import Multiply, Dot, Flatten, concatenate
     22 from keras.models import Model
---> 23 from keras.optimizers import Adam
     24 
     25 def build_model(n_celltypes, n_celltype_factors, n_assays, n_assay_factors,

ImportError: cannot import name 'Adam' from 'keras.optimizers' (/usr/local/lib/python3.7/dist-packages/keras/optimizers.py)

Can we use from tensorflow.keras.optimizers import Adam instead?
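A minimal sketch of the fallback import suggested above, assuming the installed Keras no longer exposes Adam at keras.optimizers while tensorflow.keras.optimizers does. The helper name load_adam is hypothetical and is not part of avocado.

```python
def load_adam():
    """Return the Adam optimizer class from whichever Keras flavour is importable.

    Hypothetical helper: tries standalone Keras first, then the copy
    bundled with TensorFlow.
    """
    try:
        from keras.optimizers import Adam
    except ImportError:
        # Raised on Keras versions that do not export Adam from keras.optimizers.
        from tensorflow.keras.optimizers import Adam
    return Adam
```

Calling load_adam() in place of the import on line 23 of avocado/model.py would keep older installations working; if neither package provides Adam, the second ImportError propagates unchanged.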

Validation errors are too high

Hello,

I train the model on the training data, holding out one experiment as the validation set.
However, the validation MSE is very high even though the training MSE is quite low.

Here is my code and the result pictures:

from __future__ import print_function

import os
import sys
import time
import numpy, itertools
from avocado import Avocado
import matplotlib

matplotlib.use("agg")
import matplotlib.pyplot as plt

celltypes = ['E003', 'E017', 'E065', 'E116', 'E117']
assays = ['H3K4me3', 'H3K27me3', 'H3K36me3', 'H3K9me3', 'H3K4me1']

data = {}
for celltype, assay in itertools.product(celltypes, assays):
    if celltype == 'E003'  and  assay == 'H3K4me3':
        continue
    filename = '/home/ey712185/data/{}.{}.pilot.arcsinh.npz'.format(celltype, assay)
    data[(celltype, assay)] = numpy.load(filename)['arr_0']

model = Avocado(celltypes, assays)

start_time = time.time()

data_validation = {}
filename_v = '/home/ey712185/data/E003.H3K4me3.pilot.arcsinh.npz'
data_validation[(celltype, assay)] = numpy.load(filename_v)['arr_0']

history = model.fit(data, data_validation, n_epochs = 600)

end_time = time.time()

running_time = (end_time - start_time) / 3600.0

print("running time {}".format(running_time))

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.savefig("avocado_cy_E003.pdf")

model.save("avocado_cy")

The first time I held out (E065, H3K4me3) for validation; the second time I held out (E003, H3K4me3).
The validation MSE is always about 0.5 to 0.6, while the training MSE is below 0.05.

[Figure: loss curves with (E065, H3K4me3) held out]

[Figure: loss curves with (E003, H3K4me3) held out]

Thanks in advance!
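One detail in the script above is worth ruling out: data_validation is keyed with celltype and assay, which are the loop variables left over from the for loop. In Python those still hold the last combination the loop visited, not the held-out experiment. The sketch below reproduces only the key handling, without avocado or any data files; whether this explains the high validation MSE is an assumption, not something the issue confirms.

```python
import itertools

celltypes = ['E003', 'E017', 'E065', 'E116', 'E117']
assays = ['H3K4me3', 'H3K27me3', 'H3K36me3', 'H3K9me3', 'H3K4me1']
held_out = ('E003', 'H3K4me3')

training_keys = []
for celltype, assay in itertools.product(celltypes, assays):
    if (celltype, assay) == held_out:
        continue  # skip the experiment reserved for validation
    training_keys.append((celltype, assay))

# The loop variables keep their final values after the loop ends.
leftover_key = (celltype, assay)
print(leftover_key)  # ('E117', 'H3K4me1')

# Keying the validation dictionary explicitly avoids the mix-up.
validation_keys = [held_out]
```

With the leftover key, the E003 H3K4me3 signal would be scored against the model's prediction for E117 H3K4me1, an experiment that is also in the training set.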

Information/documentation on data used to train model and required inputs

Hello,

I have 2 questions:

  1. Is there documentation on how you created the data used to train the available models (i.e., what the npy file format is, which files were used as input, and any code that performs the transformation)? This would be especially useful for applying Avocado to a new dataset.
  2. Is the transformed data used to train the full models available somewhere? The data in /data is not sufficient to train anything more than a toy model.

Thanks in advance.

Cannot run test case

When I run the command model.fit(data, n_epochs=10, epoch_size=100) I get the following error:

/media/data/daquang/Software/avocado/avocado/model.py:443: UserWarning: Update your fit_generator call to the Keras 2 API: fit_generator(<generator..., 100, 10, workers=1, verbose=1, callbacks=None, use_multiprocessing=True)
callbacks=callbacks, **kwargs)
Epoch 1/10
Traceback (most recent call last):
File "/home/daquang/anaconda3/lib/python3.6/site-packages/keras/utils/data_utils.py", line 677, in _data_generator_task
generator_output = next(self._generator)
File "/media/data/daquang/Software/avocado/avocado/io.py", line 55, in sequential_data_generator
value[i] = tracks[idx][genomic_25bp_idxs[i]]
TypeError: 'dict_values' object does not support indexing
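This TypeError is typical of Python 2 code running under Python 3, where dict.values() returns a view that cannot be indexed. The sketch below assumes that tracks in io.py is built from a dictionary's values(); the traceback does not show that line, so treat it as a guess.

```python
# Toy stand-in for the data dictionary passed to model.fit.
data = {
    ('E003', 'H3K4me1'): [0.1, 0.2, 0.3],
    ('E017', 'H3K4me1'): [0.4, 0.5, 0.6],
}

tracks = data.values()  # a dict_values view in Python 3
try:
    tracks[0]
    view_is_indexable = True
except TypeError:
    view_is_indexable = False

# Materialising the view as a list restores positional indexing.
tracks = list(data.values())
first_track = tracks[0]
```

Wrapping the values() call in list(...) is the usual one-line fix when porting such code.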

Can you also include a script to convert raw files (eg bigwigs) into the appropriate npz files? This will be useful if we want to run our own models.
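Until such a script exists, here is a rough sketch of the binning and transform step. It assumes the tracks are 25 bp averages passed through arcsinh, which is inferred from the ".arcsinh.npz" filenames and the 25 bp question in other issues here, not from documented Avocado preprocessing. Reading the bigWig itself is left out; the function bin_and_transform and the toy input are hypothetical.

```python
import io
import numpy

def bin_and_transform(per_base_signal, bin_size=25):
    """Average a per-base signal into fixed-size bins, then apply arcsinh."""
    signal = numpy.nan_to_num(numpy.asarray(per_base_signal, dtype="float64"))
    n_bins = len(signal) // bin_size  # the incomplete final bin is dropped
    binned = signal[:n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)
    return numpy.arcsinh(binned).astype("float32")

# Toy per-base signal standing in for values read from a bigWig file.
toy_signal = numpy.ones(60)
track = bin_and_transform(toy_signal)

# numpy.savez stores a positional array under the key 'arr_0',
# the key the example scripts in these issues read back.
buffer = io.BytesIO()  # stand-in for an output path on disk
numpy.savez(buffer, track)
buffer.seek(0)
restored = numpy.load(buffer)["arr_0"]
```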

Confirmation of n_genomic_positions

Hey Jacob,

Just looking for a quick confirmation on the pilot region size used to train Avocado. I can see the number of genomic positions (n_genomic_positions) is 1126469 but is each position a 25 bp average? So then the total region passed to the model in one go is 28,161,725 bps? I think this makes sense since you divide by 10 and 200 for the 250bp and 5kbp embeddings.

Cheers,
Alan.
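The arithmetic in the question can be checked with a few lines. This assumes each position is a 25 bp bin, as the question proposes; how the model rounds the coarser position counts is not confirmed here, so plain floor division is used.

```python
n_genomic_positions = 1126469
bin_size_bp = 25

# Total span covered if every position is one 25 bp bin.
total_bp = n_genomic_positions * bin_size_bp

# Number of 25 bp bins per coarser position.
bins_per_250bp = 250 // bin_size_bp
bins_per_5kbp = 5000 // bin_size_bp

# Whole coarser positions that fit in the region.
n_250bp_positions = n_genomic_positions // bins_per_250bp
n_5kbp_positions = n_genomic_positions // bins_per_5kbp

print(total_bp, bins_per_250bp, bins_per_5kbp)
```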

After adding new cell types, the size of the new model file is smaller than that of the corresponding ENCODE model file

I have run into a problem. I want to add my own NPC cell types (e.g., C15, C17, C17C ... X2117) to the existing models. After training, I found that the new model file is smaller than the corresponding ENCODE model file.
I also tried loading the newly generated model file and confirmed that the NPC cell types were indeed added to the model.

The following is my code for training the model.

import os, sys

os.environ["THEANO_FLAGS"] = "device=cuda0"
import matplotlib.pyplot as plt
import seaborn

seaborn.set_style("whitegrid")
import itertools
import numpy

numpy.random.seed(0)
from avocado import Avocado

import pandas as pd
import argparse
import math


parser = argparse.ArgumentParser(description="Train a new model")
parser.add_argument(
    "chrom", type=str, help="Specify the chromosome that training is performed in"
)
parser.add_argument(
    "--chromSize",
    action="store",
    dest="chromSize",
    type=str,
    default="./hg38.chrom.sizes",
    help="The file storing the chrom size information",
)
parser.add_argument(
    "--batchsize",
    action="store",
    dest="batchsize",
    type=int,
    default=40000,
    help="Batch size for neural network predictions.",
)
args = parser.parse_args()

chrom_size = pd.read_table(args.chromSize, sep="\t", names=["chr", "size"])
chrom_size.set_index(["chr"], inplace=True)

celltypes = [
    "C15",
    "C17",
    "C17C",
    "C666-1",
    "NP460",
    "NP460_EBV",
    "NP69",
    "NP69_EBV",
    "NPC23",
    "NPC32",
    "NPC43",
    "NPC43noEBV",
    "NPC53",
    "NPC76",
    "X2117",
]
assays = [
    "ChIP-seq_H3K27ac_signal_p-value",
    "ChIP-seq_H3K4me1_signal_p-value",
    "ChIP-seq_H3K4me3_signal_p-value",
]

data = {}
for celltype, assay in itertools.product(celltypes, assays):
    filename = (
        "./signals/{}/{}/{}.{}.pval.signal.bw.{}.npz".format(celltype, assay.split("_")[1], celltype, assay.split("_")[1], args.chrom)
    )
    print(filename)
    data[(celltype, assay)] = numpy.load(filename)[args.chrom]

model = Avocado.load("./avocado/.encode2018core-model/avocado-" + args.chrom)
size = chrom_size.loc[args.chrom]["size"]
model.fit_celltypes(data, epoch_size=math.ceil(size / args.batchsize), n_epochs=200)

model.save("./model/NPC_" + args.chrom)

Error in loading pre-trained data

Hello,
When I run the code below, I get the following error.
I have already downloaded the corresponding model file and installed TensorFlow.
Has anyone had the same problem, and how did you deal with it?

from avocado import Avocado
Using TensorFlow backend.
model = Avocado.load('avocado-chr19')
WARNING:tensorflow:From /home/ying/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ying/Cluster/home/avocado/model.py", line 1030, in load
**d)
File "/home/ying/Cluster/home/avocado/model.py", line 231, in __init__
freeze_network=freeze_network)
File "/home/ying/Cluster/home/avocado/model.py", line 53, in build_model
genome_250bp = Flatten()(genome_250bp_embedding(genome_250bp_input))
File "/home/ying/anaconda3/lib/python3.7/site-packages/keras/engine/base_layer.py", line 431, in __call__
self.build(unpack_singleton(input_shapes))
File "/home/ying/anaconda3/lib/python3.7/site-packages/keras/layers/embeddings.py", line 109, in build
dtype=self.dtype)
File "/home/ying/anaconda3/lib/python3.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/ying/anaconda3/lib/python3.7/site-packages/keras/engine/base_layer.py", line 249, in add_weight
weight = K.variable(initializer(shape),
File "/home/ying/anaconda3/lib/python3.7/site-packages/keras/initializers.py", line 112, in __call__
dtype=dtype, seed=self.seed)
File "/home/ying/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 4139, in random_uniform
dtype=dtype, seed=seed)
File "/home/ying/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/random_ops.py", line 247, in random_uniform
rnd = gen_random_ops.random_uniform(shape, dtype, seed=seed1, seed2=seed2)
File "/home/ying/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/gen_random_ops.py", line 777, in random_uniform
name=name)
File "/home/ying/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 610, in _apply_op_helper
param_name=input_name)
File "/home/ying/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 60, in _SatisfiesTypeConstraint
", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: Value passed to parameter 'shape' has DataType float32 not in list of allowed values: int32, int64
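A shape containing a float32 usually means a dimension was computed with true division. The sketch below assumes the embedding size is derived by dividing the number of genomic positions by 10, which Python 3 turns into a float; the traceback does not show that computation, so this is a possible cause, not a confirmed one. NumPy stands in for the backend here, and the width of 4 is arbitrary.

```python
import numpy

n_genomic_positions = 1126469

float_dim = n_genomic_positions / 10   # true division yields a float in Python 3
int_dim = n_genomic_positions // 10    # floor division keeps an int

# Array shapes must be integers, so the float dimension is rejected.
try:
    numpy.zeros((float_dim, 4))
    float_shape_accepted = True
except TypeError:
    float_shape_accepted = False

weights = numpy.zeros((int_dim, 4))
```

If this is the cause, using // or wrapping the value in int(...) where the embedding layer is built would resolve it.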

Tensorflow Integration

It looks like TensorFlow is not listed as a package requirement, although the package fails to run without it. Additionally, if TensorFlow is installed, line 23 in model.py throws an error about Adam not being in the keras package.

I commented out the Adam line and once I did it ran without errors. However, I run out of memory when trying to impute some histone data. Specifically, it fails to set batch data. I am not sure why this is happening since I am not training a new model.

Non-human species

Can this software be used for non-human species, e.g. cattle? If so, how can I build a pre-trained model? And can the human model be extended to other species?

Num of parameters

If I set n_genomic_positions to the length of the chromosome and leave n_25bp_factors at its default of 25, then the number of parameters in this layer will be 25 * len(chr), which is really large. Should I train these parameters on the pilot region only? If so, how do I adapt them to the whole chromosome, given that the number of parameters differs (25 * len(chr) vs. 25 * len(pilot))? What is the right way to do this?

Thanks.
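For scale, the size of such an embedding is the number of positions times the number of factors. The sketch below uses the pilot-region position count quoted in another issue here and a made-up 50 Mbp chromosome; it assumes positions are 25 bp bins, so n_genomic_positions is the chromosome length divided by 25. The function name is hypothetical.

```python
def embedding_parameters(n_positions, n_factors):
    """Number of weights in an embedding with one factor vector per position."""
    return n_positions * n_factors

n_25bp_factors = 25

# Pilot region: position count quoted in another issue on this page.
pilot_positions = 1126469
pilot_params = embedding_parameters(pilot_positions, n_25bp_factors)

# Hypothetical 50 Mbp chromosome, binned at 25 bp.
hypothetical_chrom_bp = 50_000_000
chrom_positions = hypothetical_chrom_bp // 25
chrom_params = embedding_parameters(chrom_positions, n_25bp_factors)

print(pilot_params, chrom_params)
```

Under that assumption, 25 factors on 25 bp bins works out to roughly one parameter per base pair of the region.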
