
lucidrains / enformer-tensorflow-sonnet-training-script

The full training script for Enformer - Tensorflow Sonnet

License: Apache License 2.0

Python 100.00%

enformer-tensorflow-sonnet-training-script's Introduction

Enformer TPU training script (wip)

The full training script for Enformer (TensorFlow Sonnet) on TPU clusters, written as part of an effort to migrate the model to PyTorch.

This was pieced together from the DeepMind Enformer repository, the Colab training notebook, and the Basenji sequence augmentation code.

It accounts for:

  1. distributed TPU training
  2. distributed datasets
  3. distributed validation
  4. gradient clipping
  5. cross replica batchnorms
  6. dataset augmentation
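
As an illustration of item 6, here is a minimal sketch of sequence augmentation. It assumes the two augmentations are a small random shift and a random reverse complement, as commonly done in Basenji-style pipelines. It works on plain DNA strings rather than the one-hot tensors the actual script would use, and the function names (`reverse_complement`, `shift_sequence`, `augment`) are hypothetical, not taken from the script.

```python
import random
from typing import Optional, Tuple

# Complement table for DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def shift_sequence(seq: str, shift: int, pad: str = "N") -> str:
    """Shift a sequence by `shift` positions, padding with `pad` to keep its length.

    A positive shift moves bases to the right (padding on the left);
    a negative shift moves them to the left (padding on the right).
    """
    # Clamp so the result always has the same length as the input.
    shift = max(-len(seq), min(len(seq), shift))
    if shift > 0:
        return pad * shift + seq[: len(seq) - shift]
    if shift < 0:
        return seq[-shift:] + pad * (-shift)
    return seq


def augment(
    seq: str, max_shift: int = 3, rng: Optional[random.Random] = None
) -> Tuple[str, bool]:
    """Apply a random shift, then reverse-complement with probability 0.5.

    Returns the augmented sequence and a flag telling the caller whether the
    sequence was reverse-complemented (max_shift=3 is an assumed default).
    """
    rng = rng or random.Random()
    seq = shift_sequence(seq, rng.randint(-max_shift, max_shift))
    flipped = rng.random() < 0.5
    if flipped:
        seq = reverse_complement(seq)
    return seq, flipped
```

When a sequence is reverse-complemented, the target tracks have to be reversed along the position axis as well, which is why `augment` returns the flag.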

Training takes about 3 days on a TPU v3-64.

Downloading sequence data for extending context length to 196,608

$ gsutil cp gs://basenji_barnyard/hg38.ml.fa.gz ./ && gunzip hg38.ml.fa.gz
$ gsutil cp gs://basenji_barnyard/mm10.ml.fa.gz ./ && gunzip mm10.ml.fa.gz
$ gsutil cp gs://basenji_barnyard/data/human/sequences.bed ./human-sequences.bed
$ gsutil cp gs://basenji_barnyard/data/mouse/sequences.bed ./mouse-sequences.bed
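
One way to use the downloaded BED files is to widen each interval symmetrically around its center until it spans 196,608 bp, then read the longer sequence from the FASTA file. The sketch below only does the coordinate arithmetic. It assumes tab-separated BED lines with chromosome, start and end in the first three columns, and the helper names (`parse_bed_line`, `extend_interval`, `clip_to_chromosome`) are hypothetical, not part of the script.

```python
TARGET_LENGTH = 196_608  # context length named in the heading above


def parse_bed_line(line):
    """Split one BED line into (chrom, start, end, remaining_columns)."""
    chrom, start, end, *rest = line.rstrip("\n").split("\t")
    return chrom, int(start), int(end), rest


def extend_interval(start, end, target_length=TARGET_LENGTH):
    """Return a [new_start, new_end) window of target_length bp
    centered on the midpoint of the original [start, end) interval."""
    center = (start + end) // 2
    new_start = center - target_length // 2
    return new_start, new_start + target_length


def clip_to_chromosome(start, end, chrom_length):
    """Clip a window to [0, chrom_length) and report how many bases
    must be padded on each side to restore the requested length."""
    left_pad = max(0, -start)
    right_pad = max(0, end - chrom_length)
    return max(0, start), min(end, chrom_length), left_pad, right_pad


# Example: a 131,072 bp interval (an assumed Basenji-style length)
# grows by 32,768 bp on each side.
new_start, new_end = extend_interval(1_000_000, 1_131_072)
print(new_start, new_end)  # 967232 1163840
```

Windows near a chromosome end can extend past the available sequence, so `clip_to_chromosome` returns the padding needed on each side (for example, with `N` bases).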

Todo

  • Fix the script to account for the difference in sequence length: the Basenji training data uses ~130k bp sequences, whereas the paper uses ~190k bp (training in progress)

Citations

@article {Avsec2021.04.07.438649,
    author  = {Avsec, {\v Z}iga and Agarwal, Vikram and Visentin, Daniel and Ledsam, Joseph R. and Grabska-Barwinska, Agnieszka and Taylor, Kyle R. and Assael, Yannis and Jumper, John and Kohli, Pushmeet and Kelley, David R.},
    title   = {Effective gene expression prediction from sequence by integrating long-range interactions},
    elocation-id = {2021.04.07.438649},
    year    = {2021},
    doi     = {10.1101/2021.04.07.438649},
    publisher = {Cold Spring Harbor Laboratory},
    URL     = {https://www.biorxiv.org/content/early/2021/04/08/2021.04.07.438649},
    eprint  = {https://www.biorxiv.org/content/early/2021/04/08/2021.04.07.438649.full.pdf},
    journal = {bioRxiv}
}

enformer-tensorflow-sonnet-training-script's People

Contributors

lucidrains

enformer-tensorflow-sonnet-training-script's Issues

Gradient clipping: why not global norm?

The paper says: "We clipped gradients to a maximum global norm of 0.2."
In the training script, the line

gradients = [tf.clip_by_norm(grad, clip_grad_norm) for grad in gradients]

uses a plain per-tensor clip_by_norm with a value of 1.0 instead.
What is the reasoning behind this choice? Why not use tf.clip_by_global_norm?

I am asking in case this is something you already tried and decided against, as I am struggling to train the model properly myself (using enformer_pytorch).
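
To make the difference in the question concrete, here is a small pure-Python sketch of the two clipping rules. It is a re-implementation for illustration only, operating on flat lists of floats, and is not the TensorFlow code. It is meant to mirror the documented behaviour of `tf.clip_by_norm` (each tensor clipped independently) and `tf.clip_by_global_norm` (all tensors scaled by one shared factor), and assumes `clip_norm > 0`.

```python
import math


def clip_by_norm(grad, clip_norm):
    """Rescale ONE gradient so that its own L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = clip_norm / max(norm, clip_norm)  # 1.0 when already within the limit
    return [g * scale for g in grad]


def clip_by_global_norm(grads, clip_norm):
    """Rescale ALL gradients by one shared factor so that the L2 norm of
    their concatenation is at most clip_norm. Returns (clipped, global_norm)."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm


def l2(grad):
    return math.sqrt(sum(g * g for g in grad))


# Two toy gradients with norms 5.0 and 0.5 (ratio 10:1).
grads = [[3.0, 4.0], [0.3, 0.4]]

per_tensor = [clip_by_norm(g, 1.0) for g in grads]
global_clipped, global_norm = clip_by_global_norm(grads, 1.0)

# Per-tensor clipping shrinks only the large gradient, changing the ratio.
print([round(l2(g), 3) for g in per_tensor])      # [1.0, 0.5]
# Global clipping scales both by the same factor, preserving the ratio.
print([round(l2(g), 3) for g in global_clipped])  # [0.995, 0.1]
```

Per-tensor clipping changes the relative magnitudes of the gradients (and so the direction of the overall update), whereas global-norm clipping only shortens the update while keeping its direction.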

130k to 190k long sequence changes?

Hi, you mentioned that the script needs some updates to account for the increased sequence length in the dataset. What kind of changes were you referring to, at a high level?

Thanks!
