
songlab-cal / tape-neurips2019

Stars: 118, Watchers: 9, Forks: 34, Size: 139 KB

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. (DEPRECATED)

Home Page: https://arxiv.org/abs/1906.08230

License: MIT License

Languages: Python 97.52%, TeX 1.69%, Shell 0.80%
Topics: deep-learning, protein-sequences, protein-structure, semi-supervised-learning, benchmark, language-modeling, dataset

tape-neurips2019's People

Contributors

cthoyt, mlgill, nickbhat, rmrao, thomas-a-neil


tape-neurips2019's Issues

About Bepler's baseline performance

Hi, in the paper I noticed that Bepler's model was trained with next-token prediction, and there are two pre-trained Bepler models (one unsupervised and one multitask). Which one is used as the baseline in the paper?

Thanks!

Freeze Weights

Hello,

I can see from the Training Details in the paper that during supervised fine-tuning, backpropagation ran through the entire model, including the language model portion. I also see from the code that you had some functionality for freezing weights. I was curious what magnitude of difference you saw between freezing and training the language model portion during supervised fine-tuning (if you tried that), especially for the Transformer.

Thanks again!

Scott

Error when training is done

Hi, I used the following command to train on the fluorescence task with pretrained transformer weights (I used this config for my test run):
tape with model=transformer tasks=fluorescence gpu.device=0 load_from=pretrained_models/transformer_weights.h5 num_epochs=1 steps_per_epoch=10

When training finished, I received an error. The traceback is as follows:

Traceback (most recent calls WITHOUT Sacred internals):
  File "/work01/home/wxxie/project/biolang/tape/tape/__main__.py", line 330, in main
    train_metrics = train_graph.run_for_n_steps(_config['steps_per_epoch'], epoch_num=epoch)
  File "/work01/home/wxxie/conda-env/tape/lib/python3.6/site-packages/rinokeras/core/v1x/train/RinokerasGraph.py", line 196, in run_for_n_steps
    self.run('default')
  File "/work01/home/wxxie/conda-env/tape/lib/python3.6/site-packages/rinokeras/core/v1x/train/RinokerasGraph.py", line 132, in __exit__
    self.progress_bar.__exit__()
TypeError: __exit__() missing 3 required positional arguments: 'exc_type', 'exc_value', and 'traceback'

It seems to me the error comes from rinokeras, but I am new to it, and I have installed the required version of rinokeras. Could you help?
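
For reference, Python's context-manager protocol requires __exit__ to be called with (exc_type, exc_value, traceback); calling it with no arguments raises exactly this TypeError. Below is a minimal, self-contained illustration of the pattern (assuming a tqdm-style progress bar, as the log format suggests; this is not an official rinokeras patch):

    from tqdm import tqdm  # assuming the progress bar is tqdm-like

    class ProgressWrapper:
        def __enter__(self):
            self.progress_bar = tqdm(total=10)
            self.progress_bar.__enter__()
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            # Forward the exception info (all None on a clean exit) rather than
            # calling __exit__() with no arguments, which raises the TypeError above.
            self.progress_bar.__exit__(exc_type, exc_value, traceback)
            return False  # do not suppress exceptions

    with ProgressWrapper() as wrapper:
        wrapper.progress_bar.update(10)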

Thanks

Bruce

Global Arguments question

In @proteins.config in __main__.py there are two default global arguments set as follows:

    freeze_embedding_weights = False  # noqa: F841
    save_outputs = False  # noqa: F841

Can you tell me what freeze_embedding_weights does? If I want the pretrained unsupervised weights for the fluorescence task to remain fixed (i.e., the embeddings stay constant), should I switch this to True?

For save_outputs: if this is False, will outputs be saved? [See README.md | Saving Results - not sure whether there is a discrepancy here.] If save_outputs is True, what gets saved?
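
For concreteness, here is a usage sketch combining both flags in the same style as the other commands in this thread (the paths and values are illustrative, not a verified recipe):

    tape with model=transformer tasks=fluorescence load_from=pretrained_models/transformer_weights.h5 freeze_embedding_weights=True save_outputs=True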

Thanks.

Scott

Example on how to add a new Task

Hello,

I'm interested in adding a new task, specifically a SequenceToFloatTask, so that I can fine-tune the various models pre-trained on Pfam on this task and evaluate their performance. Unfortunately, while reading through the source code I was unable to determine how the datasets are associated with the tasks, so I was unsure how to figure this out on my own. I see that the Task ABC has a get_data() method that uses a data folder to load the specific files, and that __main__ has an additional get_data() function that receives a data folder as input and passes it along to each task's get_data() method. However, in both the eval() proteins.command and the main() proteins.automain (both in __main__), this get_data() function is called without specifying any data_folder. If the way to associate a dataset with a specific task could be explained, I might be able to figure this out on my own; in either case, an example of adding a new task would be really helpful!
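
As a point of reference, a sequence-to-float dataset stored as TFRecords might be parsed roughly as follows; this is only a sketch, and the feature keys ('primary', 'protein_length', 'target') are assumptions rather than the repository's confirmed schema:

    import tensorflow as tf

    def parse_sequence_to_float(serialized_example):
        # Feature keys are hypothetical; check the schema of the repo's .tfrecords files.
        features = {
            'primary': tf.io.VarLenFeature(tf.int64),           # integer-encoded amino acids
            'protein_length': tf.io.FixedLenFeature([], tf.int64),
            'target': tf.io.FixedLenFeature([], tf.float32),     # float label to regress
        }
        parsed = tf.io.parse_single_example(serialized_example, features)
        sequence = tf.sparse.to_dense(parsed['primary'])
        return {'primary': sequence, 'protein_length': parsed['protein_length']}, parsed['target']

    def make_dataset(tfrecord_path, batch_size=32):
        dataset = tf.data.TFRecordDataset(tfrecord_path)
        dataset = dataset.map(parse_sequence_to_float)
        # Pad variable-length sequences within each batch.
        return dataset.padded_batch(
            batch_size,
            padded_shapes=({'primary': [None], 'protein_length': []}, []))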

Best,
Chase

How to debug this project in PyCharm

The old Keras project is started from the terminal, so it is difficult to find the main file, and because Sacred controls the logging and configuration, it is hard to use debug mode in PyCharm. Any advice? Or should I use the PyTorch version?
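
One common pattern for debugging Sacred experiments in an IDE is a small wrapper script that calls the experiment's run() method directly, so breakpoints work as usual. A sketch, assuming the Experiment object in tape/__main__.py is named proteins (as the @proteins.config decorator mentioned elsewhere in these issues suggests) and reusing config keys from the commands above:

    # debug_tape.py -- run the Sacred experiment programmatically so PyCharm can attach a debugger.
    from tape.__main__ import proteins  # assumes the sacred.Experiment is named `proteins`

    if __name__ == '__main__':
        # Config keys mirror the CLI usage shown in these issues (tape with model=... tasks=...).
        proteins.run(config_updates={
            'model': 'transformer',
            'tasks': 'fluorescence',
            'num_epochs': 1,
            'steps_per_epoch': 10,
        })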

Rename and deploy to PyPI

After reorganizing the code in #1, it would be a good idea to deploy this code to PyPI to even further support reproducibility. However, there is already a package named tape, so it would be necessary to rename first...

Pretrained supervised task-specific weights for fluorescence task?

Hello,

I'm trying to run the fluorescence task with fully pretrained model weights (i.e., weights for both the unsupervised pretraining and the supervised task). I have downloaded the pretrained UniRep model, but I'm thinking this does not include the model weights for the supervised task (fluorescence). Are these available, or am I confusing something?

In README.md | Loading a Model there is a bit about loading supervised task-specific weights, but it wasn't clear to me.

Thanks for any guidance you have.

Scott

Eval error: CUDA_ERROR_OUT_OF_MEMORY

I have trained the fluorescence task model with:

!tape with model=unirep tasks=fluorescence load_from='pretrained_models/unirep_weights.h5' freeze_embedding_weights=True steps_per_epoch=100 datafile='data/fluorescence/fluorescence_train.tfrecords'

[Note: I used a very small steps_per_epoch so it would train in a reasonable time and I could just get something working.]

Next I tried to evaluate the model using:

!tape-eval results/fluorescence_unirep_2019-07-30--17-22-15/ --datafile data/fluorescence/fluorescence_test.tfrecord

but after only a few iterations the GPU memory use just explodes:

Model Parameters: 18219415 
Loading task weights from results/fluorescence_unirep_2019-07-30--17-22-15/task_weights.h5
Saving outputs to results/fluorescence_unirep_2019-07-30--17-22-15/outputs.pkl
37it [00:20,  2.00it/s, Loss=0.75, MAE=0.57]2019-07-30 19:48:39.226378: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-30 19:48:39.226439: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592

I'm running on a Tesla T4 with 14GB of memory (Google Colab).

The memory explosion would appear to be in
test_metrics = test_graph.run_epoch(save_outputs=outfile)

Any suggestions on how to resolve this?
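
One way to narrow this down (a sketch, assuming memory grows with the number of evaluated records, e.g. because run_epoch(save_outputs=...) accumulates outputs; this is not a confirmed diagnosis) is to evaluate on a smaller shard of the test set first:

    # make_small_shard.py -- hypothetical helper: copy the first N records of a .tfrecord
    # into a smaller file so tape-eval can be run on a subset while debugging memory use.
    import tensorflow as tf

    def take_first_n(src, dst, n=500):
        with tf.python_io.TFRecordWriter(dst) as writer:
            for i, record in enumerate(tf.python_io.tf_record_iterator(src)):
                if i >= n:
                    break
                writer.write(record)

    if __name__ == '__main__':
        take_first_n('data/fluorescence/fluorescence_test.tfrecord',
                     'data/fluorescence/fluorescence_test_small.tfrecord')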

Thanks.

Scott

Use entrypoints for CLI

setuptools allows some nice configuration so you can transform python -m tape into simply tape.

I would also suggest moving the two evaluation scripts inside the repo as well, and using other entrypoints to access them. I will submit a PR and explain more!
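
For reference, a minimal sketch of the setuptools configuration being suggested; the module paths and function names below are illustrative, not the repository's actual layout:

    # setup.py (sketch)
    from setuptools import setup, find_packages

    setup(
        name='tape',  # note: this name is already taken on PyPI, see the renaming issue above
        packages=find_packages(),
        entry_points={
            'console_scripts': [
                # Each entry maps a command name to a module:function pair (illustrative paths).
                'tape = tape.__main__:main',
                'tape-eval = tape.__main__:eval',
            ],
        },
    )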

Evaluation on contact prediction with binary metrics

Hello,
In the script tape/analysis/contact_prediction/evaluate_contact_prediction_metrics.py, at line 62 there is metric(label, prediction), but the true labels will also contain* -1 when a position is not correctly determined in the dataset (from the valid_mask column), so the default parameter average='binary' will not work. Is there something I have missed?

*Possible label values, in my understanding: 1: contact, 0: not in contact, -1: not enough information.

The exact error: ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'], raised at the line metrics.append(metric(label, prediction)).
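
A sketch of the kind of masking that would sidestep this, assuming labels take the values 1 / 0 / -1 as described above (this is not necessarily how the evaluation script was intended to be used):

    import numpy as np
    from sklearn.metrics import precision_score

    # label:      1 = contact, 0 = not in contact, -1 = position not determined (valid_mask)
    # prediction: binary predictions aligned with label
    label = np.array([1, 0, -1, 1, 0, -1])
    prediction = np.array([1, 0, 0, 1, 1, 0])

    # Drop undetermined positions before computing a binary metric,
    # so that average='binary' only ever sees {0, 1} targets.
    valid = label != -1
    print(precision_score(label[valid], prediction[valid], average='binary'))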

Citation for NetSurfP2.0

Since the secondary structure data was curated in the NetSurfP-2.0 paper, should people cite it in addition to the PDB when using that dataset?

How Bepler's Multitask Model was trained

Hi, I found the well-defined Bepler supervised pre-training tasks and the released model weights, but I'm really confused about how the multitask model was trained, because I could not find a task that combines the original two supervised tasks.

If anybody has any suggestions, thanks in advance!

Embedding tfrecords doesn't work

TensorFlow complains about the datafile being passed as a PosixPath. This is fixed if you cast the path to a string in run_embed.py.
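
A sketch of the cast being described (the variable name and path are illustrative):

    from pathlib import Path
    import tensorflow as tf

    datafile = Path('data/fluorescence/fluorescence_train.tfrecords')
    # TF 1.x dataset ops expect a string (or tensor of strings), not a pathlib.Path,
    # so cast before handing the path to TensorFlow.
    dataset = tf.data.TFRecordDataset(str(datafile))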

amino acid mapping

Can the authors provide the mapping from index numbers in the raw data to three-letter amino acid names?

I'm assuming it is alphabetical, starting from 'A' -> 4 (skipping the letter 'J'). In addition to the ordering, please clarify the full amino acid names.
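
For illustration, here is the mapping the question assumes (alphabetical single-letter codes, skipping 'J', starting at index 4); this is the asker's guess, not a confirmed vocabulary:

    import string

    # Single-letter amino acid codes in alphabetical order, skipping 'J',
    # mapped to indices starting at 4 (indices 0-3 presumably reserved for special tokens).
    letters = [c for c in string.ascii_uppercase if c != 'J']
    index_to_letter = {i + 4: letter for i, letter in enumerate(letters)}
    print(index_to_letter[4])   # 'A'
    print(index_to_letter[28])  # 'Z'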
