
songlab-cal / tape-neurips2019

Stars: 118, Watchers: 9, Forks: 34, Size: 139 KB

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. (DEPRECATED)

Home Page: https://arxiv.org/abs/1906.08230

License: MIT License

Languages: Python 97.52%, TeX 1.69%, Shell 0.80%
Topics: deep-learning, protein-sequences, protein-structure, semi-supervised-learning, benchmark, language-modeling, dataset

tape-neurips2019's People

Contributors

cthoyt, mlgill, nickbhat, rmrao, thomas-a-neil


tape-neurips2019's Issues

About Bepler's baseline performance

Hi, in the paper I noticed that Bepler's model was trained with next-token prediction, and there are two pre-trained Bepler models (one unsupervised and one multitask). Which one is used as the baseline in the paper?

Thanks!

Freeze Weights

Hello,

I can see from the Training Details in the paper that during supervised fine-tuning, backpropagation ran through the entire model, including the language model portion. I also see from the code that you had some functionality for freezing weights. I was curious what magnitude of difference you saw between freezing and training the language model portion during supervised fine-tuning (if you tried that), especially for the Transformer.

Thanks again!

Scott

Error when training is done

Hi, I used the following command to train on the fluorescence task with pretrained transformer weights (I used this config for my test run):
tape with model=transformer tasks=fluorescence gpu.device=0 load_from=pretrained_models/transformer_weights.h5 num_epochs=1 steps_per_epoch=10

When training finished, I received an error. The traceback is as follows:

Traceback (most recent calls WITHOUT Sacred internals):
  File "/work01/home/wxxie/project/biolang/tape/tape/__main__.py", line 330, in main
    train_metrics = train_graph.run_for_n_steps(_config['steps_per_epoch'], epoch_num=epoch)
  File "/work01/home/wxxie/conda-env/tape/lib/python3.6/site-packages/rinokeras/core/v1x/train/RinokerasGraph.py", line 196, in run_for_n_steps
    self.run('default')
  File "/work01/home/wxxie/conda-env/tape/lib/python3.6/site-packages/rinokeras/core/v1x/train/RinokerasGraph.py", line 132, in __exit__
    self.progress_bar.__exit__()
TypeError: __exit__() missing 3 required positional arguments: 'exc_type', 'exc_value', and 'traceback'

It seems to me the error comes from rinokeras, but I am new to it, and I have installed the required version of rinokeras. Could you help?
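
For reference, Python's context-manager protocol requires __exit__ to be called with (exc_type, exc_value, traceback); calling it with no arguments raises exactly this TypeError. Below is a minimal, self-contained illustration of the pattern (assuming a tqdm-style progress bar, as the log format suggests; this is not an official rinokeras patch):

    from tqdm import tqdm  # assuming the progress bar is tqdm-like

    class ProgressWrapper:
        def __enter__(self):
            self.progress_bar = tqdm(total=10)
            self.progress_bar.__enter__()
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            # Forward the exception info (all None on a clean exit) rather than
            # calling __exit__() with no arguments, which raises the TypeError above.
            self.progress_bar.__exit__(exc_type, exc_value, traceback)
            return False  # do not suppress exceptions

    with ProgressWrapper() as wrapper:
        wrapper.progress_bar.update(10)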

Thanks

Bruce

Global Arguments question

In @proteins.config in __main__.py there are two default global arguments set as follows:

    freeze_embedding_weights = False  # noqa: F841
    save_outputs = False  # noqa: F841

Can you tell me what freeze_embedding_weights does? If I want the pretrained unsupervised weights for the fluorescence task to remain fixed (i.e., the embeddings stay constant), should I switch this to True?

For save_outputs: if this is False, will outputs be saved? [See README.md | Saving Results - not sure whether there is a discrepancy here.] If save_outputs is True, what gets saved?
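
For concreteness, here is a usage sketch combining both flags in the same style as the other commands in this thread (the paths and values are illustrative, not a verified recipe):

    tape with model=transformer tasks=fluorescence load_from=pretrained_models/transformer_weights.h5 freeze_embedding_weights=True save_outputs=True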

Thanks.

Scott

Example on how to add a new Task

Hello,

I'm interested in adding a new task, specifically a SequenceToFloatTask, so that I can fine-tune the various models pre-trained on Pfam on this task and evaluate their performance. Unfortunately, while reading through the source code I was unable to determine how the datasets are associated with the tasks, so I was unsure how to figure this out on my own. I see that the Task ABC has a get_data() method that uses a data folder to load the specific files, and that __main__ has an additional get_data() function that receives a data folder as input and passes it along to each task's get_data() method. However, in both the eval() proteins.command and the main() proteins.automain (both in __main__), this get_data() function is called without specifying any data_folder. If the way to associate a dataset with a specific task could be explained, I might be able to figure this out on my own; in either case, an example of adding a new task would be really helpful!
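
As a point of reference, a sequence-to-float dataset stored as TFRecords might be parsed roughly as follows; this is only a sketch, and the feature keys ('primary', 'protein_length', 'target') are assumptions rather than the repository's confirmed schema:

    import tensorflow as tf

    def parse_sequence_to_float(serialized_example):
        # Feature keys are hypothetical; check the schema of the repo's .tfrecords files.
        features = {
            'primary': tf.io.VarLenFeature(tf.int64),           # integer-encoded amino acids
            'protein_length': tf.io.FixedLenFeature([], tf.int64),
            'target': tf.io.FixedLenFeature([], tf.float32),     # float label to regress
        }
        parsed = tf.io.parse_single_example(serialized_example, features)
        sequence = tf.sparse.to_dense(parsed['primary'])
        return {'primary': sequence, 'protein_length': parsed['protein_length']}, parsed['target']

    def make_dataset(tfrecord_path, batch_size=32):
        dataset = tf.data.TFRecordDataset(tfrecord_path)
        dataset = dataset.map(parse_sequence_to_float)
        # Pad variable-length sequences within each batch.
        return dataset.padded_batch(
            batch_size,
            padded_shapes=({'primary': [None], 'protein_length': []}, []))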

Best,
Chase

How to debug this project in PyCharm

The old Keras project is started from the terminal, so it is difficult to find the main file, and because Sacred controls the logging and configuration, it is hard to use debug mode in PyCharm. Any advice? Or should I use the PyTorch version?
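
One common pattern for debugging Sacred experiments in an IDE is a small wrapper script that calls the experiment's run() method directly, so breakpoints work as usual. A sketch, assuming the Experiment object in tape/__main__.py is named proteins (as the @proteins.config decorator mentioned elsewhere in these issues suggests) and reusing config keys from the commands above:

    # debug_tape.py -- run the Sacred experiment programmatically so PyCharm can attach a debugger.
    from tape.__main__ import proteins  # assumes the sacred.Experiment is named `proteins`

    if __name__ == '__main__':
        # Config keys mirror the CLI usage shown in these issues (tape with model=... tasks=...).
        proteins.run(config_updates={
            'model': 'transformer',
            'tasks': 'fluorescence',
            'num_epochs': 1,
            'steps_per_epoch': 10,
        })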

Rename and deploy to PyPI

After reorganizing the code in #1, it would be a good idea to deploy this code to PyPI to even further support reproducibility. However, there is already a package named tape, so it would be necessary to rename first...

Pretrained supervised task-specific weights for fluorescence task?

Hello,

I'm trying to run the fluorescence task with fully pretrained model weights (i.e., weights for both the unsupervised pretraining and the supervised task). I have downloaded the pretrained UniRep model, but I'm thinking this does not include the model weights for the supervised task (fluorescence). Are these available, or am I confusing something?

In README.md | Loading a Model there is a bit about loading supervised task-specific weights, but it wasn't clear to me.

Thanks for any guidance you have.

Scott

Eval error: CUDA_ERROR_OUT_OF_MEMORY

I have trained the fluorescence task model with:

!tape with model=unirep tasks=fluorescence load_from='pretrained_models/unirep_weights.h5' freeze_embedding_weights=True steps_per_epoch=100 datafile='data/fluorescence/fluorescence_train.tfrecords'

[Note: I used a very small steps_per_epoch so it would train in a reasonable time and I could just get something working.]

Next I tried to evaluate the model using:

!tape-eval results/fluorescence_unirep_2019-07-30--17-22-15/ --datafile data/fluorescence/fluorescence_test.tfrecord

but after only a few iterations the GPU memory use just explodes:

Model Parameters: 18219415 
Loading task weights from results/fluorescence_unirep_2019-07-30--17-22-15/task_weights.h5
Saving outputs to results/fluorescence_unirep_2019-07-30--17-22-15/outputs.pkl
37it [00:20,  2.00it/s, Loss=0.75, MAE=0.57]2019-07-30 19:48:39.226378: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-30 19:48:39.226439: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592

I'm running on a Tesla T4 with 14GB of memory (Google Colab).

The memory explosion would appear to be in
test_metrics = test_graph.run_epoch(save_outputs=outfile)

Any suggestions on how to resolve this?
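
One way to narrow this down (a sketch, assuming memory grows with the number of evaluated records, e.g. because run_epoch(save_outputs=...) accumulates outputs; this is not a confirmed diagnosis) is to evaluate on a smaller shard of the test set first:

    # make_small_shard.py -- hypothetical helper: copy the first N records of a .tfrecord
    # into a smaller file so tape-eval can be run on a subset while debugging memory use.
    import tensorflow as tf

    def take_first_n(src, dst, n=500):
        with tf.python_io.TFRecordWriter(dst) as writer:
            for i, record in enumerate(tf.python_io.tf_record_iterator(src)):
                if i >= n:
                    break
                writer.write(record)

    if __name__ == '__main__':
        take_first_n('data/fluorescence/fluorescence_test.tfrecord',
                     'data/fluorescence/fluorescence_test_small.tfrecord')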

Thanks.

Scott

Use entrypoints for CLI

setuptools allows some nice configuration so you can transform python -m tape into simply tape.

I would also suggest moving the two evaluation scripts inside the repo as well, and using other entrypoints to access them. I will submit a PR and explain more!
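
For reference, a minimal sketch of the setuptools configuration being suggested; the module paths and function names below are illustrative, not the repository's actual layout:

    # setup.py (sketch)
    from setuptools import setup, find_packages

    setup(
        name='tape',  # note: this name is already taken on PyPI, see the renaming issue above
        packages=find_packages(),
        entry_points={
            'console_scripts': [
                # Each entry maps a command name to a module:function pair (illustrative paths).
                'tape = tape.__main__:main',
                'tape-eval = tape.__main__:eval',
            ],
        },
    )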

Evaluation on contact prediction with binary metrics

Hello,
In the script tape/analysis/contact_prediction/evaluate_contact_prediction_metrics.py, at line 62 there is metric(label, prediction), but the true labels will also contain* -1 when a position is not correctly determined in the dataset (from the valid_mask column), so the default parameter average='binary' will not work. Is there something I have missed?

*Possible label values, in my understanding: 1: contact, 0: not in contact, -1: not enough information.

The exact error: ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'], raised at the line metrics.append(metric(label, prediction)).
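
A sketch of the kind of masking that would sidestep this, assuming labels take the values 1 / 0 / -1 as described above (this is not necessarily how the evaluation script was intended to be used):

    import numpy as np
    from sklearn.metrics import precision_score

    # label:      1 = contact, 0 = not in contact, -1 = position not determined (valid_mask)
    # prediction: binary predictions aligned with label
    label = np.array([1, 0, -1, 1, 0, -1])
    prediction = np.array([1, 0, 0, 1, 1, 0])

    # Drop undetermined positions before computing a binary metric,
    # so that average='binary' only ever sees {0, 1} targets.
    valid = label != -1
    print(precision_score(label[valid], prediction[valid], average='binary'))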

Citation for NetSurfP2.0

Since the secondary structure data was curated in the NetSurfP-2.0 paper, should people cite it in addition to the PDB when using that dataset?

How Bepler's Multitask Model was trained

Hi, I found the well-defined Bepler supervised pre-training tasks and the released model weights, but I'm really confused about how the multitask model was trained, because I could not find a task that combines the original two supervised tasks.

If anybody has any suggestions, thanks in advance!

Embedding tfrecords doesn't work

TensorFlow complains about the datafile being passed as a PosixPath. This is fixed if you cast the path to a string in run_embed.py.
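
A sketch of the cast being described (the variable name and path are illustrative):

    from pathlib import Path
    import tensorflow as tf

    datafile = Path('data/fluorescence/fluorescence_train.tfrecords')
    # TF 1.x dataset ops expect a string (or tensor of strings), not a pathlib.Path,
    # so cast before handing the path to TensorFlow.
    dataset = tf.data.TFRecordDataset(str(datafile))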

amino acid mapping

Can the authors provide the mapping from index numbers in the raw data to three-letter amino acid names?

I'm assuming it is alphabetical, starting from 'A' -> 4 (skipping the letter 'J'). In addition to the ordering, please clarify the full amino acid names.
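
For illustration, here is the mapping the question assumes (alphabetical single-letter codes, skipping 'J', starting at index 4); this is the asker's guess, not a confirmed vocabulary:

    import string

    # Single-letter amino acid codes in alphabetical order, skipping 'J',
    # mapped to indices starting at 4 (indices 0-3 presumably reserved for special tokens).
    letters = [c for c in string.ascii_uppercase if c != 'J']
    index_to_letter = {i + 4: letter for i, letter in enumerate(letters)}
    print(index_to_letter[4])   # 'A'
    print(index_to_letter[28])  # 'Z'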
