
BERTax training utilities

This repository contains utilities for pre-training and fine-tuning BERTax models, as well as various helper functions and scripts used during the development of BERTax.

In addition to the described mode of training BERTax on genomic DNA sequences, development scripts for training on gene sequences are also included.

Training new BERTax models

Data preparation

To train a new BERTax model, the user must provide one of the following two data structures.

fragments directories

For training the normal, genomic DNA-based models, a fixed directory structure is required, with one JSON file per class consisting of a simple list of sequences:

[class_1]/
  [class_1]_fragments.json
[class_2]/
  [class_2]_fragments.json
.../
[class_n]/
  [class_n]_fragments.json

Example fragments file:

["ACGTACGTACGATCGA", "TACACTTTTTA", ..., "ATACTATCTATCTA"]

gene model training directories

The gene models were used in an early stage of BERTax development and require a different directory structure:

Each sequence is contained in its own FASTA file; additionally, a JSON file containing all file names and associated classes can speed up preprocessing tremendously.

[class_1]/
  [sequence_1.fa]
  [sequence_2.fa]
  ...
  [sequence_n.fa]
[class_2]/
  ...
.../
[class_n]/
  ...
  [sequence_l.fa]
{files.json}

The JSON file contains a list of two lists of equal size: the first list contains the file paths of the FASTA files and the second list the associated classes:

[["class_1/sequence1.fa", "class_1/sequence2.fa", ..., "class_n/sequence_l.fa"],
 ["class_1", "class_1", ..., "class_n"]]

Training process

The normal, genomic DNA-based model can be pre-trained with models/bert_nc.py and fine-tuned with models/bert_nc_finetune.py.

For example, the BERTax model was pre-trained with:

python -m models.bert_nc fragments_root_dir --batch_size 32 --head_num 5 \
       --transformer_num 12 --embed_dim 250 --feed_forward_dim 1024 --dropout_rate 0.05 \
       --name bert_nc_C2 --epochs 10

and fine-tuned with:

python -m models.bert_nc_finetune bert_nc_C2.h5 fragments_root_dir --multi_tax \
       --epochs 15 --batch_size 24 --save_name _small_trainingset_filtered_fix_classes_selection \
       --store_predictions --nr_seqs 1000000000

The development gene models can be pre-trained with models/bert_pretrain.py:

python -m models.bert_pretrain bert_gene_C2 --epochs 10 --batch_size 32 --seq_len 502 \
	 --head_num 5 --embed_dim 250 --feed_forward_dim 1024 --dropout_rate 0.05 \
	 --root_fa_dir sequences --from_cache sequences/files.json

and fine-tuned with models/bert_finetune.py:

python -m models.bert_finetune bert_gene_C2_trained.h5 --epochs 4 \
	 --root_fa_dir sequences --from_cache sequences/files.json

All training scripts can be called with the --help flag to adjust various parameters.

Using BERT models

It is recommended to use fine-tuned models in the BERTax tool with the parameter --custom_model_file.

Alternatively, a much more minimal script to predict multi-fasta sequences with the trained model is also available in this repository:

python -m utils.test_bert finetuned_bert.h5 --fasta sequences.fa
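
Since biopython is already among the dependencies, the input multi-fasta can be quickly sanity-checked before prediction; a minimal sketch, assuming the sequences.fa from the command above:

from Bio import SeqIO

# Print the ID and length of each record in the multi-fasta
# to verify the input before running utils.test_bert.
for record in SeqIO.parse("sequences.fa", "fasta"):
    print(record.id, len(record.seq))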

Benchmarking

If the user needs a predefined training and test set, for example for benchmarking different approaches, it can be generated with:

python -m preprocessing.make_dataset single_sequences_json_folder/ out_folder/ --unbalanced

This creates the files test.tsv, train.tsv, and classes.pkl, which can be used by bert_nc_finetune:

python -m models.bert_nc_finetune bert_nc_trained.h5 make_dataset_out_folder/ --unbalanced --use_defined_train_test_set

If FASTA files are necessary, e.g., for competing methods, you can convert train.tsv and test.tsv via:

python -m preprocessing.dataset2fasta make_dataset_out_folder/
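
The exact column layout of these TSV files is determined by preprocessing.make_dataset; if you need to inspect them, a layout-agnostic read works in any case:

import csv

# Print each row of the generated training set; the column layout
# is whatever preprocessing.make_dataset wrote out.
with open("make_dataset_out_folder/train.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        print(row)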

Additional scripts

  • preprocessing/fasta2fragments.py / preprocessing/fragments2fasta.py: convert between multi-fasta and JSON training files (see the sketch after this list)
  • preprocessing/genome_db.py, preprocessing/genome_mince.py: scripts used to generate genomic fragments for training
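
The fasta-to-fragments direction is simple to illustrate; the following is a minimal sketch (not the repository script itself), assuming a hypothetical input file class_1.fa:

import json
from Bio import SeqIO

# Collect all sequences of a multi-fasta into the fragments-JSON format
# described above; see preprocessing/fasta2fragments.py for the real script.
sequences = [str(record.seq) for record in SeqIO.parse("class_1.fa", "fasta")]
with open("class_1_fragments.json", "w") as f:
    json.dump(sequences, f)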

Dependencies

  • tensorflow >= 2
  • keras
  • numpy
  • tqdm
  • scikit-learn
  • keras-bert
  • biopython
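
Assuming the package names match their PyPI distributions, the dependencies can be installed, for example, with:

pip install "tensorflow>=2" keras numpy tqdm scikit-learn keras-bert biopython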
