Code and Data for "Bacterial cell surface characterization by phage display coupled to high-throughput sequencing."


Code used to process the high-throughput sequencing data in our manuscript "Bacterial cell surface characterization by phage display coupled to high-throughput sequencing."

This repository contains several components:

  • nbseq-workflow contains shared code for a Snakemake-based workflow for processing raw Phage-seq sequencing data. The other directories include symbolic links to the scripts, conda environments (envs), and rules defined in this directory. However, each specific experiment contains its own configuration, resources, and Snakemake workflow definitions (Snakefile and *.smk). Additionally, each experiment contains a set of Jupyter notebooks in workflow/notebooks which were used to generate figures in the manuscript.
    • panning-small contains code used to process the data from solid-phase and small-scale cell-based phage display selection campaigns reported in Fig. 2
    • panning-massive contains code used to process the data from high-throughput cell-based phage display selection campaigns reported in Fig. 3D
    • panning-extended contains code used to process the data from the extended rounds of high-throughput cell-based phage display selection reported in Fig. 3E
    • panning-minimal is a self-contained, runnable example containing a small subset of the data from panning-extended. The code is otherwise the same as panning-extended.
  • alpaca-library contains a Snakemake-based workflow for processing data where the entire input library was sequenced using longer reads in order to characterize its diversity.
  • other-figures contains raw data and code used to generate other figures in the paper that do not rely on sequencing data (e.g. ELISAs and phagemid titers).

The nbseq library is a close companion to this repository.

Installation and usage

  1. Clone this entire repository:

    Clone this repository to your local system or cluster, into the place where you want to perform the data analysis:

    git clone https://github.com/caseygrun/phage-seq.git
    cd phage-seq
  2. Install the Mamba (or Conda) package manager if it is not already installed on your machine or cluster. I recommend using the miniforge distribution.

  3. Install Snakemake using conda (or mamba):

    conda create -c conda-forge -c bioconda -n snakemake snakemake mamba

    This will create a conda environment called snakemake. All execution of the workflows will be done in this environment. For installation details, see the instructions in the Snakemake documentation.

  4. Prepare to execute the workflow

    Activate the conda environment:

    conda activate snakemake

    Change to the relevant directory, e.g. for panning-minimal:

    cd panning-minimal
  5. Execute the workflow

    You may want to test your configuration first by performing a dry-run via:

    snakemake --use-conda -n --cores all

    This will print the execution plan but not run any of the code.

    Each workflow can then be executed by running this command from the relevant directory:

    snakemake --use-conda --cores all

    Alternatively, some experiments are broken into multiple Snakemake workflows (specified in distinct .smk files) which can be invoked individually. The workflows are arranged this way because the preprocess step creates many large temporary files which you may not want to preserve on cluster storage; once this step is completed, the intermediate/ directory can be safely deleted and the subsequent workflows invoked separately (see the example sequence at the end of this step). The sub-workflows can be invoked using the -s flag to specify a path to a specific workflow file, e.g.:

    snakemake --use-conda --cores all -s workflow/preprocess.smk
    snakemake --use-conda --cores all -s workflow/feature_table.smk
    snakemake --use-conda --cores all -s workflow/downstream.smk

    See the Snakemake documentation for further details. In particular, you likely want to adjust --cores $N to specify the number of CPU cores available on your machine or cluster instance.
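
    For example, one staged session might look like the following sketch, following the note above about deleting intermediate/ (it assumes the snakemake conda environment is active and that you are inside the relevant experiment directory):

    # run the preprocessing sub-workflow first
    snakemake --use-conda --cores all -s workflow/preprocess.smk
    # once preprocessing is complete, the large temporary files can be discarded
    rm -r intermediate/
    # the remaining sub-workflows can then be invoked separately
    snakemake --use-conda --cores all -s workflow/feature_table.smk
    snakemake --use-conda --cores all -s workflow/downstream.smk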

Included demonstrations

Several demonstrations of the workflow code and notebooks are available:

  • In panning-minimal, a subset of the raw data from panning-extended is embedded in the repository so that you can run the entire analysis pipeline and then interactively explore the results using the nbseq library.

  • The full processed datasets for panning-small, panning-massive, panning-extended, and alpaca-library are available for download from Zenodo. Using these data packages, you do not need to run the workflow(s) but can explore the results interactively using the nbseq library. These datasets can also be used to generate the figures in the paper using the code in the workflow/notebooks subdirectories. See detailed instructions below.

panning-minimal self-contained demonstration

The panning-minimal dataset consists of six selections (three biological replicates each of two different conditions), plus several samples where the raw input library was sequenced without panning. Each sample has been arbitrarily downsampled to 7500 reads to reduce file size and processing time.

To run the complete workflow, follow the steps above. The analysis requires 15–30 minutes to download and install conda packages, then approximately 10–15 minutes to run on a 16-core workstation.

After running the Snakemake workflow, launch a Jupyter Lab server as described in the nbseq repository, and navigate to the Jupyter notebook panning-minimal/workflow/notebooks/analysis.ipynb.

Importantly, the first time you open this notebook, you will be prompted to choose the "notebook kernel"; this determines which conda environment the Python process runs in. Assuming you installed nbseq into an environment called nbseq, choose the entry titled "Python [conda env:nbseq]" and click "Select."
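
One possible way to start the server is sketched below; the nbseq repository describes the exact setup, and this sketch assumes Jupyter Lab is installed alongside nbseq in that environment:

# launch Jupyter Lab from the root of this repository
conda activate nbseq
jupyter lab
# then open panning-minimal/workflow/notebooks/analysis.ipynb in the browser
# and select the "Python [conda env:nbseq]" kernel when prompted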

panning-extended processed dataset

Download and extract the processed dataset from Zenodo:

cd panning-extended
wget https://zenodo.org/records/12825488/files/panning-extended-results.tar.gz
tar vzxf panning-extended-results.tar.gz

Then, launch a Jupyter Lab server as described in the nbseq repository and navigate to the Jupyter notebook panning-extended/workflow/notebooks/analysis.ipynb. As above, the first time you open this notebook you will be prompted to choose the "notebook kernel", which determines which conda environment the Python process runs in; assuming you installed nbseq into an environment called nbseq, choose the entry titled "Python [conda env:nbseq]" and click "Select." There are several other Jupyter notebooks within panning-extended/workflow/notebooks/ which produce figures that appear in the manuscript and can be executed similarly (i.e. using the nbseq conda environment).

The data package will create results/ and intermediate/ subdirectories, which will contain the full amino acid and CDR3 feature tables, feature sequences, and a SQLite database for interactive exploration. All data necessary to execute the demonstration notebook is included.

Note that several additional files, such as various transformed feature tables and beta-diversity calculations, are omitted for the sake of simplicity and file size; they can be regenerated by running the included Snakemake workflow, workflow/downstream.smk. Likewise, the large MMseqs2 database (which is needed to search the dataset for VHHs with similar sequences) is not included; it can be regenerated on demand by running snakemake --use-conda --cores all -s workflow/downstream.smk -- intermediate/cdr3/features_db/ for the CDR3 feature space or snakemake --use-conda --cores all -s workflow/downstream.smk -- intermediate/aa/features_db/ for the amino acid feature space.

panning-massive processed dataset

Download and extract the processed dataset from Zenodo:

cd panning-massive
wget https://zenodo.org/records/12825488/files/panning-massive-results.tar.gz
tar vzxf panning-massive-results.tar.gz

The data package will create the results/ subdirectory, containing all data necessary to execute the Jupyter notebooks in panning-massive/workflow/notebooks/; these notebooks generate the figures that appear in the manuscript and can be executed as described above.

Note that the notebook panning-massive/workflow/notebooks/fig-suppl-learning.ipynb requires creating a different conda environment, nbseq-xgb; this environment can be created by running:

cd panning-massive
conda env create -f envs/nbseq-xgb.yaml

panning-small processed dataset

Download and extract the processed dataset from Zenodo:

cd panning-small
wget https://zenodo.org/records/12825488/files/panning-small-results.tar.gz
tar vzxf panning-small-results.tar.gz

Then explore the notebooks in panning-small/workflow/notebooks/.

alpaca-library processed dataset

Download and extract the processed dataset from Zenodo:

cd alpaca-library
wget https://zenodo.org/records/12825488/files/alpaca-library-results.tar.gz
tar vzxf alpaca-library-results.tar.gz

Then explore the notebooks in alpaca-library/workflow/notebooks/.

other-figures

other-figures/data includes the data necessary to run the notebooks within other-figures/code. These notebooks produce other figures in the paper.

Adapting the workflow to future experiments

Workflow data model

These workflows are designed to accommodate several realities of our experiments:

  • Each experiment may involve multiple selection campaigns
  • Each phage display experiment involves observing at least one population of VHHs multiple times over rounds of biopanning
  • Each sample may be sequenced across multiple sequencing runs
  • There may be technical replicate observations at the level of library preparation
  • Different selections in the experiment may use entirely different starting phage-displayed VHH libraries with different consensus reference sequences

Key elements of the data model:

  • A selection refers to enrichment of a distinct population of phage-displayed VHHs via multiple rounds of panning against a particular pair of antigen conditions (e.g. counter-selection and selection bacterial cells, antigen-negative and antigen-positive protein-coated wells, etc.). For example, a selection for flagella might use a ∆fliC mutant as the counter-selection cells and a wild-type strain as the selection cells over 4 rounds of selection.
    • Each selection is assigned a single "phage library"; this mainly dictates to which reference sequence reads should be aligned. For example, if selections in the experiment involved two different starting libraries (one from an alpaca immunization and one created synthetically), these would be considered different "phage libraries".
  • A sample is a single technical replicate observation of a population of phage/VHHs, for example: the post-amplification ("input") phage before round 2 of the above selection, prepared for sequencing with a distinct pair of barcode sequences.
  • Multiple sequencing runs may be performed. Each sequencing run will be denoised separately, then replicate observations of the same sample will be summed in the feature_table step.

Configuring the workflow for future experiments

panning-extended can be duplicated and used as a template for future experiments. Follow the installation instructions above, then duplicate the panning-extended directory and rename it. The following files need to be edited to configure the workflow for the new experiment:

  1. Edit config.yaml to configure the experiment and specify features of the starting VHH library or libraries (see the example sketch after this list), namely:

    • raw_sequence_dir: Path to folder containing raw input sequences

    • scratch: Path to scratch directory

    • libraries: Phage display (VHH) libraries included in this experiment. Dict where keys are library names and values are objects with the following attributes:

      • primer_fwd (default ""): Forward primer sequence (5' to 3'); used in the cutadapt preprocessing step to identify reads corresponding to properly-prepared amplicons

      • primer_rev (default ""): Reverse primer sequence (5' to 3'); used in the cutadapt preprocessing step to identify reads corresponding to properly-prepared amplicons

      • reference (default None): Path to the reference sequence, in FASTA format

      • reference_frame_start_nt (default 0): Nucleic acid position (0-based) indicating the first base of the first codon of the reference sequence

      • reference_length_aa (default 0): Length of the reference sequence in amino acids; if the reference sequence is longer than (reference_frame_start_nt + (reference_length_aa * 3)) nt, it will be trimmed.

      • min_fwd_end (default 0): 3' (distal) end of the forward read must align to this NA position or later; position 0 is the first base of the reference sequence, irrespective of reference_frame_start_nt.

      • max_rev_start (default 0): 3' (distal) end of reverse read must align to this NA position or earlier; position 0 is the first base of the reference sequence, irrespective of reference_frame_start_nt.

      • min_aa_length (default 69): Reads where the aligned amino acid sequence (excluding gap characters) is shorter than this length will be dropped

      • CDRs (default {}): Position of CDR and FR regions within the reference sequence, in amino acid space. In this object, position 0 refers to the amino acid corresponding to reference_frame_start_nt. This object should be a dict mapping domain names (e.g. 'CDR1', 'FR2', etc.) to [start, end] positions, where start is inclusive and end is exclusive (i.e. half-open intervals, following the Python convention).

        Example:

          'CDRs': {
          	'FR1':  [0,  21],
          	'CDR1': [21, 29],
          	'FR2':  [29, 46],
          	'CDR2': [46, 54],
          	'FR3':  [54, 92],
          	'CDR3': [92, 116],
          	'FR4':  [116,127],
          }
        
      • min_CDR_length (default {}): Reads with domains (CDR or FR regions) shorter than this length (in amino acids) will be dropped. Dict where keys are domain names (e.g. 'CDR1', 'FR2', etc.; these should correspond to domains defined in CDRs) and values are minimum lengths (in amino acids).

  2. Edit metadata.csv to specify parameters of each selection. Each row represents a single selection. Columns describe parameters of the selection, including the bacterial strains used for selection and counter-selection. The following columns are required. Additional columns can be added to identify phenotypes associated with the selection.

    • expt: Sub-experiment
    • selection: Name for the selection; typically this is in the form {plate}.{well}, corresponding to the location within the biopanning microplate
    • antigen: Which antigen is targeted by this selection
    • background_CS: Genetic background for the counter-selection strain
    • genotype_CS: Genotype of the counter-selection strain, relative to the genetic background
    • strain_CS: Strain number or identifier of the counter-selection strain, if applicable
    • cond_CS: Growth condition of the counter-selection cells
    • background_S: Genetic background for the selection strain
    • genotype_S: Genotype of the selection strain, relative to the genetic background
    • strain_S: Strain number or identifier of the selection strain, if applicable
    • cond_S: Growth condition of the selection cells
  3. Edit phenotypes.csv to provide information about the phenotypes, e.g. biological covariates of each selection denoted by additional columns of metadata.csv. Importantly, you may want to mark certain phenotypes as "antigens" for the sake of classifier training routines.

    • name: Name of the phenotype; should correspond to one of the columns in metadata.csv
    • type: "antigen" or "phenotype"
    • category: (optional) The category of antigen or phenotype, e.g. "motility"
    • locus: (optional) Locus tag
    • description: (optional) Description of the antigen or phenotype
  4. Edit samples.tsv to list each technical replicate:

    • ID: Sample identifier, generally {plate}-{well}; should correspond to a pair of raw sequence files described below
    • expt: Sub-experiment number
    • sample: Name of the selection; corresponds to the selection column in metadata.csv above
    • round: Selection round, in the format R{n}{io}, where {n} is replaced by the round number and {io} is replaced with i for input (i.e. post-amplification) or o for output (i.e. pre-amplification)
    • phage_library: Name of the starting library, used to determine the applicable reference sequence and related parameters; must correspond to a key within the libraries entry in config.yaml
    • plate: (optional) HTS library preparation plate number
    • well: (optional) HTS library preparation well number
    • depth: (optional) Expected relative sequencing depth
    • notes: (optional) Notes about the selection
  5. Edit runs.tsv to specify details about each sequencing run. The only required column is ID; each run must have a corresponding directory in ${raw_sequence_dir}/${ID} containing the demultiplexed sequencing data as described below. Additional columns may be used to record other data about the run, e.g. instrument, flow cell ID, date, etc.

  6. Add the raw data to the path raw_sequence_dir specified in config.yaml (by default, this is a directory called input):

    • There should be a subdirectory for each sequencing run, i.e.: ${raw_sequence_dir}/${sequencing_run_ID}
    • There should be a folder for each sample, i.e. ${raw_sequence_dir}/${sequencing_run_ID}/Sample_${sample_id}, where ${sequencing_run_ID} corresponds to the column ID in runs.tsv above
    • Within that sample, there should be two files named ${raw_sequence_dir}/${sequencing_run_ID}/Sample_${sample_id}/*_R1_*.fastq.gz and ${raw_sequence_dir}/${sequencing_run_ID}/Sample_${sample_id}/*_R2_*.fastq.gz containing the forward and reverse reads respectively. ${sample_id} corresponds to the column ID in samples.tsv above
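
As an illustration of step 1 above, here is a minimal, hypothetical excerpt of config.yaml; the library name, primer sequences, and file paths are placeholders rather than values from the manuscript, and the CDR positions repeat the example given above:

raw_sequence_dir: input            # folder containing raw sequencing runs
scratch: /tmp/phage-seq            # scratch directory for temporary files
libraries:
  alpaca:                          # hypothetical library name; referenced by phage_library in samples.tsv
    primer_fwd: "GGTGGTTCCTCTAGA"  # placeholder forward primer (5' to 3')
    primer_rev: "GAAGTACCAGCGGCC"  # placeholder reverse primer (5' to 3')
    reference: resources/reference.fasta   # hypothetical path to the reference sequence (FASTA)
    reference_frame_start_nt: 0    # first base of the first codon (0-based)
    reference_length_aa: 127       # reference trimmed to this many amino acids
    min_fwd_end: 0                 # 3' end of the forward read must reach this position or later
    max_rev_start: 0               # 3' end of the reverse read must align to this position or earlier
    min_aa_length: 69              # drop reads with shorter aligned amino acid sequences
    CDRs:                          # half-open [start, end) intervals in amino acid space
      FR1:  [0,  21]
      CDR1: [21, 29]
      FR2:  [29, 46]
      CDR2: [46, 54]
      FR3:  [54, 92]
      CDR3: [92, 116]
      FR4:  [116, 127]
    min_CDR_length:
      CDR3: 4                      # drop reads whose CDR3 is shorter than 4 amino acids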
