jbloomlab / barcoded_flu_pdmh1n1

Barcoded pdmH1N1 virus hashing experiment

barcoded_flu_pdmh1n1's Introduction

Barcoded pdmH1N1 influenza virus single-cell sequencing

Single-cell sequencing of barcoded pdmH1N1 influenza virus; David Bacsik and Jesse Bloom.

Pre-print of results is titled Influenza virus transcription and progeny production are poorly correlated in single cells and is available at https://www.biorxiv.org/content/10.1101/2022.08.30.505828v1.

Repository version

A static version of the repository used to generate the figures in this pre-print is tagged at: https://github.com/jbloomlab/barcoded_flu_pdmH1N1/releases/tag/bioRxiv_v1.

Data availability

All data used in this study is available in GEO under accession number GSE214938.

Summary of workflow and results

The workflow for this project has two main steps. First, the Snakemake pipeline is run; it takes raw sequencing data as input and generates a CSV containing information about viral transcription and progeny production in single influenza-infected cells. Then, the final_analysis.py.ipynb notebook is run manually to visualize the results.

For a summary of the Snakemake pipeline, see the report.html file that is placed in the ./results/ subdirectory.

Organization of repository

This repository is organized as follows (based loosely on this example snakemake repository):

Snakefile is the snakemake file that runs the analysis.
environment.yml and environment_unpinned.yml give the version-pinned and unpinned conda environments used to run the Snakemake pipeline.
config.yaml contains the configuration for the analysis.
cluster.yaml contains the cluster configuration for running the analysis on the Fred Hutch cluster.
./rules/ contains snakemake rules.
./notebooks/ contains Jupyter notebooks that are run by Snakefile using the snakemake notebook functionality.
./scripts/ contains scripts used by Snakefile.
./pymodules/ contains Python modules with some functions used by Snakefile.
./report/ contains workflow description and captions used to create the snakemake report.
./data/ contains the input data, specifically:
- ./data/flu_sequences/ gives the flu sequences used in the experiment. See the README in that subdirectory for details.
- ./data/flu_sequences/pacbio_amplions gives the amplicon sequences generated for PacBio sequencing. See the README in that subdirectory for details.
./results/ is a created directory with all results, most of which are not tracked in this repository.
./results/figures/ contains the figures generated for the manuscript.
./results/viral_fastq10x/ contains two CSV files containing key processed data:
- integrate_data.csv contains viral transcription and genotype information for all cells in the dataset.
- complete_measurement_cells_data.csv contains progeny production information, viral transcription information, and genotype information for the set of cells with complete sequencing and progeny production measurements.

Running the analysis

Installing software

The conda environment for the pipeline in this repo is specified in environment.yml; note also that an unpinned version of this environment is specified in environment_unpinned.yml. If you are on the Hutch cluster and set up to use the BloomLab conda installation, then this environment is already built and you can activate it simply with:

conda activate barcoded_flu_pdmH1N1

Otherwise you need to first build the conda environment from environment.yml and then activate it as above.

In addition to building and activating the conda environment, you also need to install cellranger and bcl2fastq into the current path; the current analysis uses cellranger version 4.0.0 and bcl2fastq version 2.20.

Run the Snakemake pipeline

Once the barcoded_flu_pdmH1N1 conda environment has been activated and the other software is on the path, simply enter the commands to run Snakefile and then generate a snakemake report at ./results/report.html. These commands, with the configuration for the Fred Hutch cluster, are in the shell script run_Hutch_cluster.bash. You probably want to submit the script itself via sbatch, using:

sbatch run_Hutch_cluster.sbatch

Run the final analysis and generate plots

When the Snakemake pipeline has run completely, the processed output data is exported to a CSV file at results/viral_fastq10x/{expt}_integrate_data.csv. A stable version of this file is available at https://github.com/jbloomlab/barcoded_flu_pdmH1N1/blob/main/results/viral_fastq10x/scProgenyProduction_trial3_integrate_data.csv and can be used to re-analyze the data without running the Snakemake pipeline.

In this repo, the CSV file is used to perform the final analysis and generate figures in the final_analysis.py.ipynb notebook. This notebook is run manually, and it must be run with the barcoded_flu_pdmH1N1_final_analysis conda environment activated.

To activate this environment, first build it from envs/barcoded_flu_pdmH1N1_final_analysis.yml and then activate it with:

conda activate barcoded_flu_pdmH1N1_final_analysis

Development

Linting the code

Ideally, before a new branch is committed, you should run the linting in lint.bash with the command:

bash ./lint.bash

This script runs:

snakemake linting
a snakemake dry run
a flake8 analysis of the Python code
a flake8_nb analysis of the Jupyter notebooks.

For the Jupyter notebook linting, it may be easiest to lint while you are still developing the notebook with run cells rather than before you put the empty notebook in ./notebooks/, as the linting results are labeled by cell run number.

barcoded_flu_pdmh1n1's Issues

Add itertools and editdistance packages to environment

@jbloom Could you please add itertools and editdistance to the conda environment whenever you get a chance?

rename `master` branch to `main`

GitHub / git are changing the names of branches in this way. It will probably simplify our lives if we just do this now:
https://www.zdnet.com/article/github-to-replace-master-with-main-starting-next-month/

Replicate plots raise imagemagick error

The replicate plots cannot be converted by imagemagick to PNG files for the report. The following error is raised when the report is generated:
Failed to convert image to png with imagemagick convert: b"\n(process:680): librsvg-CRITICAL **: 20:08:25.584: Handle could not read or parse the SVG; did you check for errors during the loading stage?\n\n(process:680): librsvg-CRITICAL **: 20:08:25.613: Handle could not read or parse the SVG; did you check for errors during the loading stage?\nconvert: negative or zero image size `results/viral_progeny/scProgenyProduction_trial2_viral_bc_replicates.svg' @ error/image.c/SetImageExtent/2640.\nconvert: no decode delegate for this image format `results/viral_progeny/scProgenyProduction_trial2_viral_bc_replicates.svg' @ error/svg.c/ReadSVGImage/3429.\nconvert: no images defined `png:-' @ error/convert.c/ConvertImageCommand/3282.\n"

One possible cause is that the replicate plots are exported as very large SVG files. I think the SVG files are very large simply because there are so many data points. However, the error includes the line negative or zero image size, so maybe this is not the cause.

I will explore this error further, but a simple fix might be to export a high-resolution raster image rather than SVG.

Align pacbio

Aligning pacbio reads to flu references.

@jbloom when I'm aligning the reads there seems to be some issue in my annotation because [most reads end up not aligning](notebooks/align_pacbio.py.ipynb). The most common reasons are 5' and 3' clip for sequenced_mRNA_1 and sequenced_mRNA_2, respectively. Maybe I don't totally understand how termini5 and termini3 need to be defined. At the moment I have these defined as sequences expected at the ends of length amplicons, but are these perhaps supposed to be termini of what is annotated as sequenced_mRNA_1/2?

Correct viral barcodes within each cell

I will correct viral barcodes within each cell in the transcriptome data. This will generate one (or a few) consensus viral barcodes for each cell.

It will also give information on the relative expression of each virus's genome in co-infected cells.

specify PacBio data

This would be in config.yaml.

Under your experiment, you're going to want to add an entry that is something like:

hashing_highMOI:
  ...
  pacbio_viral_sequencing:
    2020-10-20: <path to subreads file>
    another_run: another_subreads

Then there is actually a Python class that parses these experiments that you can use to access everything about that experiment.

This is in pymodules/experiments.py.

You will want to make that class parse this information and then add a method that is something like:

def pacbio_fastqs(self, expt):
   """list: List of all PacBio FASTQ subreads for `expt`."""
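
As a rough illustration of the suggestion above, the sketch below uses a stand-in class (`ExperimentsSketch`, a hypothetical name, not the actual class in pymodules/experiments.py) that takes the parsed experiments mapping from config.yaml and returns the PacBio paths for one experiment. The config layout mirrors the example entry above; the file paths and the second experiment are placeholders.

```python
class ExperimentsSketch:
    """Hypothetical stand-in for the experiment-parsing class."""

    def __init__(self, experiments_config):
        # Mapping of experiment name -> settings, as parsed from config.yaml.
        self._config = experiments_config

    def pacbio_fastqs(self, expt):
        """list: List of all PacBio FASTQ subreads for `expt`."""
        # Experiments without PacBio data simply return an empty list.
        runs = self._config[expt].get("pacbio_viral_sequencing") or {}
        return list(runs.values())


example_config = {
    "hashing_highMOI": {
        "pacbio_viral_sequencing": {
            "2020-10-20": "path/to/subreads_1",  # placeholder path
            "another_run": "path/to/subreads_2",  # placeholder path
        },
    },
    "no_pacbio_expt": {},  # hypothetical experiment without PacBio data
}
expts = ExperimentsSketch(example_config)
```

Returning an empty list for experiments without a `pacbio_viral_sequencing` entry is a design choice made here so that downstream rules can skip those experiments without special-casing.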

Align reads with cellranger

Use cellranger's count function to align reads, collapse UMIs, and generate a cell-gene matrix.

Quantify strand exchange

@jbloom this is related to your comments.

I have changed it so now all viral tags are added, and not via iterrows. But I'm not quite sure how to then compare against the expected wt or syn tag without iterating over rows. Could you give a suggestion? Am I just overthinking doing it via dataframes, and should I just use a lookup dictionary again?
The reason the other NA tags were not in there is that the files only included variant_tag_1 and variant_tag_2 for all segments, so the other NA tags were not parsed by alignparse. I've added the other tags and that should solve it.

Normalize viral barcode UMI counts to all UMI counts for cell

I want to filter viral barcodes in the transcriptome data with a minimum threshold. This threshold should be based on fraction of all UMIs for a given cell barcode, since capture efficiency varies wildly from droplet to droplet.

Therefore, I need to normalize the viral barcode UMI counts to the total number of UMI counts for each cell.
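
A minimal pandas sketch of this normalization is shown below. The column names, the example counts, and the 1% threshold are all assumptions chosen for illustration; they are not taken from the pipeline.

```python
import pandas as pd

# Hypothetical viral barcode UMI counts per cell (column names are assumptions).
viral_bc_counts = pd.DataFrame({
    "cell_barcode": ["AAAC", "AAAC", "TTTG"],
    "viral_barcode": ["ACGT", "GGCA", "ACGT"],
    "viral_bc_UMIs": [30, 2, 5],
})
# Hypothetical total UMI counts (all transcripts) per cell.
total_umis = pd.DataFrame({
    "cell_barcode": ["AAAC", "TTTG"],
    "total_UMIs": [1000, 100],
})


def normalize_viral_bc_counts(viral_bc_counts, total_umis, min_frac=0.01):
    """Express each viral barcode as a fraction of all UMIs in its cell."""
    merged = viral_bc_counts.merge(
        total_umis, on="cell_barcode", how="left", validate="many_to_one"
    )
    merged["frac_UMIs"] = merged["viral_bc_UMIs"] / merged["total_UMIs"]
    # Flag barcodes above the (assumed) minimum fraction rather than dropping
    # them, so the threshold can be inspected before filtering.
    merged["passes_threshold"] = merged["frac_UMIs"] >= min_frac
    return merged


normalized = normalize_viral_bc_counts(viral_bc_counts, total_umis)
```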

Convert pacbio read coordinates to original segment nucleotide positions

I don't know exactly how to do this, but, as a start:
after linearization, all amplicons end up having the ORF split into sequenced_mRNA_1 and sequenced_mRNA_2, so maybe I could just make a table that gives, for each amplicon, the start and end positions of sequenced_mRNA_1 and sequenced_mRNA_2 in relation to their original ORF. We could then use this table to convert coordinates. E.g., if there's a change at position 329 of sequenced_mRNA_2 of an HAmid aligned read and, in relation to the ORF, sequenced_mRNA_2 starts at nucleotide 300, then the position that changed in HA is 300 + 329 = 629.
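
The lookup-table idea can be sketched as below, assuming a simple additive convention in which a feature's start offset in the ORF is added to the position within the feature (so 300 + 329 = 629). The table contents are illustrative: only the HAmid sequenced_mRNA_2 offset of 300 comes from the example in the text, and the other entry is a made-up placeholder.

```python
# Offset in the original ORF at which each feature of each linearized amplicon
# starts. Keys are (amplicon, feature); values are illustrative only.
FEATURE_START_IN_ORF = {
    ("HAmid", "sequenced_mRNA_1"): 0,    # placeholder value
    ("HAmid", "sequenced_mRNA_2"): 300,  # value used in the example above
}


def to_orf_position(amplicon, feature, pos_in_feature):
    """Convert a position within an amplicon feature to an ORF position."""
    return FEATURE_START_IN_ORF[(amplicon, feature)] + pos_in_feature


ha_position = to_orf_position("HAmid", "sequenced_mRNA_2", 329)
```

Whether positions are 0-based or 1-based needs to be fixed once for the whole table; mixing the two conventions would shift every converted coordinate by one.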

tracking changes to code

I think Snakemake has a way to directly track code so we don't need to define Jupyter notebooks as input.

summarize PacBio CCSs

@Bernadetadad, let me know if you have questions on best way to do this. I have to go to daycare now, but can check back around 8:15 after I put Alice to bed.

Create consensus reads in cells

Using the hashing_highMOI_aligned_with_ref_annotation.csv mutation table created by the align_pacbio rule, we will make consensus flu reads found in cells.

Things to consider:

- select cell barcodes that are associated with 'true' infections
- group reads by cell barcode
- plot UMI frequencies and see if some UMIs are overrepresented; if so, maybe take only one read or build a consensus from the UMI group
- should the termini3 feature be used? Looking at the mutation table, it seems like almost all CCSs have mutations in termini3, and most of those mutations are just poly-dT primer misalignment, which does not really have anything to do with the actual mutations in flu reads in cells.
- related to the above, there are quite a few mutations in other segments that have long polyA tracts or something like insG/CAAAAA, which is also probably just mis-priming, but I guess they would just be filtered out when we select some mutation-frequency cutoff (Alistair used >30% of reads with a given mutation, which is maybe a sensible place to start)

sample names for progeny production pilot

@dbacsik says:

N.b. Also note that some of the samples in this pilot were assigned to the wrong sequencing indices, and this notebook manually corrects this in the results folder. This needs to be revised so that the notebook can be run automatically.

Correct viral barcodes within a sequencing sample

Using UMI tools, I will correct barcodes with errors back to their ancestral sequence. I will treat each sequencing sample (e.g. Experiment, source, gene, tag) as its own independent set of barcodes.

I will input a dataframe with the following information:

Source
Gene
Tag
Barcode
Average frequency (mean of two technical replicates)

I will export a new CSV with the corrected barcode counts for each experiment. E.g. scProgenyProduction_trial1_corrected_viral_bc_in_progeny.csv.gz.

As part of the quality control for this step I will manually examine some of the barcodes that are corrected.

UMI tools uses a directional adjacency method to correct barcodes. In layman's terms, it corrects a minority barcode to a majority barcode within a specified edit distance if the majority barcode is roughly at least twice as abundant as the minority barcode (majority count ≥ 2 × minority count − 1). The algorithm is explained well here: https://cgatoxford.wordpress.com/2015/08/14/unique-molecular-identifiers-the-problem-the-solution-and-the-proof/
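
To make the directional adjacency rule concrete, here is a simplified pure-Python sketch. It is not the UMI tools code: it assumes equal-length barcodes, uses Hamming distance as the edit distance, and the function names and example counts are made up for illustration.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatched positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))


def directional_correction(counts, threshold=1):
    """Map each raw barcode to a corrected barcode (directional adjacency).

    A barcode `child` is absorbed by `parent` if they are within `threshold`
    mismatches and count(parent) >= 2 * count(child) - 1.
    """
    # Visit barcodes from most to least abundant so seeds are majority barcodes.
    barcodes = sorted(counts, key=lambda bc: (-counts[bc], bc))
    corrected = {}
    for seed in barcodes:
        if seed in corrected:
            continue
        corrected[seed] = seed
        queue = deque([seed])
        while queue:  # breadth-first walk along directed edges
            parent = queue.popleft()
            for child in barcodes:
                if child in corrected:
                    continue
                if (hamming(parent, child) <= threshold
                        and counts[parent] >= 2 * counts[child] - 1):
                    corrected[child] = seed
                    queue.append(child)
    return corrected


example_counts = {"ATTA": 100, "ATGA": 10, "ATGG": 2, "CCCC": 50}
corrected_map = directional_correction(example_counts)
```

In this toy example ATGG is two mismatches from ATTA but is still corrected to it, because ATGA provides an intermediate step of one mismatch with a sufficiently higher count.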

add `dms_variants` to environment

For IlluminaBarcodeParser.

Plot barcode sharing across tags

The progeny viral barcode data were generated completely independently for each (source, gene, tag) combination. Now, we can ask: are viral barcodes unique to each viral tag? Or is the same barcode found in both the wt sequencing sample and the syn sequencing sample?

Incorporate conda environment into snakemake rules

I think the cluster might be sensitive to changes in conda environment that occur in the midst of a run. For example, I had the barcoded_flu_pdmH1N1 conda environment active, and I submitted a cluster job. Then, while the job was running, I switched to the BloomLab environment, and the cluster job had an issue finding a package that is present in the barcoded_flu_pdmH1N1 environment, but not in the BloomLab environment for a later rule.

It appears that the conda environment can be specified on a per-rule basis in Snakemake: https://snakemake.readthedocs.io/en/v3.9.1/snakefiles/deployment.html. This approach would 1) automate loading the project-specific conda library, which must currently be done manually before running the Snakefile; 2) make it explicit which packages are being used by the pipeline; and 3) make the pipeline robust to any changes the user performs mid-run.

set up viral genome and GTF files

Parsing new sequences David added.

Calculate edit distance between viral barcodes

We have sampled a few thousand virions from a very diverse library. In theory, I would expect this sparse sampling to yield viral barcodes that are not similar to one another. I will try to calculate the edit distance between barcodes in the viral progeny sequencing data to see if this holds true.

I will plot the results as a histogram.

The distribution of edit distances between viral barcodes will inform the threshold I choose for barcode correction.

N.b. The dms_variants tool excludes barcodes that do not match the barcode length specified in the genbank file. I think the edit distance is equal to the hamming distance in this case, since substitutions are tolerated and indels are not.
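
A small sketch of the pairwise calculation is below, assuming all barcodes have the same length so that Hamming distance can be used. The barcodes are made up, and the resulting counts are what would be plotted as a histogram.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Hamming distance between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distance_counts(barcodes):
    """Count how many barcode pairs fall at each Hamming distance."""
    return Counter(hamming(a, b) for a, b in combinations(barcodes, 2))


example_barcodes = ["AAAA", "AAAT", "TTTT"]  # made-up barcodes
distance_counts = pairwise_distance_counts(example_barcodes)
```

The number of pairs grows quadratically with the number of barcodes, so for a few thousand barcodes this brute-force approach evaluates several million pairs; that is usually still feasible, but subsampling may be worth considering.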

Load data from pilot on 20200116

Just received sequencing back from experiment I did on 01/16. I will add this sample and run the pipeline.

what samples should we keep paying attention to?

What samples are pilots that are really used for all planned purposes, versus samples that are useful for either barcode hashing or progeny production?

I'd suggest removing from config YAML the ones that we no longer need, and also just thinking about names for the ones we care about (current names may be fine and may be needed to be retained for matching other stuff you've done, but if that's not the case and you want to rename this is the time).

Trim reads for quality

From what I can tell, STARsolo is not doing any quality trimming before alignment. Since we used 272 cycles for read 2, there might be some quality loss at the end of these reads. I will figure out how to trim them before alignment and add a Snakemake rule to do so.

`expect_ncells` for each experiment

@dbacsik, I'm updating pipeline to make it possible to incorporate this, so wait until that. But then we should specify expected number of cells for each experiment rather than a universal value for all experiments as done now.

Add umi_tools to environment

@jbloom Could you please add umi_tools to the environment?

Call infection rate

Want to see what fraction of cells were infected by at least one virus. For a first pass, I will use a simple cutoff of 1% of UMIs coming from flu reads.
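
A first-pass version of this cutoff could look like the sketch below. The input format (a list of total and flu UMI counts per cell) and the example numbers are assumptions for illustration.

```python
def infection_rate(cells, min_viral_frac=0.01):
    """Fraction of cells whose flu UMIs make up at least `min_viral_frac`.

    `cells` is a list of (total_UMIs, flu_UMIs) tuples, one per cell.
    """
    infected = sum(
        1 for total, flu in cells
        if total > 0 and flu / total >= min_viral_frac
    )
    return infected / len(cells)


# Made-up cells: flu fractions are 0, 0.005, 0.02, and 0.5.
example_cells = [(1000, 0), (1000, 5), (1000, 20), (500, 250)]
rate = infection_rate(example_cells)
```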

run `ccs` on the subreads

So you can make a snakemake rule that for each subreads FASTQ, it runs the ccs program.

It might be something like this:
https://github.com/jbloomlab/SARS-CoV-2-RBD_DMS/blob/master/Snakefile#L386-L413

define "true" viral barcodes

@dbacsik says:

The main question right now is how we define and filter "true" viral barcodes from the transcriptome data and from the supernatant/second infection sequencing data.

The data show a lot of viral barcodes that are unique to each sequencing sample. The number of unique viral barcodes in the sequencing far exceeds the plausible number of virions infected into the cells (~1.25e4 virions).
To distinguish real barcodes from errors, I have used the following filtering criteria:

The barcode must be present in the “cell” (first infection) sample. It stands to reason that any real barcode present in the supernatant or second infection should be derived from the first infection.

The barcode must be present above some threshold frequency.

This works OK, and there are reasonable correlations between the barcodes in the first infection, supernatant, and second infection. However, there is still substantial noise, and the raw data still shows barcodes that are present at high frequency in the supernatant or second infection but not found in the first infection.

Specify viral bc sequencing information

Need to decide on how we will specify viral bc sequencing data. Will we start with BCL files+indices, or fastq files? Then, need to write rules that load and parse the barcodes.

Make dataframe names unique

Many plots depend on temporary dataframes, which I often name working_df. The code cannot be run out of order, or the wrong dataframe will be passed to a plot. I will rename each dataframe with a unique and descriptive identifier for stability and readability.

`STARsolo` flag `--soloFeatures GeneFull`, should we use it?

@dbacsik, I am not up to speed on details, but I recall a few times that @Bernadetadad has suggested that this flag might be useful. It doesn't appear to be in current pipeline. Is that something we should explore using?

Download GTF files of genomes

Write Snakemake rules to download cell genome GTF files.

Make issues automatically show up on project board.

integrate viral barcode frequencies and transcriptomics

We need to figure out how to do this.

Download cellular genomes

Add to config.yaml information on cell_genome, cell_gtf, spikein_cell_genome, etc.

Then add rules to Snakefile to download these somewhere into results in a sensibly named subdirectory.

You need to add those files themselves to the final all rule in Snakefile to make it download them.

only call `viral_barcodes_in_progeny` rule for samples with progeny sequencing data

I wrote a rule, viral_barcodes_in_progeny, which parses and counts viral barcodes in supernatant or second infection sequencing data.

Right now, this rule is called for all experiments in the config.yaml file. However, only some experiments have this type of data.

So, I need to write some kind of logic into the pipeline so that this rule is only called for experiments which have viral_barcode data.
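
One way to express this logic is a small helper that lists only the experiments whose config entry contains the relevant key. The sketch below assumes the experiments are available as a plain mapping parsed from config.yaml; the key name `viral_barcodes` and the experiment names are hypothetical.

```python
def expts_with_data(experiments_config, key):
    """Names of experiments whose config entry has a non-empty `key`."""
    return [
        expt for expt, settings in experiments_config.items()
        if settings.get(key)
    ]


# Hypothetical config: only one experiment has progeny barcode sequencing.
example_config = {
    "expt_with_progeny": {"viral_barcodes": {"supernatant": "placeholder"}},
    "expt_without_progeny": {},
}
progeny_expts = expts_with_data(example_config, "viral_barcodes")
```

In the Snakefile, a list like this could then be used when building the targets of the final all rule, so that the outputs of viral_barcodes_in_progeny are only requested for those experiments.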

Make "Reference Package" for cellranger

Create a set of Snakemake rules that automatically generate a cellranger "reference package" (i.e. reference genome to align against). Make sure to include the flu genome, the host genome, and the spike in genome.

Build target pacbio amplicons

@jbloom I think the next step in pacbio analysis is to make target amplicon sequences for pacbio runs. I'll follow the docs in the single-cell example for alignparse. I can definitely build the .gb file manually, but I'm not sure I'll be able to write python scripts to reformat plasmid maps without some help. Also, I assume each segment should have two references: one for amplification from the termini and one for amplification from the middle of the segment.

cell annotations CSV is gzipped but not in name

differences in tagged NA segments?

@dbacsik: You uploaded plasmid maps for the wildtype and synonymous tagged NA and HA genes here:
https://github.com/jbloomlab/flu_pdmH1N1_barcode_hashing/tree/master/data/flu_sequences/plasmid_maps

For HA and all other non-barcoded genes, things are as expected, with the tagged variants differing from wildtype by two nucleotides, one at each end.

But for NA, there are 8 differences, 6 of which are in the middle of NA and don't correspond to what I thought would be tags. The differences in 0-based indexing of the vRNA are:

[(241, 'G', 'A'), (817, 'C', 'T'), (826, 'C', 'T'), (827, 'A', 'T'), (828, 'G', 'C'), (829, 'C', 'T'), (1246, 'C', 'T'), (1399, 'C', 'A')]

Can you take a look ASAP and see what is going on here? If we have lots of additional differences in NA we should assess the impact of that; or if it's just mis-labeled maps we should fix that.
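
For reference, a list of differences in this format (0-based index, wildtype nucleotide, variant nucleotide) can be produced with a comparison like the sketch below. It assumes the two sequences are already the same length and aligned position by position; the short sequences are made up.

```python
def substitutions(wt_seq, variant_seq):
    """List (0-based index, wt nucleotide, variant nucleotide) differences."""
    if len(wt_seq) != len(variant_seq):
        raise ValueError("sequences must be the same length")
    return [
        (i, wt_nt, var_nt)
        for i, (wt_nt, var_nt) in enumerate(zip(wt_seq, variant_seq))
        if wt_nt != var_nt
    ]


example_diffs = substitutions("ACGTAC", "ACCTAA")  # made-up sequences
```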

Jupyter-nbconvert raises error

@jbloom I think I figured out how to solve this problem. I do not have permission to access the html directory in /fh/fast/bloom_j/software/miniconda3/envs/barcoded_flu_pdmH1N1/share/jupyter/nbconvert/templates/. A file here is needed by jupyter-nbconvert. Could you please change the group permissions for this directory and its files?

The details and rationale are explained below:
In the current pull request, #52 refactor of pipeline, the qc_fastq10x rule triggers an error when it calls jupyter-nbconvert. I imagine other rules that use the new log: notebook= format will suffer from the same fate, however, the pipeline stops before it can call them currently. This is with the most recent barcoded_flu_pdmH1N1 conda environment.

The error message is in the attached error.txt file. It ends in this permission error:
PermissionError: [Errno 13] Permission denied: '/fh/fast/bloom_j/software/miniconda3/envs/barcoded_flu_pdmH1N1/share/jupyter/nbconvert/templates/html/conf.json'

All of the directories in the nbconvert/templates directory are owned by bloom_j_grp. All of the directories except html have a read flag in the group place in the permissions. I cannot open the html directory, but I can browse the other directories.

`conda` environment for analysis

We should make a dedicated conda environment for these analyses rather than using BloomLab

call truly infected cells

Can just be a cutoff of UMIs from virus.

We could also use the viral tag data. Something like this:

For each cell, calculate the ratio of each viral tag variant's UMI counts to the total UMI counts for all transcripts. So for an uninfected cell, this will be close to zero for both tags. For a cell infected with just wt_tag, it will be large for wt_tag and small for syn_tag. For a doublet of infected cells, it will be large for both tags.
For each cell, then take the ratio above for the less abundant tag in that cell. If it's not a doublet, this represents background viral mRNA from leakage.
Assume some max plausible doublet rate. (Say 10% although we should think about this). Remove the top 10% of the cells in the step above.
Now using the other 90% of cells, we have an estimate of how many viral tag counts you should get just from background.
Call infected any cells that have the more abundant tag as higher than that.
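
The steps above can be sketched as follows. This is one possible reading, with assumptions labeled: the background estimate is taken as the maximum minor-tag fraction among the retained cells (a percentile would be an equally plausible choice), and the per-cell fractions are made-up numbers.

```python
def call_infected(tag_fracs, max_doublet_rate=0.10):
    """Call infected cells from per-cell viral tag fractions.

    `tag_fracs` maps cell -> (wt_tag fraction, syn_tag fraction), where each
    fraction is that tag's UMIs divided by the cell's total UMIs.
    """
    # Fraction for the less abundant tag in each cell, sorted ascending.
    minor = sorted(min(wt, syn) for wt, syn in tag_fracs.values())
    # Drop the top cells, assumed to be doublets of infected cells.
    n_drop = int(len(minor) * max_doublet_rate)
    background = minor[: len(minor) - n_drop]
    # Assumption: use the largest remaining value as the background level.
    cutoff = max(background)
    return {
        cell: max(wt, syn) > cutoff
        for cell, (wt, syn) in tag_fracs.items()
    }


example_fracs = {  # made-up fractions for ten cells
    "cell0": (0.000, 0.000),
    "cell1": (0.001, 0.000),
    "cell2": (0.200, 0.001),
    "cell3": (0.002, 0.300),
    "cell4": (0.150, 0.120),  # looks like a doublet
    "cell5": (0.000, 0.001),
    "cell6": (0.001, 0.002),
    "cell7": (0.100, 0.000),
    "cell8": (0.000, 0.0005),
    "cell9": (0.001, 0.000),
}
calls = call_infected(example_fracs)
infected_cells = sorted(cell for cell, is_inf in calls.items() if is_inf)
```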

Plot distributions and correlations of viral barcode replicates

The viral barcodes were sequenced in two technical replicates, starting from reverse transcription. I will plot the correlation between the two replicates for each sample.

@jbloom a few questions while I set this up:

Should this be done on raw barcode counts, or after error correction and filtering? I am leaning towards raw data, since one of the things I want to assess is bottlenecking, and I think that will be easier to interpret before filtering.
Should this be its own notebook, or added to the barcode parsing notebook? I am leaning towards making it its own notebook.

Transform corrected barcode dataframe into lookup table

I have generated the directional adjacency groups for correcting viral barcodes. The code is in this notebook.

Between 1/4 and 1/2 of barcodes are corrected for each sample. I was surprised that the fraction is this high. However, when I manually spot checked the groups, they all make sense. The vast majority of raw barcodes in a group are within an edit distance of 1 to the corrected barcode. When a raw barcode is an edit distance of 2 from the corrected barcode, there is another raw barcode in the group that represents an intermediate step of edit distance 1 (e.g. [ATTA -> ATGA -> ATGG]).

@jbloom the current format of the groups is a data frame where each row represents an adjacency group. The index is the corrected barcode, and each column contains a raw barcode or None:

Could you please help me transpose this data frame so it can function as a lookup table? The format for this will be:
Each row represents a raw barcode. There are two columns. One contains the raw barcode and the other contains the corrected barcode.

Then, I will merge this lookup table into the barcode frequency dataframe to start aggregating barcode frequencies.

I have been trying to perform this transformation on my own and have not been able to figure it out.
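
One way to do this reshaping in pandas is to melt the wide data frame into long format and drop the empty cells. The sketch below assumes the layout described above (index is the corrected barcode, each column holds a raw barcode or None); the column names and barcodes are made up, and it assumes the corrected barcode itself also appears among the raw barcodes of its group.

```python
import pandas as pd

# Made-up adjacency groups: one row per group, indexed by corrected barcode.
groups = pd.DataFrame(
    {
        "raw_0": ["ATTA", "CCCC"],
        "raw_1": ["ATGA", None],
        "raw_2": ["ATGG", None],
    },
    index=pd.Index(["ATTA", "CCCC"], name="corrected_barcode"),
)

lookup = (
    groups
    .reset_index()
    # One row per (group, column) pair; the raw barcode lands in `raw_barcode`.
    .melt(id_vars="corrected_barcode", value_name="raw_barcode")
    # Remove the None padding from groups with fewer members.
    .dropna(subset=["raw_barcode"])
    [["raw_barcode", "corrected_barcode"]]
    .sort_values(["corrected_barcode", "raw_barcode"])
    .reset_index(drop=True)
)
```

The resulting two-column table can then be merged into the barcode frequency data frame on the raw barcode column, for example with `DataFrame.merge(lookup, on="raw_barcode")`, assuming that data frame uses the same column name.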

Parse viral supernatant barcode sequencing

use new `Snakemake` Jupyter notebook integration

Snakemake now has a Jupyter notebook integration which we should probably use to replace the way we were running notebooks before: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration

sequencing depth

With respect to the hashing experiment, @dbacsik writes:

Can we discuss whether or not sequencing depth is sufficient to be confident in “single-virion” infection calls? How can we assess this?

Load data from further sequencing

Split a HiSeq lane with Adam and got back ~50 million reads. Going to add this data to the pilot analysis.

Calculate frequencies and export mean of viral barcodes

Because different samples are sequenced to different depths, viral barcodes are best compared as frequencies rather than absolute counts.

To the viral barcode QC notebook, I will add the following:

Calculate frequencies of viral barcodes within a sequencing sample.
Plot the correlation of these frequencies for technical replicates.
Take the mean of the two replicates' frequencies. It is important to take the mean of the frequencies, rather than the mean of the counts, so that samples with higher sequencing depth are not overweighted in the average.
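
The difference between averaging frequencies and averaging counts can be seen in a small sketch. The replicate counts below are made up, with the second replicate sequenced ten times deeper than the first.

```python
def frequencies(counts):
    """Convert barcode counts to frequencies within one sequencing sample."""
    total = sum(counts.values())
    return {bc: n / total for bc, n in counts.items()}


def mean_frequency(rep1_counts, rep2_counts):
    """Mean of per-replicate frequencies; absent barcodes count as zero."""
    f1, f2 = frequencies(rep1_counts), frequencies(rep2_counts)
    return {
        bc: (f1.get(bc, 0.0) + f2.get(bc, 0.0)) / 2
        for bc in sorted(set(f1) | set(f2))
    }


rep1 = {"ACGT": 90, "GGCA": 10}    # depth 100
rep2 = {"ACGT": 500, "GGCA": 500}  # depth 1000
mean_freqs = mean_frequency(rep1, rep2)

# For comparison: pooling raw counts lets the deeper replicate dominate.
pooled_freq_acgt = (rep1["ACGT"] + rep2["ACGT"]) / (100 + 1000)
```

Here the mean of frequencies for ACGT is (0.9 + 0.5) / 2 = 0.7, whereas pooling the counts gives 590 / 1100, or about 0.54, which sits much closer to the deeper replicate's value.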