looselab / readfish

CLI tool for flexible and fast adaptive sampling on ONT sequencers

Home Page: https://looselab.github.io/readfish/

License: GNU General Public License v3.0

Python 100.00%
adaptive-sampling bioinformatics genomics ont oxford-nanopore sequencing

readfish's Introduction

If you are anything like us (Matt), reading a README is the last thing you do when running code. PLEASE DON'T DO THAT FOR READFISH. This will effect changes to your sequencing and - if you use it incorrectly - cost you money. We have added a list of GOTCHAs at the end of this README. We have almost certainly missed some... so - if something goes wrong, let us know so we can add you to the GOTCHA hall of fame!

Note

We also have more detailed documentation for your perusal at https://looselab.github.io/readfish

Note

Now also see our cool FAQ.

readfish is a Python package that integrates with the Read Until API.

The Read Until API provides a mechanism for an application to connect to a MinKNOW server to obtain read data in real-time. The data can be analysed in the way most fit for purpose, and a return call can be made to the server to unblock the read in progress and so direct sequencing capacity towards reads of interest.
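To make that loop concrete, here is a minimal sketch of the request/response cycle using ONT's read_until Python package directly (readfish builds on this API, adding basecalling, mapping and experiment logic on top). The default connection settings and the reject-everything decision are illustrative assumptions only.

import time

from read_until import ReadUntilClient

# Minimal Read Until loop: request read chunks and ask MinKNOW to eject each one.
# This is the same spirit as `readfish unblock-all`; a real application would
# basecall and map read.raw_data before deciding what to do with the read.
client = ReadUntilClient()  # assumes MinKNOW on localhost with default ports
client.run()

while client.is_running:
    for channel, read in client.get_read_chunks(batch_size=512, last=True):
        client.unblock_read(channel, read.number)         # eject the molecule
        client.stop_receiving_read(channel, read.number)  # no more chunks needed
    time.sleep(0.1)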

This implementation of readfish requires Guppy version >= 6.0.0 and MinKNOW core version >= 5.0.0. It will not work on earlier versions.

Since MinKNOW core version >=5.9.0 and Dorado server version >=7.3.9, Dorado requires an alternative library, ont-pybasecall-client-lib. We have introduced a new dorado module to handle this.

The code here has been tested with Guppy in GPU mode using GridION Mk1 and NVIDIA RTX2080 on live sequencing runs and an NVIDIA GTX1080 using playback on a simulated run (see below for how to test this). This code is run at your own risk as it DOES affect sequencing output. You are strongly advised to test your setup prior to running (see below for example tests).

Supported Sequencing Platforms

The following platforms are supported:

  • PromethION Big Boy
  • P2Solo Smol Big Boy
  • GridION Box
  • MinION Smol Boy

Warning

PromethION support is currently only available using the Mappy-rs plugin. See here for more information.

Supported OSs

The following OSs are supported:

  • Linux yay
  • MacOS boo (Apple Silicon, only with Dorado)

Note

Note - MacOS support requires MinKNOW 5.7 or greater, using the Dorado basecaller on Apple Silicon devices only.

Citation

The paper is available at Nature Biotechnology and bioRxiv

If you use this software please cite: 10.1038/s41587-020-00746-x

Readfish enables targeted nanopore sequencing of gigabase-sized genomes Alexander Payne, Nadine Holmes, Thomas Clarke, Rory Munro, Bisrat Debebe, Matthew Loose Nat Biotechnol (2020); doi: https://doi.org/10.1038/s41587-020-00746-x

Other works

An updated preprint is available at bioRxiv

Barcode aware adaptive sampling for Oxford Nanopore sequencers Alexander Payne, Rory Munro, Nadine Holmes, Christopher Moore, Matt Carlile, Matthew Loose bioRxiv (2021); doi: https://doi.org/10.1101/2021.12.01.470722

Installation

Our preferred installation method is via conda.

The environment is specified as:

name: readfish
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - pip
  - pip:
    - readfish[all]

Saving the snippet above as readfish_env.yml and running the following commands will create the environment.

conda env create -f readfish_env.yml
conda activate readfish

Apple Silicon

Some users may encounter an issue with grpcio on Apple Silicon. This can be fixed by reinstalling grpcio as follows:

pip uninstall grpcio
GRPC_PYTHON_LDFLAGS=" -framework CoreFoundation" pip install grpcio --no-binary :all:

Installing with development dependencies

A conda yaml file is available for installing with dev dependencies - development.yml

curl -LO https://raw.githubusercontent.com/LooseLab/readfish/e30f1fa8ac7a37bb39e9d8b49251426fe1674c98/docs/development.yml?token=GHSAT0AAAAAACBZL42IS3QVM4ZGPPW4SHB6ZE67V6Q
conda env create -f development.yml
conda activate readfish_dev

‼️ Important !!

MinKNOW is transitioning from Guppy to Dorado. Until MinKNOW version 5.9 both Guppy and Dorado used ont-pyguppy-client-lib.
As of MinKNOW version 5.9 and Dorado server version 7.3.9 and greater Dorado requires an alternate library, ont-pybasecall-client-lib.
The listed ont-pyguppy-client-lib or ont-pybasecall-client-lib version may not match the version installed on your system. To fix this, please see this issue, using the appropriate library.
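As a quick sanity check (a sketch, not part of readfish itself), you can query which client library is installed in your environment and compare its version against the basecall server version reported by MinKNOW:

from importlib.metadata import PackageNotFoundError, version

# Report which ONT basecall client libraries are installed, and their versions.
for pkg in ("ont-pybasecall-client-lib", "ont-pyguppy-client-lib"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")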

ONT's Guppy GPU should be installed and running as a server.

Alternatively, install readfish into a Python virtual environment:
# Make a virtual environment
python3 -m venv readfish
. ./readfish/bin/activate
pip install --upgrade pip

# Install our readfish Software
pip install readfish[all]

# Install ont_pyguppy_client_lib that matches your guppy server version. E.G.
pip install ont_pyguppy_client_lib==6.3.8

Usage

usage: readfish [-h] [--version]  ...

positional arguments:
                   Sub-commands
    targets        Run targeted sequencing
    barcode-targets
                   Run targeted sequencing
    unblock-all    Unblock all reads
    validate       readfish TOML Validator

options:
  -h, --help       show this help message and exit
  --version        show program's version number and exit

See '<command> --help' to read about a specific sub-command.

TOML File

For information on the TOML files see TOML.md. There are several example TOMLs here, with comments explaining what each field does, as well as the overall purpose of the TOML file.

Testing

To test readfish on your configuration we recommend first running a playback experiment to test unblock speed and then selection.

The following steps should all happen with a configuration (test) flow cell inserted into the target device. A simulated device can also be created within MinKNOW, following these instructions. This assumes that you are running MinKNOW locally, using default ports. If this is not the case, a developer API token is required on the commands, as well as setting the correct port.

If no test flow cell is available, a simulated device can be created within MinKNOW, following the below instructions.

Adding a simulated position for testing

  1. Linux

    In the readfish virtual environment we created earlier:

    • See help
    python -m minknow_api.examples.manage_simulated_devices --help
    • Add Minion position
    python -m minknow_api.examples.manage_simulated_devices --add MS00000
    • Add PromethION position
    python -m minknow_api.examples.manage_simulated_devices --prom --add S0
  2. Mac

    In the readfish virtual environment we created earlier:

    • See help
    python -m minknow_api.examples.manage_simulated_devices --help
    • Add Minion position
    python -m minknow_api.examples.manage_simulated_devices --add MS00000
    • Add PromethION position
    python -m minknow_api.examples.manage_simulated_devices --prom --add S0
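To confirm that the simulated position is visible, a minimal check with the minknow_api package (installed alongside readfish) is sketched below; it assumes a local MinKNOW instance with default ports.

from minknow_api.manager import Manager

# Connect to the local MinKNOW manager and list every flow cell position,
# including any simulated positions added above.
manager = Manager(host="localhost")
for position in manager.flow_cell_positions():
    print(position)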

As a backup, it is possible to restart MinKNOW with a simulated device. This is done as follows:

  1. Stop MinKNOW

    On Linux:

    cd /opt/ont/minknow/bin
    sudo systemctl stop minknow
  2. Start MinKNOW with a simulated device

    On Linux

    sudo ./mk_manager_svc -c /opt/ont/minknow/conf --simulated-minion-devices=1 &

You may need to add the host 127.0.0.1 in the MinKNOW UI.

Configuring bulk FAST5 file Playback

Download an open access bulk FAST5 file, either R9.4.1 (4 kHz) or R10 (5 kHz). This file is 21 GB, so make sure you have sufficient space. A PromethION bulkfile is also available, but please note this is R10.4 4 kHz and so will give slightly unexpected results on MinKNOW, which assumes 5 kHz. This file is approx 35 GB in size.

Previously, setting up playback from a pre-recorded bulk FAST5 file required editing the sequencing configuration file that MinKNOW uses. This is no longer the case. The "old method" steps are left after this section for reference only, or in case the direct playback from a bulk file option is removed in future.

To start sequencing using playback, simply begin setting up the run in the MinKNOW UI as you would usually. Under Run Options you can select Simulated Playback and browse to the downloaded Bulk Fast5 file.

Run Options Screenshot

Note

The below instructions, whilst they will still work, are no longer required. They are left here for reference only. As of MinKNOW 5.7, it is possible to select a bulk FAST5 file for playback in the MinKNOW UI.

Old method Configuring bulk FAST5 file Playback

To setup a simulation the sequencing configuration file that MinKNOW uses must be edited. Steps:
  1. Download an open access bulkfile - either R9.4.1 or R10 (5 kHz). These files are approximately 21 GB so make sure you have plenty of space. The files are from NA12878 sequencing data using either R9.4.1 or R10.4 pores. Data is not barcoded and the libraries were ligation preps from DNA extracted from cell lines.

  2. A PromethION bulkfile is also available, but please note this is R10.4, 4 kHz, and so will give slightly unexpected results on MinKNOW, which assumes 5 kHz.

  3. Copy a sequencing TOML file to the user_scripts folder:

    On Mac if your MinKNOW output directory is the default:

    mkdir -p /Library/MinKNOW/data/user_scripts/simulations
    cp /Applications/MinKNOW.app/Contents/Resources/conf/package/sequencing/sequencing_MIN106_DNA.toml /Library/MinKNOW/data/user_scripts/simulations/sequencing_MIN106_DNA_sim.toml

    On Linux:

    sudo mkdir -p /opt/ont/minknow/conf/package/sequencing/simulations
    cp /opt/ont/minknow/conf/package/sequencing/sequencing_MIN106_DNA.toml /opt/ont/minknow/conf/package/sequencing/simulations/sequencing_MIN106_DNA_sim.toml
  4. Edit the copied file to add the following line under the line that reads "[custom_settings]":

    simulation = "/full/path/to/your_bulk.FAST5"
    

    Change the text between the quotes to point to your downloaded bulk FAST5 file.

  5. Optional: if running Guppy in GPU mode, set the parameter break_reads_after_seconds = 1.0 to break_reads_after_seconds = 0.4. This results in a smaller read chunk. For R10.4 this is not required but can be tried. For adaptive sampling on PromethION, this should be left at 1 second.

  6. In the MinKNOW GUI, right click on a sequencing position and select Reload Scripts. Your version of MinKNOW will now playback the bulkfile rather than live sequencing.

  7. Start a sequencing run as you would normally, selecting the corresponding flow cell type to the edited script (here FLO-MIN106) as the flow cell type.

Whichever instructions you followed, the run should start and immediately begin a mux scan. Let it run for around five minutes after which your read length histogram should look as below: Control Image Screenshot

Testing unblock response

Now we shall test unblocking by running readfish unblock-all which will simply eject every single read on the flow cell.

  1. To do this run:
    readfish unblock-all --device <YOUR_DEVICE_ID> --experiment-name "Testing readfish Unblock All"
  2. Leave the run for a further 5 minutes and observe the read length histogram. If unblocks are happening correctly you will see something like the below: Unblock All Screenshot A closeup of the unblock peak shows reads being unblocked quickly: Closeup Unblock Image

If you are happy with the unblock response, move on to testing base-calling.

Note: The plots here are generated from running readfish unblock-all on an Apple Silicon laptop. The unblock response may be faster on a GPU server.

Testing base-calling and mapping

To test selective sequencing you must have access to either a Guppy basecall server (>=6.0.0) or a Dorado basecall server, and a readfish TOML configuration file.

NOTE: Guppy and Dorado are used here interchangeably as the basecall server. Dorado is gradually replacing Guppy. All readfish code is compatible with Guppy >=6.0.0 and Dorado >=0.4.0.

  1. First make a local copy of the example TOML file:

    curl -O https://raw.githubusercontent.com/LooseLab/readfish/master/docs/_static/example_tomls/human_chr_selection.toml
  2. If on PromethION, edit the mapper_settings.mappy section to read:

    [mapper_settings.mappy-rs]
  3. If on MinKNOW core>=5.9.0 and Dorado server version >=7.3.9, edit the basecaller section to read:

    [caller_settings.dorado]
  4. Modify the fn_idx_in field in the file to be the full path to a minimap2 index of the human genome.

  5. Modify the targets fields for each condition to reflect the naming convention used in your index. This is the sequence name only, up to but not including any whitespace. e.g. >chr1 human chromosome 1 would become chr1. If these names do not match, then target matching will fail.
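A quick way to double-check those names is to list every sequence name in the index and compare against your targets. The sketch below uses the mappy package (which the mappy plugin wraps); the index path is illustrative.

import mappy

# Load the minimap2 index referenced by fn_idx_in and print its sequence names.
aligner = mappy.Aligner("/path/to/hg38_no_alts.mmi")  # illustrative path
if not aligner:
    raise RuntimeError("failed to load the minimap2 index")
for name in aligner.seq_names:
    print(name)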

We can now validate this TOML file to see if it will be loaded correctly.

readfish validate human_chr_selection.toml

Errors with the configuration will be written to the terminal along with a text description of the conditions for the experiment as below.

2023-10-05 15:29:18,934 readfish /home/adoni5/mambaforge/envs/readfish_dev/bin/readfish validate human_chr_selection.toml
2023-10-05 15:29:18,934 readfish command='validate'
2023-10-05 15:29:18,934 readfish log_file=None
2023-10-05 15:29:18,934 readfish log_format='%(asctime)s %(name)s %(message)s'
2023-10-05 15:29:18,934 readfish log_level='info'
2023-10-05 15:29:18,934 readfish no_check_plugins=False
2023-10-05 15:29:18,934 readfish no_describe=False
2023-10-05 15:29:18,934 readfish prom=False
2023-10-05 15:29:18,934 readfish toml='human_chr_selection.toml'
2023-10-05 15:29:18,934 readfish.validate eJydVk1v2zgQvetXEMqlxdryxyZAGyAHt0WKAk1TNNlTkBVoiZKIUKQiUonTX79vSEmW2zRo1/BBIkdvZt68GfKIXXV1zdunU3Z9efGZZUYXsmSFVIIVpmWt4GruZC3YlluRcaWkLmdM6FZmFR7JKDpi7tGwrGpNbayphWWv8MLWS8Z1ztar16zAFnOVYFVXc81KoWHGpGacWaDAWStKaXQCrOtK2v6V8aZREnjOMLiGC661UJbxrDXWesTHylCsyjxmTCiVRAOE2PG6wRYeQ1ZdK/KQVKc1xf4oXYUQ2be3yXGyYttO3e0zd8I6GHm8jxSvzJjjbSkc3LeC2UZkspCAzGUrMqeeKB+KyJlaBZx+AXA1UBiLEYiTZYwXeOD2dLo6s4B3M6FzPLVgjswokj6RGQyrdj1bzlbL5eyvOCYMvxQTbZ8KqsCa0h1Dm3n3AugIOHhB0iBy61+tzAVxwvvEZgyUbw1ICQHYUA5hBWu4c6LVoBJ84eta7gjeG0/TRki+qklmHzwHnr8NXJrCW20FKoUdofLAo9g1ikuNMPBhbbCSC8elGmBzk3U1UuCOBDEHWuVcY08XC2WMFYpvkxJ17LaJNAvINS+krRYUTFK5WpH7d710RTsqwaNFN2E1tcJRrW1Sdk3zdOurMv39639szuJgEY8UW2Y6oFaIRI8tIqglPpKhX5r3bXPoPChE81pEfdOdsTjXPG1XS5JjKt4k6/R4udw2Nj25q76nBbeONLHJ81ZA/T2j5eioz3FONQOLBe+UY7y3JiUFUyhENhkIXLi6WYSMFif4JdFgjFCeNyH/54jjHnm7pnMeVupcPsi844rmBWQTGhD/y6/Xny6/bD4jJu7bFVKivAcdZTQG7jvpBFMkQRLcN1GbB7xDE9T3ubR8SzrKxbYrU2U8UUo+iDQ4K+7joDFZamT//rDCNUbItML0/nKFvcW0wn6BlO0f5q3tc8FI8i4H56RSEFCgp3TmY++sSNhVZTqVU9MI6BQRnm+urjd+AGh2cfEpKnQq810KvSOxRQVKFjw3Wp4sPvTat4t30khNcwRpZRY6L+yiKv9+k2qTcuXAAk/K74nFuHRhA8MgvUjqWlLJLhtiA/X5ujmfVo5oDGl4N39G/+S7hhfk5ktXb5GgF6YvziAPsSP9vwpM0qEwUPl6fNsbEMNGq6fXkU4HnDN2HPng/LFwUGMUzQrxZ1PhiIOMJyvtPBw0IVA/fUaaw2n0QRRSS+dtkFfc6a0y2V2Mady0JhMij30KsXWmgSIzgVaA0GI/3DgBK0w8G0UksyPWf3RKMxGD0IBl7zarOn1HhNNo5o2jwwrzVRS06donoge7Nb8DqjZeSDkUahHZjMlEDMg+E20eB4d9wKfsn/DglUuMDAaHgQ+BDVZ9SFbcd6RqxETZ0jds/AZneEni8jn0NXc/HuX+3An3huFmkdux8s7fH/obAz2tkujmpi/O7W1Ec5KEh/tDSidzHNVSp73DM7ZCHliQdVczPYqw3+5J5CNf4yHGcxHVfLfHOSYcvnseJzQ0nUsjEMpB0WO6FTgeQRryxbghBYUJsUUvpRMXPPNjabhInLGb+Mv7dLlav10vk1V8SwX58bbhbyN7JqNwY0qNnxeH1Yvp+433QeE6Uov0x0TrL0Jebj3lZOI9JFGNg0L+CvBlRC9eh3vZFHvWX60cUwKHhd8afA3RFwV5G9ppmMO/HT3pRDq/CqQAPuTxPPQL2NSq/lu6L/YeLJIIm9AtGez9JBGmLjqCvIxDYPJ7MQltxmaazhqChOdnA/8NyIGWKeJP4jvA/gXiz7IPWqD7mQ16ZnvIyF/n0oNenFDyv3yEG+IeMvoPUvBL7w==
2023-10-05 15:29:18,937 readfish.validate Loaded TOML config without error
2023-10-05 15:29:18,937 readfish.validate Initialising Caller
2023-10-05 15:29:18,945 readfish.validate Caller initialised
2023-10-05 15:29:18,945 readfish.validate Initialising Aligner
2023-10-05 15:29:18,947 readfish.validate Aligner initialised
2023-10-05 15:29:18,948 readfish.validate Configuration description:
Region hum_test (control=False).
Region applies to section of flow cell (# = applied, . = not applied):

    ################################
    ################################
    ################################
    ################################
    ################################
    ################################
    ################################
    ################################

2023-10-05 15:29:18,948 readfish.validate Using the mappy plugin. Using reference: /home/adoni5/Documents/Bioinformatics/refs/hg38_no_alts.fa.gz.split/hg38_chr_M.mmi.

Region hum_test has targets on 1 contig, with 1 found in the provided reference.
This region has 2 total targets (+ve and -ve strands), covering approximately 100.00% of the genome.
  1. If your TOML file validates then run the following command:

    readfish targets --toml <PATH_TO_TOML> --device <YOUR_DEVICE_ID> --log-file test.log --experiment-name human_select_test

  2. In the terminal window you should see messages reporting the speed of mapping of the form:

    2023-10-05 15:24:03,910 readfish.targets MinKNOW is reporting PHASE_MUX_SCAN, waiting for PHASE_SEQUENCING to begin.
    2023-10-05 15:25:48,150 readfish._read_until_client Protocol phase changed to PHASE_SEQUENCING
    2023-10-05 15:25:48,724 readfish.targets 0494R/0.5713s; Avg: 0494R/0.5713s; Seq:0; Unb:494; Pro:0; Slow batches (>1.00s): 0/1
    2023-10-05 15:25:52,132 readfish.targets 0004R/0.1831s; Avg: 0249R/0.3772s; Seq:0; Unb:498; Pro:0; Slow batches (>1.00s): 0/2
    2023-10-05 15:25:52,600 readfish.targets 0122R/0.2494s; Avg: 0206R/0.3346s; Seq:0; Unb:620; Pro:0; Slow batches (>1.00s): 0/3
    2023-10-05 15:25:52,967 readfish.targets 0072R/0.2144s; Avg: 0173R/0.3046s; Seq:0; Unb:692; Pro:0; Slow batches (>1.00s): 0/4
    2023-10-05 15:25:53,349 readfish.targets 0043R/0.1932s; Avg: 0147R/0.2823s; Seq:0; Unb:735; Pro:0; Slow batches (>1.00s): 0/5
    2023-10-05 15:25:53,759 readfish.targets 0048R/0.2011s; Avg: 0130R/0.2688s; Seq:0; Unb:783; Pro:0; Slow batches (>1.00s): 0/6
    2023-10-05 15:25:54,206 readfish.targets 0126R/0.2458s; Avg: 0129R/0.2655s; Seq:0; Unb:909; Pro:0; Slow batches (>1.00s): 0/7
    2023-10-05 15:25:54,580 readfish.targets 0082R/0.2180s; Avg: 0123R/0.2595s; Seq:0; Unb:991; Pro:0; Slow batches (>1.00s): 0/8
    2023-10-05 15:25:54,975 readfish.targets 0053R/0.2110s; Avg: 0116R/0.2542s; Seq:0; Unb:1,044; Pro:0; Slow batches (>1.00s): 0/9
    2023-10-05 15:25:55,372 readfish.targets 0057R/0.2051s; Avg: 0110R/0.2492s; Seq:0; Unb:1,101; Pro:0; Slow batches (>1.00s): 0/10
    2023-10-05 15:25:55,817 readfish.targets 0135R/0.2467s; Avg: 0112R/0.2490s; Seq:0; Unb:1,236; Pro:0; Slow batches (>1.00s): 0/11
    2023-10-05 15:25:56,192 readfish.targets 0086R/0.2206s; Avg: 0110R/0.2466s; Seq:0; Unb:1,322; Pro:0; Slow batches (>1.00s): 0/12
    2023-10-05 15:25:56,588 readfish.targets 0060R/0.2138s; Avg: 0106R/0.2441s; Seq:0; Unb:1,382; Pro:0; Slow batches (>1.00s): 0/13
    2023-10-05 15:25:56,989 readfish.targets 0060R/0.2123s; Avg: 0103R/0.2418s; Seq:0; Unb:1,442; Pro:0; Slow batches (>1.00s): 0/14
    2023-10-05 15:25:57,429 readfish.targets 0133R/0.2502s; Avg: 0105R/0.2424s; Seq:0; Unb:1,575; Pro:0; Slow batches (>1.00s): 0/15
    2023-10-05 15:25:57,809 readfish.targets 0089R/0.2280s; Avg: 0104R/0.2415s; Seq:0; Unb:1,664; Pro:0; Slow batches (>1.00s): 0/16
    2023-10-05 15:25:58,210 readfish.targets 0059R/0.2247s; Avg: 0101R/0.2405s; Seq:0; Unb:1,723; Pro:0; Slow batches (>1.00s): 0/17
    ^C2023-10-05 15:25:58,238 readfish.targets Keyboard interrupt received, stopping readfish
    
Warning
Note: if these times are longer than the number of seconds specified in the break read chunk in the sequencing TOML, you will have performance issues. Contact us via GitHub issues for support.

This log is a little dense at first. Moving from left to right, we have:

[Date Time] [Logger Name] [Batch Stats]; [Average Batch Stats]; [Count commands sent]; [Slow Batch Info]

Using the provided log as an example:

On 2023-10-05 at 15:25:56,989, the Readfish targets command logged a batch of read signal:

- It saw 60 reads in the current batch.
- The batch took 0.2123 seconds.
- On average, batches are 103 reads, which are processed in 0.2418 seconds.
- Since the start, 0 reads were sequenced, 1,442 reads were unblocked, and 0 reads were asked to proceed.
- Out of 14 total batches processed, 0 were considered slow (took more than 1 second).

The important thing to note here is that the average batch time should be less than the break read chunk time in the sequencing TOML. The slow batch section shows the number of batches that were slower than break reads. If the average batch time exceeds the break read time, or the slow batch count is high, you will have performance issues. Contact us via GitHub issues for support.
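If you would rather check this from a saved log than by eye, the short script below (a sketch; the regular expression simply matches the batch statistics shown above) summarises each batch and flags any batch slower than your break reads setting. Run it as python check_batches.py test.log (check_batches.py being whatever you name the script), where test.log is the file passed to --log-file.

import re
import sys

# Matches the per-batch statistics in readfish targets log lines, e.g.
#   "0060R/0.2123s; Avg: 0103R/0.2418s; ..."
BATCH = re.compile(r"(\d+)R/([\d.]+)s; Avg: (\d+)R/([\d.]+)s")
BREAK_READS_SECONDS = 1.0  # set to break_reads_after_seconds from your sequencing TOML

with open(sys.argv[1]) as log:
    for line in log:
        match = BATCH.search(line)
        if match is None:
            continue
        reads, seconds = int(match.group(1)), float(match.group(2))
        flag = "SLOW" if seconds > BREAK_READS_SECONDS else "ok"
        print(f"{reads:>5} reads in {seconds:.4f}s [{flag}]")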

If you are happy with the speed of mapping, move on to testing a selection.

Testing expected results from a selection experiment.

The only way to test readfish on a playback run is to look at changes in read length for rejected vs accepted reads. To do this:

  1. Start a fresh simulation run using the bulkfile provided above.
  2. Restart the readfish command (as above):
    readfish targets --toml <PATH_TO_TOML> --device <YOUR_DEVICE_ID> --log-file test.log --experiment-name human_select_test
  3. Allow the run to proceed for at least 15 minutes (making sure you are writing out read data!).
  4. After 15 minutes it should look something like this: Playback Unblock Image If one zooms in on the unblock peak: Closeup Playback Unblock Image And if one zooms to exclude the unblock peak: Closeup Playback On Target Image NOTE: These simulations are also run on Apple Silicon - GPU platform performance may vary - please contact us via github issues for support.

Analysing results with readfish stats

Once a run is complete, it can be analysed with the readfish stats command.

HTML file output is optional.

readfish stats --toml <path/to/toml/file.toml> --fastq-directory  <path/to/run/folder> --html <filename>

Readfish stats will use the initial experiment configuration to analyse the final sequence data and output a formatted table to the screen. The table is broken into two sections. For clarity these are shown individually below.

In the first table, the data is summarised by condition as defined in the TOML file. In this example we have a single Region - "hum_test". The total number of reads is shown, along with the number of alignments broken down into On-Target and Off-Target. In addition, we show yield, median read length and a summary of the number of targets.

Condition | Reads | Alignments On-Target | Alignments Off-Target | Alignments Total | Yield On-Target | Yield Off-Target | Yield Total | Yield Ratio | Median read length On-Target | Median read length Off-Target | Median read length Combined | Number of targets | Percent target | Estimated coverage
hum_test | 112,058 | 819 (0.73%) | 111,239 (99.27%) | 112,058 | 9.27 Mb (5.49%) | 159.43 Mb (94.51%) | 168.69 Mb | 1:17.20 | 0 b | 896 b | 896 b | 2 | 3.60% | 0.08 X
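As a worked example of the Percent target column, the two targets here are chr20 and chr21, whose lengths appear in the per-contig table below; the genome length used is an assumed round figure for GRCh38.

# Fraction of the reference covered by the chr20 + chr21 targets.
chr20 = 64_444_167           # contig length, from the per-contig table
chr21 = 46_709_983
genome = 3_100_000_000       # assumed approximate GRCh38 length
print(f"{100 * (chr20 + chr21) / genome:.2f}%")  # ~3.59%, matching the ~3.60% above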

The lower portion of the table shows the data broken down by contig in the reference (and so can be very long if using a complex reference!). Again data are broken down by On and Off target. Read counts, yield, median and N50 read lengths are presented. Finally we estimate the proportion of reads on target and an estimate of coverage.

In this experiment, we were targeting chromosomes 20 and 21. As this is a playback run there is no effect on yield, but you can see a clear effect on read length. The read length N50 and median are higher for chromosomes 20 and 21, as expected. If running on more performant systems, the anticipated difference would be higher.

Condition Name: hum_test

Condition | Contig | Contig Length | Reads Mapped | Reads Unmapped | Reads Total | Alignments On-Target | Alignments Off-Target | Alignments Total | Yield On-Target | Yield Off-Target | Yield Total | Yield Ratio | Median read length On-Target | Median read length Off-Target | Median read length Combined | N50 On-Target | N50 Off-Target | N50 Total | Number of targets | Percent target | Estimated coverage
hum_test | chr1 | 248,956,422 | 10,015 | 0 | 10,015 | 6 (0.06%) | 10,009 (99.94%) | 10,015 | 48.65 Kb (0.37%) | 13.03 Mb (99.63%) | 13.08 Mb | 1:267.87 | 0 b | 891 b | 891 b | 0 b | 1.35 Kb | 1.35 Kb | 0 | 0.00% | 0.00 X
hum_test | chr2 | 242,193,529 | 8,825 | 0 | 8,825 | 9 (0.10%) | 8,816 (99.90%) | 8,825 | 47.36 Kb (0.36%) | 13.05 Mb (99.64%) | 13.09 Mb | 1:275.51 | 0 b | 894 b | 894 b | 0 b | 1.49 Kb | 1.49 Kb | 0 | 0.00% | 0.00 X
hum_test | chr3 | 198,295,559 | 8,005 | 0 | 8,005 | 6 (0.07%) | 7,999 (99.93%) | 8,005 | 193.03 Kb (1.73%) | 10.98 Mb (98.27%) | 11.17 Mb | 1:56.86 | 0 b | 893 b | 893 b | 0 b | 1.42 Kb | 1.42 Kb | 0 | 0.00% | 0.00 X
hum_test | chr4 | 190,214,555 | 7,381 | 0 | 7,381 | 30 (0.41%) | 7,351 (99.59%) | 7,381 | 861.07 Kb (7.29%) | 10.95 Mb (92.71%) | 11.81 Mb | 1:12.72 | 0 b | 917 b | 917 b | 0 b | 1.60 Kb | 1.60 Kb | 0 | 0.00% | 0.00 X
hum_test | chr5 | 181,538,259 | 7,545 | 0 | 7,545 | 5 (0.07%) | 7,540 (99.93%) | 7,545 | 50.70 Kb (0.50%) | 10.18 Mb (99.50%) | 10.23 Mb | 1:200.68 | 0 b | 896 b | 896 b | 0 b | 1.40 Kb | 1.40 Kb | 0 | 0.00% | 0.00 X
hum_test | chr6 | 170,805,979 | 5,808 | 0 | 5,808 | 9 (0.15%) | 5,799 (99.85%) | 5,808 | 116.44 Kb (1.35%) | 8.53 Mb (98.65%) | 8.65 Mb | 1:73.28 | 0 b | 905 b | 905 b | 0 b | 1.49 Kb | 1.49 Kb | 0 | 0.00% | 0.00 X
hum_test | chr7 | 159,345,973 | 6,383 | 0 | 6,383 | 2 (0.03%) | 6,381 (99.97%) | 6,383 | 26.06 Kb (0.29%) | 9.11 Mb (99.71%) | 9.14 Mb | 1:349.59 | 0 b | 895 b | 895 b | 0 b | 1.44 Kb | 1.44 Kb | 0 | 0.00% | 0.00 X
hum_test | chr8 | 145,138,636 | 5,208 | 0 | 5,208 | 1 (0.02%) | 5,207 (99.98%) | 5,208 | 285 b (0.00%) | 7.43 Mb (100.00%) | 7.43 Mb | 1:26061.60 | 0 b | 892 b | 892 b | 0 b | 1.44 Kb | 1.44 Kb | 0 | 0.00% | 0.00 X
hum_test | chr9 | 138,394,717 | 4,253 | 0 | 4,253 | 23 (0.54%) | 4,230 (99.46%) | 4,253 | 91.15 Kb (1.50%) | 6.00 Mb (98.50%) | 6.09 Mb | 1:65.85 | 0 b | 899 b | 899 b | 0 b | 1.46 Kb | 1.46 Kb | 0 | 0.00% | 0.00 X
hum_test | chr10 | 133,797,422 | 4,424 | 0 | 4,424 | 15 (0.34%) | 4,409 (99.66%) | 4,424 | 95.02 Kb (1.37%) | 6.86 Mb (98.63%) | 6.95 Mb | 1:72.16 | 0 b | 915 b | 915 b | 0 b | 1.56 Kb | 1.56 Kb | 0 | 0.00% | 0.00 X
hum_test | chr11 | 135,086,622 | 5,349 | 0 | 5,349 | 1 (0.02%) | 5,348 (99.98%) | 5,349 | 287 b (0.00%) | 6.89 Mb (100.00%) | 6.89 Mb | 1:23997.50 | 0 b | 896 b | 896 b | 0 b | 1.35 Kb | 1.35 Kb | 0 | 0.00% | 0.00 X
hum_test | chr12 | 133,275,309 | 5,508 | 0 | 5,508 | 3 (0.05%) | 5,505 (99.95%) | 5,508 | 2.63 Kb (0.03%) | 7.59 Mb (99.97%) | 7.59 Mb | 1:2888.96 | 0 b | 893 b | 893 b | 0 b | 1.40 Kb | 1.40 Kb | 0 | 0.00% | 0.00 X
hum_test | chr13 | 114,364,328 | 3,414 | 0 | 3,414 | 8 (0.23%) | 3,406 (99.77%) | 3,414 | 85.71 Kb (1.80%) | 4.69 Mb (98.20%) | 4.77 Mb | 1:54.67 | 0 b | 900 b | 900 b | 0 b | 1.43 Kb | 1.43 Kb | 0 | 0.00% | 0.00 X
hum_test | chr14 | 107,043,718 | 3,541 | 0 | 3,541 | 12 (0.34%) | 3,529 (99.66%) | 3,541 | 244.18 Kb (4.79%) | 4.86 Mb (95.21%) | 5.10 Mb | 1:19.90 | 0 b | 892 b | 892 b | 0 b | 1.42 Kb | 1.42 Kb | 0 | 0.00% | 0.00 X
hum_test | chr15 | 101,991,189 | 3,033 | 0 | 3,033 | 3 (0.10%) | 3,030 (99.90%) | 3,033 | 4.29 Kb (0.11%) | 3.79 Mb (99.89%) | 3.80 Mb | 1:883.07 | 0 b | 867 b | 867 b | 0 b | 1.31 Kb | 1.31 Kb | 0 | 0.00% | 0.00 X
hum_test | chr16 | 90,338,345 | 3,276 | 0 | 3,276 | 1 (0.03%) | 3,275 (99.97%) | 3,276 | 1.97 Kb (0.04%) | 4.51 Mb (99.96%) | 4.51 Mb | 1:2294.28 | 0 b | 900 b | 900 b | 0 b | 1.41 Kb | 1.41 Kb | 0 | 0.00% | 0.00 X
hum_test | chr17 | 83,257,441 | 3,378 | 0 | 3,378 | 10 (0.30%) | 3,368 (99.70%) | 3,378 | 16.81 Kb (0.36%) | 4.72 Mb (99.64%) | 4.73 Mb | 1:280.52 | 0 b | 907 b | 907 b | 0 b | 1.43 Kb | 1.43 Kb | 0 | 0.00% | 0.00 X
hum_test | chr18 | 80,373,285 | 3,158 | 0 | 3,158 | 3 (0.09%) | 3,155 (99.91%) | 3,158 | 186.59 Kb (4.06%) | 4.41 Mb (95.94%) | 4.59 Mb | 1:23.61 | 0 b | 899 b | 899 b | 0 b | 1.47 Kb | 1.47 Kb | 0 | 0.00% | 0.00 X
hum_test | chr19 | 58,617,616 | 2,110 | 0 | 2,110 | 0 (0.00%) | 2,110 (100.00%) | 2,110 | 0 b (0.00%) | 2.53 Mb (100.00%) | 2.53 Mb | 0:0.00 | 0 b | 857 b | 857 b | 0 b | 1.27 Kb | 1.27 Kb | 0 | 0.00% | 0.00 X
hum_test | chr20 | 64,444,167 | 370 | 0 | 370 | 370 (100.00%) | 0 (0.00%) | 370 | 3.60 Mb (100.00%) | 0 b (0.00%) | 3.60 Mb | 1:0.00 | 0 b | 2.88 Kb | 2.88 Kb | 0 b | 32.28 Kb | 32.28 Kb | 1 | 100.00% | 0.06 X
hum_test | chr21 | 46,709,983 | 265 | 0 | 265 | 265 (100.00%) | 0 (0.00%) | 265 | 3.06 Mb (100.00%) | 0 b (0.00%) | 3.06 Mb | 1:0.00 | 0 b | 2.63 Kb | 2.63 Kb | 0 b | 33.54 Kb | 33.54 Kb | 1 | 100.00% | 0.07 X
hum_test | chr22 | 50,818,468 | 1,741 | 0 | 1,741 | 28 (1.61%) | 1,713 (98.39%) | 1,741 | 421.99 Kb (14.61%) | 2.47 Mb (85.39%) | 2.89 Mb | 1:5.85 | 0 b | 922 b | 922 b | 0 b | 1.63 Kb | 1.63 Kb | 0 | 0.00% | 0.00 X
hum_test | chrM | 16,569 | 19 | 0 | 19 | 0 (0.00%) | 19 (100.00%) | 19 | 0 b (0.00%) | 16.82 Kb (100.00%) | 16.82 Kb | 0:0.00 | 0 b | 774 b | 774 b | 0 b | 1.11 Kb | 1.11 Kb | 0 | 0.00% | 0.00 X
hum_test | chrX | 156,040,895 | 5,636 | 0 | 5,636 | 5 (0.09%) | 5,631 (99.91%) | 5,636 | 3.19 Kb (0.04%) | 7.46 Mb (99.96%) | 7.46 Mb | 1:2336.71 | 0 b | 905 b | 905 b | 0 b | 1.38 Kb | 1.38 Kb | 0 | 0.00% | 0.00 X
hum_test | chrY | 57,227,415 | 116 | 0 | 116 | 4 (3.45%) | 112 (96.55%) | 116 | 117.28 Kb (27.65%) | 306.90 Kb (72.35%) | 424.19 Kb | 1:2.62 | 0 b | 989 b | 989 b | 0 b | 28.59 Kb | 28.59 Kb | 0 | 0.00% | 0.00 X
hum_test | unmapped | 0 | 0 | 3,297 | 3,297 | 0 (0.00%) | 3,297 (100.00%) | 3,297 | 0 b (0.00%) | 9.10 Mb (100.00%) | 9.10 Mb | 0:0.00 | 0 b | 508 b | 508 b | 0 b | 16.81 Kb | 16.81 Kb | 0 | 0.00% | 0.00 X

Common Gotchas

These may or may not (!) be mistakes we have made already...

  1. If the previous run has not fully completed - i.e. it is still base-calling or processing raw data - you may connect to the wrong instance and see nothing happening. Always check the previous run has finished completely.
  2. If you have forgotten to remove your simulation line from your sequencing toml you will forever be trapped in an inception like resequencing of old data... Don't do this!
  3. If base-calling doesn't seem to be working check:
    • Check your base-calling server is running.
    • Check the ip of your server is correct.
    • Check the port of your server is correct.
  4. If you are expecting reads to unblock but they do not - check that you have set control=false in your readfish toml file. control=true will prevent any unblocks but otherwise runs the full analysis pipeline.
  5. Oh no - every single read is being unblocked - I have nothing on target!
    • Double check your reference file is in the correct location.
    • Double check your targets exist in that reference file.
    • Double check your targets are correctly formatted with contig name matching the record names in your reference (exclude the description - i.e. the contig name up to the first whitespace).
  6. Where has my reference gone? If you are using a _live TOML file - e.g. running iter_align or iter_cent - the previous reference MMI file is deleted when a new one is added. This obviously saves on disk space use(!) but can lead to unfortunate side effects - i.e. you delete your MMI file. These can of course be recreated, but user beware.

Happy readfish-ing!

Acknowledgements

We're really grateful to lots of people for help and support. Here's a few of them...

From the lab: Teri Evans, Sam Holt, Lewis Gallagher, Chris Alder, Thomas Clarke

From ONT: Stu Reid, Chris Wright, Rosemary Dokos, Chris Seymour, Clive Brown, George Pimm, Jon Pugh

From the Nanopore World: Nick Loman, Josh Quick, John Tyson, Jared Simpson, Ewan Birney, Alexander Senf, Nick Goldman, Miten Jain, Lukas Weilguny

And for our awesome logo please check out @tim_bassford from @TurbineCreative!

Changelog

2024.2.0

  1. Add a dorado base-caller which addressed issue #347 - chiefly in Dorado 7.3.9 ONT have moved to ont-pybasecall-client-lib, and connections from ont_pyguppy_client_lib raise Connection error. ... LOAD_CONFIG. Reply: INVALID_PROTOCOL (#344)
  2. Adds version checking for MinKNOW and Guppy/Dorado, logs if not compatible (#351)

2024.1.0

  1. Bug fix for the --wait-on-ready type and actual function (#327), (#323)
  2. Multiple suffix .mmi support (#330)
  3. Change the default unblock_duration on the Analysis class to use DEFAULT_UNBLOCK value defined in _cli_args.py. Change type on the Argparser for --unblock-duration to float. (#313)
  4. Big dog Duplex feature - adds ability to select duplex reads that cover a target region. See pull request for details (#324)

2023.1.1

  1. Fix Readme Logo link 🥳 (#296)
  2. Fix bug where we had accidentally started requiring barcoded TOMLs to specify a region. Thanks to @jamesemery for catching this. (#299)
  3. Correctly handle overriding a decision in internal statistics tracking. (#299)

readfish's People

Contributors

adoni5, alexomics, mattloose, svennd, thomassclarke


readfish's Issues

RU crashes when pyguppy returns NoneType

Hello,

We had a read-until run crash recently after running successfully for ~9 hours. Sequencing proceeded as normal, but the adaptive selection ceased to operate after this point. The goal of the run was to deplete reads that align to the target reference (unblock) and allow all non-mapping reads to proceed.

Here's the contents of the toml file:

[caller_settings]
config_name = "dna_r9.4.1_450bps_hac"
host = "127.0.0.1"
port = 5555

[conditions]
reference = "/home/grid/git/pyguppyclient/read_until/assembly.no_eupl.combo.ont.mmi"

[conditions.0]
name = "bac_background_depletion"
control = false
min_chunks = 0
max_chunks = inf
targets = ["combined_bac_contigs"]
single_on = "unblock"
multi_on = "unblock"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

And here are the last few lines of the read-until log:

2020-03-09 21:55:17,987 DEC 9655 167 3055fd3e-a776-47be-8dd2-41cb6364ee17 298 20068 8 1 no_map proceed
2020-03-09 21:55:17,988 DEC 9655 168 746b3d88-9c1b-4768-8495-b62daebfff5c 299 17338 4816 4 no_map proceed
2020-03-09 21:55:17,988 DEC 9655 169 9edd601d-bbf4-4cb5-b282-c1df702abf35 5 17692 887 2 no_map proceed
2020-03-09 21:55:17,990 ru.ru_gen 169R/5.10888s
2020-03-11 13:08:09,670 Manager Sending reset
2020-03-11 13:08:09,800 Manager EXCEPT
Traceback (most recent call last):
File "/home/grid/git/pyguppyclient/read_until/lib/python3.7/site-packages/ru/ru_gen.py", line 397, in run_workflow
res = result.get(3)
File "/home/grid/miniconda3/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
File "/home/grid/miniconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/grid/git/pyguppyclient/read_until/lib/python3.7/site-packages/ru/ru_gen.py", line 207, in simple_analysis
decided_reads=decided_reads,
File "/home/grid/git/pyguppyclient/read_until/lib/python3.7/site-packages/ru/basecall.py", line 146, in map_reads_2
for read_info, read_id, seq, seq_len, quality in calls:
File "/home/grid/git/pyguppyclient/read_until/lib/python3.7/site-packages/ru/basecall.py", line 101, in basecall_minknow
self.pass_read(read)
File "/home/grid/git/pyguppyclient/read_until/lib/python3.7/site-packages/pyguppyclient/client.py", line 123, in pass_read
), simple=False)
File "/home/grid/git/pyguppyclient/read_until/lib/python3.7/site-packages/pyguppyclient/client.py", line 55, in send
return simple_response(self.recv())
File "/home/grid/git/pyguppyclient/read_until/lib/python3.7/site-packages/pyguppyclient/ipc.py", line 101, in simple_response
raise Exception(cls.Text().decode())
AttributeError: 'NoneType' object has no attribute 'decode'

Any guidance on what happened here? It could be that this is a pyguppy bug, but I thought I'd first run it by you read-until folks to see if you have encountered this before. I'm hesitant to simply relaunch the run without debugging as this is a precious sample.

Thanks,
John

counter intuitive result

We have simulated a run that we would like to do based on your bulk file. But we get really counter-intuitive results.
The toml we have used is built like this:

[caller_settings]
config_name = "dna_r9.4.1_450bps_fast"
host = "localhost"
port = 5550
​
[conditions]
reference = "some.mmi"
axis = 1
​
[conditions.0]
name = "control"
control = true
min_chunks = 0
max_chunks = 4
targets = ["p"]
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"
​
[conditions.1]
name = "enrich_t1"
control = false
min_chunks = 0
max_chunks = 4
targets = ["t1"]
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"
​
[conditions.2]
name = "enrich_ts"
control = false
min_chunks = 0
max_chunks = 4
targets = ["t1","t2","t3","t4","t5","t6"]
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"
​
[conditions.3]
name = "deplete_ts"
control = false
min_chunks = 0
max_chunks = 4
targets = ["t1","t2","t3","t4","t5","t6"]
single_on = "unblock"
multi_on = "unblock"
single_off = "stop_receiving"
multi_off = "stop_receiving"
no_seq = "proceed"
no_map = "proceed"

Afterwards I split the fastq according to the channel in each reads header with something like the following:

run_info, conditions, reference, caller_settings = get_run_info(toml_file)
channel = int(fastq_header.split()[4].split("=")[1])
field = run_info[channel]

Afterwards I utilized summarise_fq.py on each condition's fastq and looked at the sum column as a proxy for whether the enrichment has worked or not.
And I see:

Conditions:               0        1        2         3
description:              control  t1_enr   ts_enr    ts_depl
Total bases in ts:        334294   30561    7069      1210683
Normalized:               1        0.091    0.021     3.622

So it's the complete opposite of what I expected. Do you have an idea of what I am doing wrong ?

Guppy_CPU support?

Hi,

I was wondering about using ru without a GPU. Our server has 40 cores (80 threads), so I think we might be able to use ont-guppy-cpu_3.4.1_linux64.tar.gz as it comes with ont-guppy-cpu/bin/guppy_basecall_server.

Any idea if guppy_basecall_server could keep up with a MinION using CPUs?

Set messages level to 2

Currently messages are sent as normal levels, but should be pushed as warnings as they will change flowcell behaviour.

setup help on minion

Hi, I'm trying to get this set up on a MinION. I have got this as far as getting ru_unblock_all to work on the playback. I am not able to get the example with selective unblock going. It looks like reads are never getting sent to minimap2. The stderr looks like:

2020-03-09 16:50:08,856 Manager /home/quinlan/.local/bin/ru_generators --device MN18894 --experiment-name RU Test basecall and map --toml /data/human_chr_selection.toml --log-file human_chr_selection.log
2020-03-09 16:50:08,856 Manager batch_size=512
2020-03-09 16:50:08,856 Manager cache_size=512
2020-03-09 16:50:08,856 Manager channels=[1, 512]
2020-03-09 16:50:08,856 Manager chunk_log=chunk_log.log
2020-03-09 16:50:08,856 Manager device=MN18894
2020-03-09 16:50:08,856 Manager dry_run=False
2020-03-09 16:50:08,856 Manager experiment_name=RU Test basecall and map
2020-03-09 16:50:08,856 Manager host=127.0.0.1
2020-03-09 16:50:08,856 Manager log_file=human_chr_selection.log
2020-03-09 16:50:08,856 Manager log_format=%(asctime)s %(name)s %(message)s
2020-03-09 16:50:08,856 Manager log_level=info
2020-03-09 16:50:08,856 Manager paf_log=paflog.log
2020-03-09 16:50:08,856 Manager port=9501
2020-03-09 16:50:08,856 Manager read_cache=AccumulatingCache
2020-03-09 16:50:08,856 Manager run_time=172800
2020-03-09 16:50:08,856 Manager throttle=0.1
2020-03-09 16:50:08,856 Manager toml=/data/human_chr_selection.toml
2020-03-09 16:50:08,857 Manager unblock_duration=0.1
2020-03-09 16:50:08,857 Manager workers=1
2020-03-09 16:50:08,860 Manager Initialising minimap2 mapper
2020-03-09 16:50:16,204 Manager Mapper initialised
2020-03-09 16:50:16,204 read_until_api_v2.main Client type: many chunk
2020-03-09 16:50:16,204 read_until_api_v2.main Cache type: AccumulatingCache
2020-03-09 16:50:16,204 read_until_api_v2.main Filter for classes: adapter and strand
2020-03-09 16:50:16,204 read_until_api_v2.main Creating rpc connection for device MN18894.
2020-03-09 16:50:16,480 read_until_api_v2.main Loaded RPC
2020-03-09 16:50:16,481 read_until_api_v2.main Signal data-type: int16
2020-03-09 16:50:16,482 Manager This experiment has 1 region on the flowcell
2020-03-09 16:50:16,482 Manager Using reference: /data/human_g1k_v38_decoy_phix.fasta.mmi
2020-03-09 16:50:16,483 Manager Region 'select_chr_21_22' (control=False) has 2 targets of which 2 are in the reference. Reads will be unblocked when classed as single_off or multi_off; sequenced when classed as single_on or multi_on; and polled for more data when classed as no_map or no_seq.
2020-03-09 16:50:16,484 Manager Creating 1 workers
2020-03-09 16:50:16,484 read_until_api_v2.main Processing started
2020-03-09 16:50:16,485 read_until_api_v2.main Sending init command, channels:1-512, min_chunk:0

and then it just stalls there.

Other potentially useful information:

  • I have started the run in minKNOW (both with and without minKNOW base-calling -- and trying an external guppy base-calling server)
  • if I enter a different PORT, then it fails as it can't connect.
  • here is the command I am using:
ru_generators --device $DEVICE \
              --port $PORT \
              --experiment-name "RU Test basecall and map" \
              --toml $TOML \
              --log-file $(basename $TOML .toml).log
  • In MinKNOW I can see the run proceeding uninterrupted (whereas with ru_unblock_all I can see the size distribution drop).

Anything else I can report or try to help diagnose?

ru_generators

ru_generators is the only script for interfacing with MinKNOW - I suggest we use the ru_ prefix for things which are actually doing read until - and then have some other prefix for non read until scripts.

SO ru_generators should become readuntil

ru_unblock_all is a test for making sure that MinKNOW is working and responding as expected. I suggest we rename to:

ru_test_unblock_all

ru_raw_signal_log is something which might be interesting and we should investigate.

ru_validate is a convenience wrapper for toml validation for read until - I suggest we call this:

ru_toml_validate

ru_iteralign is iteralign

and ru_iteralign_centrifuge becomes:

itercent

Applying to direct RNA sequencing

Hi,

I was wondering what do I need to change to allow this program to work on direct RNA sequencing?

Fortunately, I have kept a bulk file from one of my direct RNA sequencing runs and I will be able to test on that.

Question about unblocking Odd or Even pores

Hi,
Sorry if I have missed this. I am trying to run a test by only unblocking odd or even pores. How can I achieve this? I see the axis options in the toml file, should I use these?
Thanks,
Nick

Optimising read until on miniT

Hi Alex, Matt & team,
We have a miniT that runs our minion and inspired by #28 I had a go at installing and running ru on it. It works, right up to the basecalling & mapping and then I am having "performance issues". See my mapping times below. Not sure if it is worth continuing or if this is never going to work. Happy to provide additional logs or info.
Cheers,
Mark

2020-04-30 18:34:19,771 ru.ru_gen 74R/1.25966s
2020-04-30 18:34:23,165 ru.ru_gen 148R/3.39248s
2020-04-30 18:34:28,704 ru.ru_gen 174R/5.53804s
2020-04-30 18:34:39,260 ru.ru_gen 247R/10.55593s
2020-04-30 18:34:45,618 ru.ru_gen 241R/6.34544s
2020-04-30 18:34:55,137 ru.ru_gen 242R/9.51814s
2020-04-30 18:35:10,468 ru.ru_gen 248R/15.33125s
2020-04-30 18:35:27,624 ru.ru_gen 257R/17.14404s
2020-04-30 18:35:46,549 ru.ru_gen 350R/18.92046s
2020-04-30 18:36:10,766 ru.ru_gen 346R/24.21621s
2020-04-30 18:36:39,124 ru.ru_gen 341R/28.34917s
2020-04-30 18:37:15,932 ru.ru_gen 355R/36.79682s
2020-04-30 18:38:03,242 ru.ru_gen 393R/47.28662s
2020-04-30 18:39:07,453 ru.ru_gen 409R/64.18478s
2020-04-30 18:40:24,991 ru.ru_gen 410R/77.52404s
2020-04-30 18:41:54,450 ru.ru_gen 421R/89.44561s
2020-04-30 18:43:21,877 ru.ru_gen 426R/87.40666s

For interest and info, I have solved a few installation traps on the miniT. No guarantees are given or implied.

sudo apt update
sudo apt upgrade # If required
sudo apt install python3-venv python3-dev libzmq3-dev libhdf5-dev screen
# Fetch aarch64 binary version of guppy basecaller >3.4 from Oxford Nanopore. eg.
wget https://mirror.oxfordnanoportal.com/software/analysis/ont-guppy_3.5.2_linuxaarch64.tar.gz
tar -xvf ont-guppy_3.5.2_linuxaarch64.tar.gz
python3 -m venv read_until
. ./read_until/bin/activate
pip install --upgrade pip
pip install git+https://github.com/LooseLab/read_until_api_v2@master
# Install Cython and h5py separately, limiting the version of h5py or you will get: AttributeError: module 'h5py.h5pl' has no attribute 'prepend'
# You can compile h5py 2.10.0 with HDF5 v1.8.4 just fine but it won't include the h5pl plugin attribute unless your HDF5 version is 1.10+ vis http://api.h5py.org/h5pl.html (this one got me good!).
pip install Cython
pip install h5py==2.9.0
pip install git+https://github.com/LooseLab/ru@master
# Install passed the first test at this stage
ru_generators

To get the read until working I started a new guppy server as suggested for the gridION. I got the settings to use from the existing guppy 3.2 logs on the miniT. I can't work out how to set "num socket threads" which is 1 in my logs but defaults to 2. The default of runners per device is 8.

screen sudo /opt/ont-guppy/bin/guppy_basecall_server \
--config /opt/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg \
--log_path /var/log/ont/guppy --port 5556 --device cuda:all \
--chunk_size 1000 --chunks_per_runner 48 \
--num_callers 1

Toml internal validation

When a TOML file is loaded, it should be validated to ensure no errors in reference names etc.

small target regions

Hi,

I am planning to run this ReadUntil experiment using a small set of target regions (~3M bases, so ~0.1% of the human genome). Do you think it will have a huge detrimental effect on the pores and eventually on the sequencing yield, as many unblocks will occur at the pores? I am thinking of whether it is better to include more regions (even though not in our regions of interest), just to keep the pores more occupied with sequencing.
Would appreciate hearing what you think.

Thanks,
Cen Liau

Installing on GridION

Hey,

Keen to try RU with our GridION, but your README states we need guppy 3.4 and the current software release for the GridION is only at 3.2. I don't really want to install anything on the GridION as I have been burned by this in the past! I did think I could install ru on another machine and use the --host and --port flags to connect to the GridION, but that's when I hit the guppy version snag (well, at least I guess this is what is causing the error below).

Traceback (most recent call last):
  File "/home/md1mpar/wc/miniconda2/envs/ru/bin/ru_generators", line 11, in <module>
    load_entry_point('ru', 'console_scripts', 'ru_generators')()
  File "/home/md1mpar/wc/ru/ru/ru_gen.py", line 463, in main
    read_until_client = read_until.ReadUntilClient(
  File "/home/md1mpar/wc/read_until_api_v2/read_until_api_v2/main.py", line 239, in __init__
    self.connection, self.message_port = get_rpc_connection(
  File "/home/md1mpar/wc/read_until_api_v2/read_until_api_v2/load_minknow_rpc.py", line 174, in get_rpc_connection
    response = stub.list_devices(list_request)
  File "/home/md1mpar/wc/miniconda2/envs/ru/lib/python3.8/site-packages/grpc/_channel.py", line 826, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/md1mpar/wc/miniconda2/envs/ru/lib/python3.8/site-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNIMPLEMENTED
	details = ""
	debug_error_string = "{"created":"@1583246614.752489054","description":"Error received from peer ipv4:143.167.151.27:8000","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"","grpc_status":12}"

What are my options here? You state in the README you've tested on a GridION? How did you do that? Install guppy 3.4, ru, and the readuntil api?

Thanks!

Best action on first connect to a run.

If we have a run that has been in progress and then start read until, we see a large number of reads unblocked. In many cases these reads may be longer than the max chunks parameter. Thus we are unblocking very long molecules which we might prefer not to. A better course of action might be to estimate the length of the read on that first grab of data and have some rules for rejection or sequencing. This is especially true if we have very long reads in our libraries.
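For illustration of that idea, the sketch below estimates how many bases a read already represents from the number of raw samples in the first chunk of signal; the sampling rate and translocation speed are typical R9.4.1 figures assumed here, not values taken from the API.

# Rough estimate of bases already sequenced, from raw signal samples seen so far.
def estimated_bases(n_samples, sample_rate=4000, bases_per_second=450):
    # Assumptions: ~4 kHz sampling rate, ~450 bases/s translocation speed (R9.4.1)
    return n_samples / sample_rate * bases_per_second

print(estimated_bases(20_000))  # ~2250 bases already through the pore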

CPU basecalling - increased number of reads processed together?

First of all, installing the new Read Until code works like a charm! Thanks a lot!

We're trying to get Read Until going with CPU basecalling. We're using a machine with 40 physical Xeon Silver cores, and live basecalling in MinKNOW (fast mode) seems to easily keep up with the incoming data from the bulk FAST5 test file, so I hope that this should, at least in principle, not be a hopeless undertaking.

ru_generators starts up fine, but there may be some kind of communications issue between the Read Until code and the Guppy basecalling server. Specifically, this is the output I get from ru_generators:

2020-02-29 18:24:00,807 ru.ru_gen 1R/0.33953s
2020-02-29 18:24:47,350 ru.ru_gen 148R/46.54283s
2020-02-29 18:28:25,598 ru.ru_gen 205R/218.24645s

(and then nothing more, so maybe things are slowing down over time)

I assume that 'R' specifies the number of reads, and the number after that specifies the average amount of time spent on the initial basecalling per read? ... so it seems that the number of reads lumped together is much higher than in the example provided?

Guppy server command:

./guppy_basecall_server --config dna_r9.4.1_450bps_hac.cfg --port 5550 --log_path guppy_log.txt --ipc_threads 3 --max_queued_reads 2000 --data_path ../data --num_callers 1 --cpu_threads_per_caller 18

Read Until command:

./ru_generators --device MN25472 --experiment-name "test2" --toml example.toml --log-file read_until_log.txt

Apart from that, everything is exactly identical to the default / example files.

Setting up another guppy server

Alex and Matt,
If one has a Linux box running new Minion-nc, AND new guppy (version 3.5.x) in a different directory (say home/local/bin), one could run basecalling "remote" using the local/bin copy in server mode as long as the toml file is properly configured (port other than 5555). Is that correct? I think that Alex had a diagram for something close to this in an earlier reply to someone setting up ru on a GridION. I want to set this up for a MinION, and was wondering if running the basecalling server on the same machine (but a different installation directory) would work. Any advice on setting up the basecalling server or toml file for this case?

Originally posted by @tchrisboles in #30 (comment)

Read Until on a Promethion

Hi Matt,
thank you for this nice piece of software. I have used your code and extended it to enable targeted enrichment from a BED file. Not perfect yet but it works! So I didn't use the generator script.

Up to now, we have managed to run Read Until on a MinION with great results. Now we tried it on a PromethION. I have set the first channel to 1 and the last channel to 3000. I'm using the fast basecalling model. Running it with a single worker, it seems not to keep up with the guppy basecaller, so we increased the workers (to 6), which seems to be able to handle the number of reads:

2020-05-12 15:42:38,611 [read_until_lafuga] Thread:6 21 reads/0.28550s Unblocked: 1565 Target: 398 Total reads: 4682 Guppy queue: 11
2020-05-12 15:42:38,612 [read_until_lafuga] Thread:3 1 reads/0.09670s Unblocked: 1667 Target: 460 Total reads: 4889 Guppy queue: 11
Throttle: 0.0032995089422911406
2020-05-12 15:42:38,647 [read_until_lafuga] Thread:2 5 reads/0.13400s Unblocked: 1591 Target: 426 Total reads: 4845 Guppy queue: 0
Throttle: 0.09994410490617156
2020-05-12 15:42:38,669 [read_until_lafuga] Thread:1 2 reads/0.13900s Unblocked: 1529 Target: 431 Total reads: 4708 Guppy queue: 4
2020-05-12 15:42:38,722 [read_until_lafuga] Thread:4 1 reads/0.10409s Unblocked: 1688 Target: 469 Total reads: 5077 Guppy queue: 6
Throttle: 0.09996974118985236
2020-05-12 15:42:38,866 [read_until_lafuga] Thread:3 5 reads/0.14211s Unblocked: 1668 Target: 460 Total reads: 4894 Guppy queue: 1

But MinKNOW only partially shows some unblocking, or some kind of unblocking waves over the flowcell.
My assumption is that the basecalling and unblocking decisions are fast enough but that the MinKNOW RPC port can't handle the unblocking actions in time!? Do you have any experience/advice?

Have you ever tried to run RU on a PromethION, or do you have any advice on what we could do? Or is it currently not possible to run it on a PromethION?

Thank you for your help
Best Alex

Question regarding PC usage

Hi Loose Lab, thanks for the great software!
We are currently running our MinIONs using a PC desktop (32 GB RAM, 12 CPU cores), so a pretty reasonable machine. We also have direct access from the machine to high-performance compute servers with lots of GPUs, etc. I guess my quick question is whether you think it would be worthwhile trying to get 'read until' running on the PC, maybe using Windows Subsystem for Linux, or should I just bite the bullet and get a new Linux box? Alternatively, is there any way I can get our HPC servers to use 'read until' to control the MinION?
Thanks again,
Matt.

Choice of parameters

Hi,

Some weeks ago, we did a sequencing run to test ReadUntil without reference mapping and basecalling. Everything went fine and I could recognize a peak at 500 bp as expected. But after 3 hours I observed that a lot of action messages failed and the read lengths were increasing on unblocked channels. Now I wonder if this could be caused by the parameters I selected for running ReadUntil (e.g. action batch size 100, unblock_duration 1.0). I'm not sure whether I completely understand how the parameters influence the behaviour of the gRPC stream. Do you have any suggestions for parameter choice?

Thanks
Jens

Manager Sending reset : human error ?

In our latest run, we noticed that read until script stopped with the following message :

2020-04-13 11:18:33,073 read_until_api_v2.main Reset request received, shutting down...
2020-04-13 11:18:33,086 read_until_api_v2.main Reset signal received by action handler.
2020-04-13 11:18:33,151 read_until_api_v2.main Stopping processing of reads due to reset.
2020-04-13 11:18:33,338 read_until_api_v2.main Stream handler exited successfully.
2020-04-13 11:18:33,395 ru.ru_gen Finished analysis of reads as client stopped.
2020-04-13 11:18:33,543 Manager Worker exited successfully.

The sequencing continued, but read until for some reason stopped "randomly". This is the second time. The first time we thought it was human error; but now, during that time, there was nobody around the machine...

What can cause the script to exit? We run ru_generators in a Linux screen (if that should be relevant).

thanks;

Chunk log information

Hey,
Could we have clear information on the meaning of the chunk log file?
Thanks

Alert when no targets are provided

When running a flowcell with read until, if no targets are provided (or the path to a file of targets is wrong) the experiment essentially becomes an 'unblock all' configuration. This is NOT good!

More explicit handling of reads that exceed the maximum sampling

Currently, in the event of a given read exceeding the maximum threshold we unblock unless the last decision was "stop_receiving" see here. However, this is not fit for purpose if the reference only contains targets that need to be removed, as anything that doesn't classify will be unblocked.

The action to take in the event of exceeding max chunks need to be either user settable (adds more complexity) or we could provide pre-defined scenarios e.g.:

ru deplete ...
ru enrich ...

Where deplete would be the use case for unblocking anything that classifies against the reference; whereas enrich would do the opposite and stop receiving anything that classifies.

These options might not encompass mixed references.

Cant simulate

Hi, I am trying to get this up and running in a similar deployment to the one suggested in the GridION issue, but I am immediately hitting my head against being able to simulate. MinKNOW seems to accept the modified script but I am not seeing any reads in the run (FYI I am on Windows).

Additionally, is there any way to install the read until api v2 in a location where MinKNOW is either not installed (I can copy over the files manually) or not installed in a default location?

Cheers!

Off-target region with high mapping reads

Hi,

We ran a COSMIC panel in NB4 cells and found an off-target region (MIR12136) with high mapping reads.
We wonder whether this off-target result was observed in your experiment?

[Image 1]

Amber

ReadUntil for filling in gaps?

Not really an issue, rather a question. Curious what others think about the idea of using ReadUntil with one MinION flowcell and some hopefully long Nanopore reads (N50 > 50 Kbp) to try and fill in gaps in a several Mbp region that was scaffolded with Hi-C reads but still has some gaps of unknown sizes (arbitrary 1000-bp gaps between contigs joined by Hi-C contact maps). It seems like one could use the desired region as the reference for ReadUntil and then use Racon to fill in gaps with the obtained enriched reads. Any thoughts?

Selection of mapping conditions for low abundance enrichment and targeted sequencing

The mapping conditions listed in your paper for these two goals are similar:

multi-on: stop receiving
multi-off: proceed
single-on: stop receiving
single-off: unblock
no_map: proceed
no_seq: proceed

Two questions:

  1. If you unblock for single-off, why not unblock for multi-off? There seem to be more grounds for unblocking in the case of a multi-off signal...?

  2. Why proceed (in all schemes) for no_map and no_seq? What's the rationale?
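For reference, the table above amounts to a simple lookup from mapping class to action; the dictionary below is only an illustration of that table, not the project's implementation:

# mapping classification -> action, as listed above
DECISIONS = {
    "multi_on": "stop_receiving",
    "multi_off": "proceed",
    "single_on": "stop_receiving",
    "single_off": "unblock",
    "no_map": "proceed",
    "no_seq": "proceed",
}

def action_for(classification):
    # unknown classes default to proceeding (i.e. collect another chunk)
    return DECISIONS.get(classification, "proceed")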

Compatibility with MinKNOW-Core 4.04

Hi there, it looks like the GridION software and MinKNOW have received major updates on July 7 (notably, python2 -> python3). The nanoporetech read_until_api has also been updated.

Are there any plans to update ru and read_until_api_v2 to be compatible with the new MinKNOW? Thanks!

Read-until for short amplicons

Hi Matt,

We're using the amazing ARTIC protocol for Covid-19 sequencing but observe that there is sometimes significant dropout of individual amplicons. The amplicons are relatively short (~400 bp). Would it be possible to modify your code so that reject decisions are made after the first, say, 20 or 40 bases, with the idea of enriching under-represented amplicons?

Cheers,

Alex
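A very rough sketch of the kind of balancing logic being asked about, assuming a decision can be made from the first basecalled chunk and that each mapping can be assigned to an amplicon; every name and threshold here is hypothetical:

from collections import Counter

amplicon_counts = Counter()  # reads accepted so far, per amplicon

def decide(amplicon, counts=amplicon_counts, slack=1.2):
    # Reject reads from amplicons that are already over-represented
    if amplicon is None:
        return "proceed"  # no mapping yet, wait for more signal
    if counts:
        mean = sum(counts.values()) / len(counts)
        if counts[amplicon] > slack * mean:
            return "unblock"  # this amplicon already has plenty of reads
    counts[amplicon] += 1
    return "stop_receiving"  # keep sequencing this read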

ru_summarise_fq error

I'm trying to reproduce the tests described in the readme, and I received this error:

Using reference: /data/minimap2.mmi
Traceback (most recent call last):
  File "/home/minion/read_until/bin/ru_summarise_fq", line 11, in <module>
    load_entry_point('ru==2.0.0', 'console_scripts', 'ru_summarise_fq')()
  File "/home/minion/read_until/lib/python3.6/site-packages/ru/summarise_fq.py", line 125, in main
    "{:.0f}".format(stdev(data)),
  File "/usr/lib/python3.6/statistics.py", line 650, in stdev
    var = variance(data, xbar)
  File "/usr/lib/python3.6/statistics.py", line 588, in variance
    raise StatisticsError('variance requires at least two data points')

I used the demo file and changed chr22 to chr15.

The experiment ran on a MinION connected to a computer with a GPU (2080 Ti), so MinKNOW did not basecall directly (as that would run on the CPU).

I had the "feeling" that reads were being pushed out correctly and that the GPU basecall_server connected to the RU scripts was working. However, I never installed minimap2, so I'm unsure how ru knows whether a read maps to the correct region?

After the test (~15 min) I basecalled using the same GPU basecall_server (v3.4.4) and ran the above command. Did I forget a step? The fastqs are 7.3 MB, 13 MB, 30 MB and 32 MB, so I guess they contain something?

Thanks for this nice pioneering work :)
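For what it's worth, the traceback shows statistics.stdev being called on a dataset with fewer than two values, which happens when a FASTQ contains zero or one read. A minimal guard of the kind that would avoid the crash (purely illustrative, not the project's code):

from statistics import mean, stdev

def length_summary(read_lengths):
    # Summarise read lengths without crashing on tiny datasets
    if not read_lengths:
        return {"n": 0, "mean": "NA", "stdev": "NA"}
    return {
        "n": len(read_lengths),
        "mean": "{:.0f}".format(mean(read_lengths)),
        # stdev needs at least two data points
        "stdev": "{:.0f}".format(stdev(read_lengths)) if len(read_lengths) > 1 else "NA",
    }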

Playback own fast5

Dear RU authors,

Thanks to the well-written documents, we ran the simulation using the provided fast5 quite well.
However, we couldn't apply the simulation protocol to our own fast5 files generated by MinKNOW (19.12.5, GridION).
We found the internal format of your fast5 quite different from those generated by MinKNOW (ours are multiple fast5 files per run with no channel groups, while yours is a single fast5 with channel groups).
Is there any way we can run a simulation using our previous runs?

Thanks,
Yao-Ting

Specifying targets in a bed file

I could not find this in the documentation, apologies if I missed it, but I think a common use case would be to have the targets of interest specified as intervals in a bed file. I noticed the [conditions.0] 'targets' field takes either a string or an array of targets; could it also take a (bed) file?

Thanks,
Wouter
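Until (or unless) BED input is supported, one workaround is to convert the BED intervals into the 'contig,start,stop,strand' target strings described in the README (one per line in a targets file). A small conversion sketch; the function name and output handling are illustrative only:

import csv

def bed_to_targets(bed_path):
    # One 'contig,start,stop,strand' target per BED interval
    targets = []
    with open(bed_path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith(("#", "track", "browser")):
                continue
            contig, start, stop = row[0], row[1], row[2]
            strand = row[5] if len(row) > 5 else "+"
            targets.append(f"{contig},{start},{stop},{strand}")
    return targets

# e.g. write one target per line for the targets file referenced in the TOML
# print("\n".join(bed_to_targets("targets.bed")))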

ru_generators sometimes fails when started before run starts, runs fine on restart of ru_generator

If I start the ru_generator before the run has started it goes into a wait mode, starts when the run starts, and often immediately fails. Restarting ru_generator fixes the issue.

2020-06-19 10:06:43,253 Manager ru_generators --experiment-name test --device MN26516 --toml example.toml --log-file RU_log.log
2020-06-19 10:06:43,254 Manager batch_size=512
2020-06-19 10:06:43,254 Manager cache_size=512
2020-06-19 10:06:43,254 Manager channels=[1, 512]
2020-06-19 10:06:43,254 Manager chunk_log=chunk_log.log
2020-06-19 10:06:43,254 Manager device=MN26516
2020-06-19 10:06:43,254 Manager dry_run=False
2020-06-19 10:06:43,254 Manager experiment_name=test
2020-06-19 10:06:43,254 Manager host=127.0.0.1
2020-06-19 10:06:43,254 Manager log_file=RU_log.log
2020-06-19 10:06:43,254 Manager log_format=%(asctime)s %(name)s %(message)s
2020-06-19 10:06:43,254 Manager log_level=info
2020-06-19 10:06:43,254 Manager paf_log=paflog.log
2020-06-19 10:06:43,254 Manager port=9501
2020-06-19 10:06:43,254 Manager read_cache=AccumulatingCache
2020-06-19 10:06:43,255 Manager run_time=172800
2020-06-19 10:06:43,255 Manager throttle=0.1
2020-06-19 10:06:43,255 Manager toml=example.toml
2020-06-19 10:06:43,255 Manager unblock_duration=0.1
2020-06-19 10:06:43,255 Manager workers=1
2020-06-19 10:06:43,308 Manager Initialising minimap2 mapper
2020-06-19 10:06:43,334 Manager Mapper initialised
2020-06-19 10:06:43,334 read_until_api_v2.main Client type: many chunk
2020-06-19 10:06:43,334 read_until_api_v2.main Cache type: AccumulatingCache
2020-06-19 10:06:43,335 read_until_api_v2.main Filter for classes: adapter and strand
2020-06-19 10:06:43,335 read_until_api_v2.main Creating rpc connection for device MN26516.
2020-06-19 10:06:43,759 read_until_api_v2.main Loaded RPC
2020-06-19 10:06:43,759 read_until_api_v2.main Waiting for device to start processing

Once the flow cell starts running it often crashes as follows:

...
2020-06-19 10:07:47,668 Manager Creating 1 workers
2020-06-19 10:07:47,669 read_until_api_v2.main Processing started
2020-06-19 10:07:47,669 read_until_api_v2.main Sending init command, channels:1-512, min_chunk:0
2020-06-19 10:07:47,673 read_until_api_v2.main <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.FAILED_PRECONDITION
        details = "Data acquisition not running, or analysis not enabled"
        debug_error_string = "{"created":"@1592554067.669713569","description":"Error received from peer ipv4:127.0.0.1:8002","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Data acquisition not running, or analysis not enabled","grpc_status":9}"

Then starting ru_generators again with the run already underway works just fine. Any idea what the problem is here? Could this be a timeout issue?
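The FAILED_PRECONDITION status suggests the init request is being sent before data acquisition is actually running in MinKNOW. Outside of a fix in the client itself, a generic workaround is to retry the setup call until it succeeds; a hedged sketch, where send_init is a placeholder for whatever issues the setup request:

import time
import grpc

def init_with_retry(send_init, retries=30, delay=2.0):
    # Retry an init call until data acquisition is actually running
    for _ in range(retries):
        try:
            return send_init()
        except grpc.RpcError as err:
            if err.code() != grpc.StatusCode.FAILED_PRECONDITION:
                raise  # a different error, don't mask it
            time.sleep(delay)  # acquisition not running yet, wait and retry
    raise RuntimeError("data acquisition never became ready")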

no_map reads seem mappable

Dear Authors,
When playing with your provided fast5, we occasionally observed that several reads are continuously reported as "no_map" in the RU log even after many iterations and at long lengths (>30, >7 kb). Yet they are mappable by minimap2 after MinKNOW basecalling (see attached). These reads should be unblocked, as they were not in the TOML targets. But because they were consistently reported as "no_map" (still within max_chunks), they were sequenced in their entirety instead. We are curious why minimap2 and the mappy used in RU produce different mapping results (using the same map-ont setting). Thanks, Yao-Ting.

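One way to sanity-check such reads is to map the basecalled sequences with mappy against the same .mmi index and map-ont preset the scripts are configured with; a diagnostic sketch with placeholder file names:

import mappy

# same index and preset that the read until scripts use
aligner = mappy.Aligner("reference.mmi", preset="map-ont")

def best_hit(seq):
    # Best mappy hit for a basecalled sequence, or None if it really is no_map
    hits = list(aligner.map(seq))
    return max(hits, key=lambda h: h.mapq) if hits else None

# e.g. iterate over the suspect reads from the basecalled FASTQ
# for name, seq, _ in mappy.fastx_read("suspect_reads.fastq"):
#     print(name, best_hit(seq))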

Read Until on a flongle

Hello,

I have managed to set up read until on our laptop (12 threads), with the aim of running read until on flongles. As less data is generated at once, the hope is that our laptop will be able to keep up. To simulate a flongle, would it just be a case of adjusting --channels in the ru_generator command? I've set this to 128 below, but I think it should be 126 to represent a flongle.

If you can suggest any optimisations then I'd be very grateful.

Guppy server command:
Downloads/ont-guppy-cpu/bin/guppy_basecall_server --config /Downloads/ont-guppy-cpu/data/dna_r9.4.1_450bps_fast.cfg --port 5556 --log_path /ReadUntil/read_until/Guppy_server/log.txt --ipc_threads 1 --max_queued_reads 1000 --num_callers 2 --cpu_threads_per_caller 3

ru_generator command:
ru_generators --device MN16259 --experiment-name "rut2" --toml human_chr_selection.toml --log-file rut2.log --log-level info --workers 4 --channels 1 128

I have modified break_reads_after_seconds to be 0.4 and max_chunks to 4.

The current output I'm getting from ru_generators looks okay so far:
2020-04-21 09:56:12,943 ru.ru_gen 9R/0.14979s
2020-04-21 09:56:12,993 ru.ru_gen 4R/0.08269s
2020-04-21 09:56:13,146 ru.ru_gen 3R/0.10329s
2020-04-21 09:56:13,265 ru.ru_gen 5R/0.11896s

How it looks after ~30 minutes (screenshot attached).

Native barcoding of samples

Hi, me again! I figured this question might be of interest to multiple users so I'm asking it here...
Is there something we have to take into account when we use (native) barcoding on samples prior to read until?

Is this something you tried?
Naively I would expect the barcodes not to align to any target, but we should quickly have enough unique sequence to enable a decision...

Eventually, we might want to balance out the coverage of the barcodes with read until, but right now we'll assume that our lab-fu is good enough to get some equimolar pooling right.

Cheers,
Wouter

Specifying coordinates in TOML

Readuntil target enrichment is working great if I specify individual chromosomes (either in the ru_generator TOML or in a separate txt file). However, if I try to specify a specific region of a chromosome (as detailed in the README and in issue #22), I get no enrichment. At the moment, I'm working with the fast5 files suggested in the readuntil README. Apologies if I'm doing something silly here, but any suggestions?
