fanglab / nanodisco

nanodisco: a toolbox for discovering and exploiting multiple types of DNA methylation from individual bacteria and microbiomes using nanopore sequencing.

License: Other

Dockerfile 0.32% R 91.83% Shell 7.54% Singularity 0.31%

nanodisco's Issues

Object does not exist in this HDF5 file

Greetings,
I'm currently running into an error in Nanodisco that I can't seem to figure out. I have Nanodisco's Singularity container installed on my Linux workstation, and I'm using this command to begin the workflow on some Geobacillus nanopore data I generated:

nanodisco preprocess -p 30 -f dataset/ -o analysis/preprocessed_subset -r /home/nanodisco/test_Geo.fasta -s Geo

The ETA and the progress bar never change, and the error is thrown after about five minutes.

[2021-09-07 09:15:29] Localize all fast5 files.
[2021-09-07 09:15:29] Found 351 fast5 files.
[2021-09-07 09:15:29] Extract sequences from fast5.
Processed fast5 [-------------------------] 0% eta: ?s (elapsed: 00:00:00)Error in { :
task 1 failed - "task 1 failed - "task 1 failed - "Object '/read_004f4ce5-b3a5-4583-aa81-6e7acf854000/Analyses/Basecall_1D_000/BaseCalled_template' does not exist in this HDF5 file."""
Calls: extract.sequence -> %dopar% -> <Anonymous>
Execution halted
Unexpected error during read extraction process.

Any assistance would be greatly appreciated.

Open and parse .RDS file

Hi, I'm trying to open and visualize the .RDS file, but I can't find the right way to do it. Could you please give me some suggestions? This information is very important for locating the motifs in the genome of my dataset. Thank you so much!
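
For reference, an .RDS file is R's serialized-object format, and base R's readRDS() loads it. Below is a minimal sketch, assuming Rscript is on the PATH and that the stored object can be coerced to a data frame. The helper name rds_to_tsv and the file paths are made up for illustration and are not part of nanodisco.

```shell
#!/usr/bin/env bash
# rds_to_tsv: hypothetical helper that dumps an .RDS object to a tab-separated file.
# Assumption: the stored object can be coerced to a data frame.
rds_to_tsv() {
  local rds="$1" tsv="$2"
  if [ ! -f "$rds" ]; then
    echo "no such file: $rds" >&2
    return 1
  fi
  # readRDS(), str() and write.table() are all base R functions.
  Rscript -e '
    args <- commandArgs(trailingOnly = TRUE)
    x <- readRDS(args[1])
    str(x)
    write.table(as.data.frame(x), file = args[2], sep = "\t",
                quote = FALSE, row.names = FALSE)
  ' "$rds" "$tsv"
}

# Example (placeholder paths):
#   rds_to_tsv analysis/my_sample_difference.RDS my_sample_difference.tsv
```

The resulting table can then be opened in any spreadsheet or text tool. str() prints the object's structure first, which helps if the object turns out not to be a plain table.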

Changes in coverage before and after Nanodisco preprocessing

Hi Alan,

We normalized our fast5 coverage across multiple samples to around 120x prior to running Nanodisco. We checked coverage with BWA, using the same commands Nanodisco runs. When we looked at the coverage reported by the preprocessing BAM outputs, we observed that coverage had dropped by 20-40x, depending on the sample.

I'm assuming some of the fast5 reads are not being converted to fasta, or are being filtered out, but I'm not sure why or what we can change to avoid this.

Do you have any insight into why this might have occurred and how we can avoid this?

Regards
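
One way to put numbers on such a comparison, sketched under the assumption that samtools is installed: `samtools depth -a` prints one line per reference position (contig, position, depth), so the mean of the third column is the average coverage. The function name mean_depth and the BAM file names are placeholders.

```shell
#!/usr/bin/env bash
# mean_depth: average the third column (depth) of `samtools depth` output read from stdin.
mean_depth() {
  awk '{ sum += $3 }
       END { if (NR > 0) printf "%.1f\n", sum / NR; else print "NA" }'
}

# Example (placeholder file names; requires samtools):
#   samtools depth -a before_preprocess.bam | mean_depth
#   samtools depth -a after_preprocess.bam  | mean_depth
```

Running the same function on the BAM made before preprocessing and on the one written by preprocessing gives two directly comparable figures.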

Could nanodisco detect methylation from other species

Hi @fanggang @jbeaulaurier, could nanodisco detect methylation in other species, such as human?

Potential issues with Guppy, the sup basecalling model, and barcodes

I'm currently attempting to do de novo methylation discovery on a Geobacillus species. I have in hand methylation motifs detected via PacBio, and I'm attempting to replicate this via Nanodisco. I generated WGA and native reads, but when I tried to compute differences I got through a quarter of the genome with no differences detected. Questions:

Does the relative read depth matter between the native and the WGA reads? I have a much shallower read depth for the WGA reads due to the reuse of a previous flowcell.
Do you anticipate barcodes potentially causing problems in your pipeline?
I'm using Guppy v5.0.14 and the dna_r9.4.1_450bps_sup.cfg calling model. Is this compatible with Nanodisco? If not, would you recommend using this highly accurate caller to produce the reference genome, then re-basecalling with some other version of Guppy, or even Albacore, to produce the native and WGA basecalled fast5 files needed?

Thanks so much for your help! I very much look forward to using this wonderful tool you and your colleagues have developed.

-Bill

Installation via (bio)conda

Is it possible to create a conda package of this software? Conda has a huge ecosystem of packages, and it can also automatically create minimal containers for you. This would also make the software usable by people without any container support.

Error in preprocess

Hello! I have a problem with the preprocess step: the task keeps running, but there are no results in the output directory.
root@nanodisco:/$ nanodisco preprocess -p 1 -f /nano40/9Sfast5pass -s 9Sfast5pass -o ./9Sresult -r nano40/LT2.fasta
[2022-04-22 19:40:47] Localize all fast5 files.
[2022-04-22 19:40:47] Found 22 fast5 files.
[2022-04-22 19:40:47] Extract sequences from fast5.
Processed fast5 [-------------------------] 0% eta: ?s (elapsed: 00:00:00)
Any help would be greatly appreciated!

Thanks,
kk

R package does not have a namespace

Hi,

First time Singularity (and Nanodisco) user and I'm getting this error when trying to run preprocess. Would it be something to do with my R or during installation, like during the below step in postInstall? A bit confused as parallel is a base package?

Thanks!

# Prepare for installing R packages
echo "options(repos ='https://cran.rstudio.com/', \
  unzip = 'internal', \
  download.file.method ='libcurl', \
  Ncpus = parallel::detectCores() )" >> /usr/local/lib/R/etc/Rprofile.site #  /etc/R/Rprofile.site

Unavailable to run personal metadate via Nanodisco

To whom it concerns,

I have run into the following situation:

wzq@nanodisco:~$ nanodisco preprocess -p 40 -f dataset/seameta_data/20180609_1025_sea_meta_9_wzq/ -s seameta_9 -o analysis/preprocessed_subet -r reference/seameta_9.fasta
[2020-11-23 09:23:04] Extract sequences from fast5.
Warning message:
In extract.sequence(path_input, base_name, path_output, nb_threads, :
448214 reads weren't basecalled.
No reads were extracted. Please check that -f/--path_fast5 is correct.

Is this caused by a problem with the basecalling software?

Best

reference metagenome file

Hi,

I was wondering how you recommend the reference metagenome file be generated when doing the 'generate current differences' step for metagenomic binning. Can the metagenomic contigs for my sample of interest be used, or is it necessary to combine reference quality genomes perhaps based on the organism abundance in my sample? If contigs can be used, do you have recommendations about using an assembly constructed from the wga run, the native run, or combining the reads from both runs? Thanks in advance!

Columns in nanodisco score output

Hi - I have a quick question about the nanodisco score output. What does each output file column represent? This is not stated in the docs (https://nanodisco.readthedocs.io/en/latest/commands_details.html?highlight=coverage#score).

Any help would be greatly appreciated!

Thanks,
Jon

nanodisco preprocess ERROR

Hi,
I'm using nanodisco to analyse methylation profiles from bacterial isolates. I installed Singularity (version 3.5.3) and nanodisco (v1.0.3), then I tried to import my fast5 files into the nanodisco workflow.

Here is the command line:

singularity pull --name nanodisco.sif library://fanglab/default/nanodisco
singularity verify nanodisco.sif
Verifying partition: FS:
45AD365F84BDF7402CC3CA83F93AC2888FC02443
[REMOTE]  Alan Tourancheau <[email protected]>
[OK]      Data integrity verified
INFO:    Container verified: nanodisco.sif
singularity build --sandbox nd_env nanodisco.sif

It seems that everything was fine with the software.
Then I tried to run preprocess command but there was an error:

root@nanodisco:~$ nanodisco preprocess -p 4 -f dataset/test/ -s A_BAU_01 -o analysis/preprocessed_subset -r reference/GCF_008632635.1_ASM863263v1_genomic.fna
[2022-08-05 15:23:02] Localize all fast5 files.
[2022-08-05 15:23:02]     Found 1 fast5 files.
[2022-08-05 15:23:02] Extract sequences from fast5.
 Processed fast5 [-------------------------]   0% eta:  ?s (elapsed: 00:00:00)Error in { : 
  task 1 failed - "task 1 failed - "task 1 failed - "Object '/read_000272ba-22a8-4660-bdcd-546de7a4c075/Analyses/Basecall_1D_000/BaseCalled_template' does not exist in this HDF5 file."""
Calls: extract.sequence -> %dopar% -> <Anonymous>
Execution halted
Unexpected error during read extraction process.

When I run the h5ls command on my data (outside nanodisco), I get the following output:

(base) [root@localhost EC_NAT]# h5ls /mnt/NFS_SHARE_17/A.baumannii_FRANCI/fast5_pass/barcode01/FAQ13937_pass_barcode01_9525cb6c_0.fast5 
read_000272ba-22a8-4660-bdcd-546de7a4c075 Group
read_001c24e0-a0a6-4893-80b0-b99a8315f045 Group
read_001f353e-ad41-4873-8110-21d5efe4d469 Group
read_0024dd71-2d5c-401a-8f76-8740cb5d9b04 Group
read_002edada-f555-4be4-9c80-e412549d8f3e Group
read_00389471-67f0-4142-9f72-a03a0cc33529 Group
read_004598b6-2d1d-40f7-9361-291926673138 Group

[...]
and so on for each read stored in fast5 file.

If I run the same command on the fast5 files from the nanodisco tutorial (obtained via get_data_bacteria, in the home/nanodisco/dataset/EC_NAT/ folder), I obtain:

(base) [root@localhost EC_NAT]# h5ls EC_NAT.read1.fast5 
Analyses                 Group
Raw                      Group
UniqueGlobalKey          Group

Based on the h5ls output, my files seem to have a different structure. How can I obtain the same file structure? Could you help me solve this issue? Thanks in advance.
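
The two listings above match the two common fast5 layouts: top-level `read_<uuid>` groups indicate a multi-read file, while top-level `Analyses`, `Raw`, and `UniqueGlobalKey` groups indicate a single-read file. A small sketch that tells them apart from `h5ls` output is shown below. The function name fast5_layout is hypothetical, and the classification rule is only the heuristic just described.

```shell
#!/usr/bin/env bash
# fast5_layout: classify a fast5 file from the top-level `h5ls` listing read on stdin.
# Heuristic: read_* groups -> multi-read; Raw/UniqueGlobalKey groups -> single-read.
fast5_layout() {
  awk '
    $1 ~ /^read_/                          { multi = 1 }
    $1 == "Raw" || $1 == "UniqueGlobalKey" { single = 1 }
    END {
      if (multi)       print "multi-read"
      else if (single) print "single-read"
      else             print "unknown"
    }'
}

# Example (requires the HDF5 command-line tools):
#   h5ls FAQ13937_pass_barcode01_9525cb6c_0.fast5 | fast5_layout
```

ONT's ont_fast5_api package includes a multi_to_single_fast5 converter. Whether nanodisco requires that conversion is not confirmed in this thread. Note also that the error itself reports a missing `BaseCalled_template` object inside the read group, so the layout may not be the only difference between the two files.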

Guppy compatibility

Hello,

I'm planning on running nanodisco on isolate genomes that I have recently generated native and WGA data for. My lab has a fast Guppy GPU Docker image for demultiplexing and basecalling the raw multi-read fast5 files, and I would prefer not to try to figure out how to work with Albacore. I got nanodisco working on the test E. coli data provided, and I just wanted to make sure it will work for my data, at least for the de novo motif discovery. I understand that the methylation type fine mapping for Guppy is still being trained.

Nathan

List of common methylation motifs is missing

Hi! I'm trying to do microbiome methylation binning with the automated approach without specific de novo discovered motifs. I'm using the Docker installation as we don't have Singularity set up right now. Where can I find list_motifs.RDS?

Thank you!
Lynn

root@nanodisco:~$ nanodisco profile -p 20 -r reference/6plex.fasta -d analysis/Mix2_6plex_difference.RDS -w analysis/preprocessed/Mix2_wga_subsample.cov -n analysis/preprocessed/Mix2_native_subsample.cov -b Mix2_6plex_all -o analysis/binning -a all
List of common methylation motifs (/home/nanodisco/reference/list_motifs.RDS) is missing. Please reach out for help on GitHub.

Installation / R errors?

Fang Lab,

Unfortunately, the Singularity installation option wasn't working for our campus cluster or my lab machines. As such, I had to set up a conda environment on our cluster and attempted to leverage available modules with compatible software versions of bwa, R, etc. After fixing file paths in your scripts, things started to look like they were working. However, using your test datasets for both E. coli and the metagenome, I'm hitting two different R errors:

$ nanodisco characterize -p 4 -b Ecoli -d dataset/EC_difference.RDS -o analysis/Ecoli_motifs -m GATC,CCWGG,GCACNNNNNNGTT -t nn -r reference/Ecoli_K12_MG1655_ATCC47076.fasta
[2022-06-13 12:34:39] Load supplied current differences.
[2022-06-13 12:34:46] Check current differences file version.
[2022-06-13 12:34:46] Determine motif signature center.
[2022-06-13 12:34:46] Process GATC.
[2022-06-13 12:34:46] Tag GATC occurrences.
[2022-06-13 12:34:55] Score GATC modified position.
[2022-06-13 12:34:59] Process CCWGG.
[2022-06-13 12:34:59] Tag CCWGG occurrences.
[2022-06-13 12:35:06] Score CCWGG modified position.
[2022-06-13 12:35:09] Process GCACNNNNNNGTT.
[2022-06-13 12:35:09] Tag GCACNNNNNNGTT occurrences.
[2022-06-13 12:35:15] Score GCACNNNNNNGTT modified position.
Error in { : task 1 failed - "Invalid unit"
Calls: find.signature.center -> %do% -> <Anonymous>
Execution halted

It doesn't matter how many motifs I input (1, 3, and all 4 tested). It fails after the last one.

Is there a way to add verbose reporting to R within the context of your code? It seems like the error is stemming somewhere in characterize.R ln 63 referring to analysis_functions.R find.signature.center function on ln 2179.

After this failure, I then tried the metagenome example. The first two commands ran without error. The third errored out:

$ nanodisco plot_binning -r reference/metagenome.fasta -u analysis/binning/methylation_binning_MGM1_motif.RDS -b MGM1_motif -o analysis/binning -a reference/motif_binning_annotation.RDS --MGEs_file dataset/list_MGE_contigs.txt
[2022-06-13 14:23:15] Prepare default metagenome annotation.
[2022-06-13 14:23:16] Load additional annotation.
[2022-06-13 14:23:17] Plot binning.
Error in unit(unclass(x), attr(x, "unit"), attr(x, "data")) :
Invalid unit
Calls: plot.tsne.motifs.score ... convertUnit -> upgradeUnit -> upgradeUnit.unit -> unit
Execution halted

Given your intimate familiarity with your code - any suggestions you have would be most welcome.

My best guess is that this is an R package issue? Maybe? Since I was using a system install of R 4.1.2, there were some packages that I couldn't overwrite/update. What version of R would you recommend if I am installing it fresh within the conda environment?

Regards,
Patrick

nanodisco difference can't find the proper basecalling version

Hi!

Thanks for nanodisco, very nice tool!

Running the program in your example dataset works perfectly in my computer. However, when I try it in my data I am getting the following error:

Command

nanodisco difference -nj 5 -nc 1 -p 5 -f 1 -l 513 -i nanodisco/analysis/preprocessed_subset -o nanodisco/analysis/difference_subset -w barcode01 -n barcode07 -r nanodisco/reference/KPA.fasta

Output

local:5/56/100%/2.2s Error in if (grepl("Albacore", f5_data$name)) { :                                                    
argument is of length zero                                                                                            
Calls: check.basecall.version -> extract.basecall.version
Execution halted

The output happens over and over again for all the jobs.

My fast5 were basecalled with guppy 4.0.11+f1071ce and the reads' sequence is extracted without problem with nanodisco preprocess

Any idea what might be happening?

prepare.parameters.ROC

Hi,
I'm trying to draw a ROC curve, but I don't know how to get these parameters. Please help me. I would be very grateful.

Here is the definition of this function.
prepare.parameters.ROC <- function(smoothing_win_sizes, peak_filtering_radii, peak_widths, win_sizes, offsets, start_points, coverages=NA, thresholds=NA)

Some questions about methylation detection

Hi, dear professor. I used nanodisco to get the combined current differences file. Now I want to find out which sites on the reference genome are methylated based on this file. What can I do next? For example, I want to know the probability of methylation at 1000 positions on the reference genome. In addition, for the two parameters "t_test_pval" and "u_test_pval" given in this file, does a smaller value mean that a modification is more likely? I ask because I found that they are not directly related to "mean_diff".
Thank you very much!

Suitable to compare isogenic methylation profiles?

Hello,

I am wondering if nanodisco is a suitable program to compare the methylation patterns of two isogenic strains of E. coli. Would this be possible through the methylation binning of metagenomic contigs functions?

Cheers,
Laura

WGA

Hi, I have WGA data for Escherichia coli. Can this data be used as the reference together with NAT data from multiple Escherichia coli strains? Or does each E. coli strain need its own WGA data?
Thanks a lot.

Issue when performing 'nanodisco characterize'

Dear Dr. Tourancheau,
I'm a PhD student approaching nanopore sequencing and DNA methylation characterization for the first time. I was trying out nanodisco out of curiosity, and everything proceeded fine until the typing and fine mapping step. In this step I ran the following command:

nanodisco characterize -p 20 -b Ecoli -d dataset/EC_subset_difference.RDS -o analysis/motif_detection/Ecoli_motifs -m CTGRYG,AWTMWTKAWAAAWR,CRCCAKCWGCGCRA,CTKCTCGKYAAAAC,ACWTCGMTCCSKGC -t nn,rf,knn -r reference/Escherichia_coli_ATCC_35218.fasta
Which returned the following error:
Error in { :
task 2 failed - "arguments imply differing number of rows: 1, 0"
Calls: find.signature.center -> %do% -> <Anonymous>
Execution halted

I also tried the following:
nanodisco characterize -p 20 -b Ecoli -d dataset/EC_subset_difference.RDS -o analysis/motif_detection/Ecoli_motifs -m CTGRYG -t nn,rf,knn -r reference/Escherichia_coli_ATCC_35218.fasta
Which returned the following error:
Error in { :
task 2 failed - "arguments imply differing number of rows: 1, 0"
Calls: find.signature.center -> %do% -> <Anonymous>
Execution halted

Hoping that you can help me with some advice, I provide a txt file explaining in more detail the calls I performed in my pipeline (https://drive.google.com/file/d/1ZMJxPnyO_ioS6wL1YQEofnq8mRjJMBv6/view?usp=sharing). You can also find my compressed workspace at the following link (approximately 1 GB):
https://drive.google.com/file/d/1rtiDOthSBrA_j6-VrqxxgowXvjBcs1rv/view?usp=sharing
Thank you for your time and attention.
Kind regards, Lorenzo Casbarra

installing issue

hi,
I'm trying to install nanodisco for methylation analysis on ONT data.
I installed Singularity (version 3.8.6) via conda. When I build nanodisco, it gives me the warnings and errors reported below:

#first I download the image:
singularity pull --name nanodisco.sif library://fanglab/default/nanodisco

INFO: Downloading library image
1.9GiB / 1.9GiB [==========================================] 100 % 14.9 MiB/s 0s
WARNING: integrity: signature not found for object group 1
WARNING: Skipping container verification

#Then I tried to build nanodisco.sif
singularity build nd_env nanodisco.sif

INFO: Starting build...
INFO: Verifying bootstrap image nanodisco.sif
WARNING: integrity: signature not found for object group 1
WARNING: Bootstrap image could not be verified, but build will continue.
ERROR: unpackSIF failed: root filesystem extraction failed: could not extract squashfs data, unsquashfs not found
FATAL: While performing build: packer failed to pack: root filesystem extraction failed: could not extract squashfs data, unsquashfs not found

I downloaded squashfs-tools-4.5.1 and tried to rebuild nanodisco, but it still fails with the same error. Could you help me, please?
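
The error states that the `unsquashfs` binary was not found. Downloading the squashfs-tools sources is not enough on its own: the compiled `unsquashfs` executable has to be visible on the PATH of the shell that runs singularity. For a conda-installed Singularity, that probably means the same conda environment, which is an assumption here. A quick check, with require_tool being a made-up helper name:

```shell
#!/usr/bin/env bash
# require_tool: report where a command lives, or fail if it is not on the PATH.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: $(command -v "$1")"
  else
    echo "$1: not found on PATH" >&2
    return 1
  fi
}

# Example: run this in the same shell/environment used for `singularity build`.
#   require_tool unsquashfs
```

If the check fails, the build will most likely keep reporting the same `unsquashfs not found` message until the tool is installed into, or exposed to, that environment.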

no WGA data

I used Illumina data for Nanopolish. Is there any way to use nanodisco without having the WGA data?
I understand the concern that some methylation motifs have a weak signal-to-noise ratio, etc.

Thanks!

Normalization did not reach convergence

Hi,

When running the difference function I am getting output such as:

Normalization did not reach convergence for 1 read(s) on chunk #375
Normalization did not reach convergence for 1 read(s) on chunk #411
No regional downsampling for SC12A_B2_WGA chunk #413: region too short (contig_1_pilon:2062580-2062587,+; 6 bp).
Regional downsampling: 558 reads from SC12A_B2_WGA chunk #413 (contig_1_pilon:2061822-2065831,-; 4008 bp).
Localized downsampling: 54 reads from SC12A_B2_WGA chunk #413.

I'm wondering what the consequences of this might be, and whether I might need to filter or improve the dataset in some way.

Thanks

Error: tsne perplexity is too large

Hi, I am new to nanodisco and methylation profile analysis in general. I have come across the same issue multiple times while trying to perform methylation binning on a nanopore-sequenced dataset composed of 1 bacterium and 2 phages. I've been following the detailed tutorial on the website, and everything seems to go smoothly until I reach the binning step with nanodisco binning, where I systematically come across this error with both datasets (I used the automated profile matrix method):

[2022-08-30 09:27:48] Prepare default metagenome annotation.
[2022-08-30 09:27:48] Load supplied methylation profile matrix.
[2022-08-30 09:27:48] Retrieve contigs coverage information.
[2022-08-30 09:27:48] Perform binning with dimentionality reduction.
[2022-08-30 09:27:48]     Prepare methylation profile matrix.
[2022-08-30 09:27:48]     Dimentionality reduction.
[1]   3 123
Error in Rtsne.default(tsne_matrix, check_duplicates = FALSE, perplexity = tsne_perplexity,  : 
  Perplexity is too large.
Calls: tsne.motifs.score -> as.data.frame -> Rtsne -> Rtsne.default
Execution halted

What I find to be very odd is that the tsne matrix made by nanodisco only contains 3 distinct contigs for 123 motifs. Knowing that my filtered matrix from nanodisco filter_profile is of dimension 328 x 6 with the same 3 unique "contigs", it makes sense for the tsne matrix to be so small. However, when I compare my filtered profile matrix to the one in the example dataset, I notice that the example dataset contains WAY more contigs than mine (2905 unique contigs). Could it be a problem with the way my input files are treated which leads to all reads from a sample to be all placed within the same "contig" in the filtered profile matrix? Does that mean it's not possible to perform methylation binning on a single bacterial sample?

I also tried to lower the --tsne_perplexity parameter, but it doesn't change anything even when I put it to 1.

Finally, here is the command I used (dataset is named all):

nanodisco binning -r reference/all/all_ref.fasta -s analysis/all/binning_all/methylation_profile_all_auto_filtered.RDS -b all_binning_auto -o analysis/all/binning_all/binning --tsne_perplexity 5

Does anybody have a clue about what could be going wrong in the process? I've been looking into this issue for hours and have yet to find how to make my data compatible with the binning process.

Thanks a lot!
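
A possible explanation, offered as an assumption rather than a confirmed diagnosis: to my knowledge, Rtsne rejects any perplexity for which 3 * perplexity > nrow - 1, where nrow is the number of rows (here, contigs) in the input matrix. The `[1]   3 123` line in the log suggests 3 rows, so even `--tsne_perplexity 1` fails, because 3 * 1 > 2. The arithmetic can be sketched as follows (max_perplexity is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# max_perplexity: largest integer perplexity satisfying 3 * perplexity <= n - 1,
# where n is the number of rows (contigs) passed to t-SNE.
max_perplexity() {
  local n="$1"
  echo $(( (n - 1) / 3 ))
}

max_perplexity 3      # prints 0: no valid perplexity for 3 contigs
max_perplexity 2905   # prints 968: ample room for the example dataset
```

Under that rule, at least 4 contigs are needed before any perplexity value is accepted, which would be consistent with binning failing on a sample that contains only 3 contigs.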

"From" values not present in "X" error

Hi,

I have a group of samples which are giving me the below error during the difference command:

local:20/5/100%/171.8s The following `from` values were not present in `x`: -
local:20/6/100%/145.8s The following `from` values were not present in `x`: -
local:20/7/100%/127.0s The following `from` values were not present in `x`: -

It also appears that the chunks are skipped because the processing speeds up rapidly.

All my datasets have been basecalled with Guppy version 4.2.2 and my other samples haven't given this error. All the samples with this error were sequenced on the same flow cell, so I do suspect this is an issue with my input data, but do you know what features of the reads would cause Nanodisco to give this error?

Thanks

Nanodisco difference incomplete chunk analysis

Hi,
When running nanodisco difference on my files it is failing to complete processing for some of the chunks. The "stdout" file is showing that the analysis is proceeding to the "removing outliers" phase but it doesn't seem to progress beyond this and doesn't generate a difference.rds file. As this means that the large temporary files produced are never removed, this results in a build-up, eventually filling all available storage on the hard drive and halting the analysis. If I only run a single chunk which has previously failed to process, it produces all of the temporary files and treats the job as done, then the command ends without the actual output .rds file. The command I am running is as follows:

nanodisco difference -nj 4 -nc 1 -p 12 -i analysis/preprocessed_subset -o analysis/difference_subset -w DH5 -n DH5_Sal_BREX -r reference/DH5_reference_genome.fasta

As I say, it seems to process some chunks and not others (maybe around 2%). I originally thought that this might be a coverage issue as some of the regions involved seemed to be around the end of the reference genome, but some of the other sites implicated seem to have good coverage.

The analysis itself seems to run fine and correctly picks out modified motifs, so this isn't a game-breaking issue, as such. The main issue is more with taking up many Gb of storage to the point that it halts the analysis. I then have to go through and manually delete the temporary files and start it again from where it stopped.

I'm not sure if there is something that I am doing wrong or if there is anything that can be done to change this. Any advice or help appreciated.

Thanks

nanodisco difference: object 'path_basecalling' not found

Hi,

This looks like a great tool and I'm excited to use it on my data. I was encountering similar issues to #4 and #5, made the suggested modifications (replacing extract.R, preprocess.sh, and difference_functions.R), and successfully preprocessed my data using --basecall_version with reads called by guppy versions 3.6.1.

However, when I run:

nanodisco difference -nj 2 -nc 1 -p 2 -f 100 -l 110 -i analysis/preprocessed_subset -o analysis/difference_subset -w SL_WGA -n SL_WGS -r reference/SL.fasta

I get this error for each chunk:

Error in h5readAttributes(path_first_fast5, path_basecalling) :
   object 'path_basecalling' not found
Calls: check.basecall.version ... extract.basecall.version -> h5readAttributes -> H5Lexists
Execution halted

Any insight into this error? I may have something configured incorrectly, but I can't figure it out. I re-called my reads with guppy version 4.0.15 and repeated the process ( specifying --basecall_version Guppy:4.0.15 ), but that doesn't seem to make a difference.

Thanks!

De novo discovery of methylation motifs for metagenomic samples

Hi,

I have been working on generating the ***difference.RDS for my metagenomic samples, and I obtained a 126M sea9_subset_difference.RDS file.

However, when I tried to create the motif files for this dataset, I ran into the following problem.

wzq@nanodisco:~$ nanodisco motif -p 4 -b seameta9_example -d analysis/sea9_subset_difference.RDS -o analysis -r reference/seameta_9_wga_trimmed_10kb.fasta -a
[2021-03-10 02:04:41] Prepare output folder.
[2021-03-10 02:04:41] Load supplied current differences.
[2021-03-10 02:04:58] Detect motifs.
[2021-03-10 02:04:58] Processing statistical signal.
Error in do.ply(i) : task 13 failed - "wrong sign in 'by' argument"
Calls: wrapper.motif.detection ... ddply -> ldply -> llply -> ->
Execution halted

Looking forward to the solution.

Best,

Return no data during current difference

Hi,
When trying to calculate current differences, I keep getting "No data for chunk #xx" for all chunks in stdout. I have checked the coverage of my contigs and they are all suitably high (>10 WGA and >100 NAT).

[1] "2021-12-16 07:52:37 CET"
[1] " Prepare index for zymo_wga."
[1] " Extract read mapped on chunks for zymo_wga."
[1] " Link fast5 to fasta for zymo_wga."
[1] " Prepare index for zymo_nat."
[1] " Extract read mapped on chunks for zymo_nat."
[1] " Link fast5 to fasta for zymo_nat."
[1] "2021-12-16 07:53:17 CET"
[1] "Processing chunk #115"
[1] " Preparing zymo_wga input data for chunk #115"
[1] " Preparing zymo_nat input data for chunk #115"
[1] " Correcting mapping."
[1] " No data for chunk #115"
[1] " Remove temporary files."
[1] "Processing chunk #116"
[1] " Preparing zymo_wga input data for chunk #116"
[1] " Preparing zymo_nat input data for chunk #116"
[1] " Correcting mapping."
[1] " No data for chunk #116"
[1] " Remove temporary files."
[1] "2021-12-16 07:54:42

The fast5 is in multiread format and was basecalled using guppy v. 5.0.7 and assembled using flye v. 2.9

nanodisco characterize error task 1 failed and RDS file issue

hello,
we are running nanodisco and we got this error at the characterize step.

nanodisco characterize -p 4 -b baumani -d analysis/merged_difference/baumani_difference.RDS -o analysis/baumani_motifs -m GATC,CCWGG,GCACNNNNNNGTT,AACNNNNNNGTGC -t nn -r reference_genome/Acinetobacter_baumannii_ATCC_BAA_747.fasta
[2022-09-13 14:11:52] Load supplied current differences.
[2022-09-13 14:11:52] Check current differences file version.
Models for Guppy version 6.3.4+cfaa134 is not yet available but we are working on it.
Motif characterization will still proceed with the default model but obtained results might not be optimal.
Additional information can be found in our GitHub repository.
[2022-09-13 14:11:52] Determine motif signature center.
[2022-09-13 14:11:52]   Process GATC.
[2022-09-13 14:11:52]     Tag GATC occurrences.
[2022-09-13 14:11:55]     Score GATC modified position.
[2022-09-13 14:11:56]   Process CCWGG.
[2022-09-13 14:11:56]     Tag CCWGG occurrences.
[2022-09-13 14:11:57]     Score CCWGG modified position.
[2022-09-13 14:11:57]   Process GCACNNNNNNGTT.
[2022-09-13 14:11:57]     Tag GCACNNNNNNGTT occurrences.
[2022-09-13 14:11:59]     Score GCACNNNNNNGTT modified position.
[2022-09-13 14:11:59]   Process AACNNNNNNGTGC.
[2022-09-13 14:11:59]     Tag AACNNNNNNGTGC occurrences.
[2022-09-13 14:12:01]     Score AACNNNNNNGTGC modified position.
Error in { : 
  task 1 failed - "arguments imply differing number of rows: 1, 0"
Calls: find.signature.center -> %do% -> <Anonymous>
Execution halted

So we had a look at the RDS file generated during the nanodisco difference step. It looks like this:

contig	position	dir	strand	N_wga	N_nat	mean_diff	t_test_pval	u_test_pval
9a03e25654c44fe8_1	1	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	5001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	10001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	15001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	20001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	25001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	30001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	35001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	40001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	45001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	50001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	55001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	60001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	65001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	70001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	75001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	80001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	85001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	90001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	95001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	100001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	105001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	110001	rev	t	0	0	NA	NA	NA
9a03e25654c44fe8_1	115001	rev	t	0	0	NA	NA	NA

We don't find any current differences detected. This could be a biological issue, but what is also strange is that the positions reported are only every 5000 bp and only reverse sequences. Do you have an explanation for this?
Do we have to set a parameter in the difference step, or upstream, to have all the bases covered (1, 2, 3, ...)?
Thanks for your kind help.
RB
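
To see how much of such a table carries any signal, one option is to export it to a tab-separated file (for instance with R's write.table) and count the rows where both N_wga and N_nat are non-zero. This is a sketch assuming the column order shown above and a single header line. covered_rows is a hypothetical name.

```shell
#!/usr/bin/env bash
# covered_rows: count positions with reads in both samples.
# Expects a TSV on stdin with a header line; N_wga is column 5 and N_nat is column 6.
covered_rows() {
  awk -F '\t' '
    NR > 1 { total++; if (($5 + 0) > 0 && ($6 + 0) > 0) covered++ }
    END    { printf "%d of %d positions covered in both samples\n", covered, total }'
}

# Example (placeholder file name):
#   covered_rows < baumani_difference.tsv
```

A result of 0 covered positions would suggest that no reads reached the comparison at all, rather than a lack of detectable methylation.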

Preprocessing not possible with extracted fast5 files

Dear

We are running NanoDisco on a dataset in which the native fast5 files originate from a subset of the original fast5 dataset (multiple barcodes within a single file). This subset was obtained using the ONT fast5_subset command. However, whenever we run these datasets, we get the following error in the preprocessing step:

[2022-05-25 16:41:28] Extract sequences from fast5.
Warning message:
In extract.sequence(path_input, base_name, path_output, nb_threads, :
1 reads weren't basecalled.
No reads were extracted. Please check that -f/--path_fast5 is correct.

On the other hand, the amplified dataset (which did not originate from a subset of a bigger fast5 file and was generated as a single-barcode fast5 after sequencing) gets processed properly. Is it possible that the subset fast5 format differs in a way that the NanoDisco preprocess command cannot handle?

Thank you in advance.
Regards
Nick

Only one motif detected

Hi,

Thanks for developing nanodisco. It is a great tool. We used nanodisco to look for methylation in a Listeria strain but were only able to get one motif, and the prediction score for the methylated site is also quite low (please see attached). The sequencing coverage is around 150x. I was wondering if you would recommend increasing coverage to get more possible motifs.

Thanks,
Tongzhou
Motifs_classification_0h_nn_model.pdf

nanodisco score

Dear Alan,

Thank you for the great tool. Could you please explain how a methylation score in the Motifs_occurrences_scores file is calculated?
Regards,
Anna

nanodisco characterize

Hi, when I run characterize, I get the following messages:

Any suggestion about a solution?
Thanks a lot.

Computational resources with `nanodisco difference`

Hi,

First, thanks for the interesting tool!

I'm having a little trouble running nanodisco difference in the most effective way. The parameters -nj, -p, and -nc, and possibly how many chunks are subset (-f and -l), all seem to affect how resources are used.

Let's say I'm using an HPC with 64 CPUs and 64 GB of memory, and I have a reference genome with 1200 chunks. Using settings that I thought would be sensible, -nj 16 -p 2 -nc 2 -f 1 -l 500, nanodisco tried to request way more CPUs and memory than were available, and then my HPC got mad and canceled the job.

Is there some sort of formula or recommended settings for nanodisco difference based on available computational resources? Ideally, I wouldn't subset the genome with -f and -l. (I could also request more resources, if that would be helpful.)

Regards,

Mike
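
One rough way to reason about the question above is to budget jobs against the cores and memory of the allocation. The sketch below assumes that -nj is the number of concurrent jobs and -p the number of threads per job, and that each job needs a roughly fixed amount of memory. `mem_per_job_gb` is a hypothetical figure that would have to be measured on a test chunk. This is not a formula from the nanodisco authors, and real usage may be higher than the estimate.

```python
def max_parallel_jobs(cpus, mem_gb, threads_per_job, mem_per_job_gb):
    """Largest job count that stays within both the CPU and the memory budget."""
    if threads_per_job < 1 or mem_per_job_gb <= 0:
        raise ValueError("threads_per_job must be >= 1 and mem_per_job_gb > 0")
    by_cpu = cpus // threads_per_job          # jobs allowed by the core count
    by_mem = int(mem_gb // mem_per_job_gb)    # jobs allowed by the memory
    return max(0, min(by_cpu, by_mem))


# 64 CPUs and 64 GB as in the post, with 2 threads per job (-p 2).
# 8 GB per job is an assumed placeholder, not a measured value.
jobs = max_parallel_jobs(cpus=64, mem_gb=64, threads_per_job=2, mem_per_job_gb=8)
print(f"-nj {jobs} -p 2")  # prints "-nj 8 -p 2"
```

With these placeholder numbers memory, not the core count, is the limiting factor, which is why measuring the per-job footprint on a single chunk first is worthwhile.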

Problem reading the basecaller information

Hi Alan,

I'm having issues with Nanodisco reading the basecaller version from my reads. I'm using nanodisco version 1.0.2 in the Singularity container. If I run each step of the pipeline without specifying --basecall_version, it runs through preprocess, difference, merge and motif without error (and the motif results seem correct based on other data), but when I then run characterize I get this error:

[2021-06-21 14:58:04] Load supplied current differences.
[2021-06-21 14:58:04] Check current differences file version.
Error in if (!basecaller %in% c("Albacore", "Guppy")) { : 
  argument is of length zero
Calls: check.model.version
Execution halted

On the other hand, if I run from the beginning using the flag --basecall_version Guppy:4.0.11 on preprocess I get:

Parameter --basecall_version don't match available basecalling. Please provide the basecaller name and version (e.g. Guppy:3.2.4).
Available basecalling are displayed below.
  basecaller version  basecall_group
1    Unknown   4.0.5 Basecall_1D_000

Specifying --basecall_version Unknown:4.0.5 does not resolve the issue.

Essentially, it seems the nanodisco scripts are reading the "MinKNOW-Live-Basecalling" information in the fast5 file (which is near the top of the file) and ignoring the guppy information (which is near the bottom of the file). I don't code R at all, so I can't parse through your scripts to figure out what nanodisco is actually doing, however.

I've attached the first read from a set of fast5 files from a run in which this error occurred. These reads were live basecalled on the GridION, and the fastq files all look good from a length and quality standpoint. I hope it will help get to the bottom of this, but I am happy to provide more information.

I'd really love to get this resolved so I can use your tool! Thanks in advance.

0009d4a2-1f1f-4b8d-a7eb-a75d6fa1c401.fast5.zip
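
The mismatch described above can be illustrated with a small sketch. It assumes, based only on the error message, that the value passed to --basecall_version is compared as a `name:version` pair against the basecalling entries found in the fast5 file. The function names are hypothetical and this is not nanodisco's actual code.

```python
def parse_basecall_version(value):
    """Split a string such as 'Guppy:4.0.11' into ('Guppy', '4.0.11')."""
    name, sep, version = value.partition(":")
    if not sep or not name or not version:
        raise ValueError(f"expected <basecaller>:<version>, got {value!r}")
    return name, version


def find_basecall_group(requested, available):
    """Return the basecall group matching the request, or None if none matches."""
    name, version = parse_basecall_version(requested)
    for entry in available:
        if entry["basecaller"] == name and entry["version"] == version:
            return entry["basecall_group"]
    return None


# The single entry listed in the error output quoted in the post.
available = [{"basecaller": "Unknown", "version": "4.0.5",
              "basecall_group": "Basecall_1D_000"}]

print(find_basecall_group("Guppy:4.0.11", available))   # None
print(find_basecall_group("Unknown:4.0.5", available))  # Basecall_1D_000
```

Under this assumption, Guppy:4.0.11 can never match because the file only advertises an "Unknown" 4.0.5 entry. The traceback in the post also shows a later check against "Albacore" and "Guppy", so a basecaller name of "Unknown" may still be rejected downstream even when it matches at this stage.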

run error

Hi,
I'm using nanodisco to analyze methylation profiles.
I downloaded and built nanodisco via Singularity as suggested in the manual.
Here is my command line:

singularity pull --name nanodisco.sif library://fanglab/default/nanodisco
singularity verify nanodisco.sif
Verifying partition: FS:
45AD365F84BDF7402CC3CA83F93AC2888FC02443
[REMOTE]  Alan Tourancheau <[email protected]>
[OK]      Data integrity verified

INFO:    Container verified: nanodisco.sif

singularity build --sandbox nd_env nanodisco.sif

Then I opened the nanodisco environment and ran the preprocessing step:

singularity run --no-home -w nd_env
root@nanodisco:~/dataset$ nanodisco preprocess -p 10 -f barcode01/ -s barcode01 -o ../analysis/preprocessed_subset -r ../reference/ref.fasta
[2022-09-09 11:28:04] Localize all fast5 files.
[2022-09-09 11:28:04]     Found 23 fast5 files.
[2022-09-09 11:28:04] Extract sequences from fast5.
 Processed fast5 [-------------------------]   0% eta:  ?s (elapsed: 00:00:00)Error in { : 
  task 1 failed - "task 2 failed - "HDF5. Object header. Can't open object.""
Calls: extract.sequence -> %dopar% -> <Anonymous>
Execution halted
HDF5: infinite loop closing library
      L,G_top,T_top,P,P,Z,FD,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL
Unexpected error during read extraction process.

This command gave me an "HDF5: infinite loop closing library" error, could you help me to solve this issue?
Thanks in advance
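
Before digging into the container, it can help to rule out truncated or non-HDF5 input files. The sketch below is a generic check, not part of nanodisco. It relies on the fact that a standard HDF5 file starts with an 8-byte signature, and flags every `.fast5` file under a folder that does not. The assumption that the signature sits at offset 0 (no user block) is mine, and passing the check does not prove that the objects inside the file are intact.

```python
from pathlib import Path

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # first 8 bytes of a standard HDF5 file


def find_suspect_fast5(directory):
    """Return fast5 files that are empty or do not start with the HDF5 signature."""
    suspects = []
    for path in sorted(Path(directory).rglob("*.fast5")):
        with path.open("rb") as handle:
            header = handle.read(len(HDF5_SIGNATURE))
        if header != HDF5_SIGNATURE:
            suspects.append(path)
    return suspects


if __name__ == "__main__":
    # "barcode01" is the input folder used in the post.
    for path in find_suspect_fast5("barcode01"):
        print(f"not a readable HDF5 file: {path}")
```

If every file passes, the problem is more likely inside the files (for example a missing group) or in the environment than in the transfer of the data.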

error when running nanodisco motif command

Hi, I am running my own analysis and everything goes well until the motif search. It starts running, but after a while an error is shown:

[2021-11-24 01:33:50] Prepare output folder.
[2021-11-24 01:33:50] Load supplied current differences.
[2021-11-24 01:33:50] Detect motifs.
[2021-11-24 01:33:50] Processing statistical signal.
Error in eval(e, x, parent.frame()) : object 'p' not found
Calls: wrapper.motif.detection ... subset -> subset -> subset.data.frame -> eval -> eval
Execution halted

One thing I have noticed is that the per-chunk .rds files are about 300 bytes each and the final merged .rds file is just 3.4 kB.

However, when I run the example data, everything went well.

Any suggestion about a solution?

Thanks a lot
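
File sizes like the ones mentioned above can be checked systematically. The following is a minimal sketch that lists suspiciously small `.rds` files in an output folder. The 1024-byte threshold and the folder name are assumptions for illustration, and the script only looks at file sizes, not at the content of the RDS files.

```python
from pathlib import Path


def find_small_rds(directory, min_bytes=1024):
    """List (name, size) for .rds/.RDS files smaller than min_bytes."""
    small = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() == ".rds":
            size = path.stat().st_size
            if size < min_bytes:
                small.append((path.name, size))
    return small


if __name__ == "__main__":
    out_dir = Path("analysis/difference")  # hypothetical output folder
    if out_dir.is_dir():
        for name, size in find_small_rds(out_dir):
            print(f"{name}: {size} bytes")
```

If nearly every chunk file is flagged, the difference step probably produced no usable rows, and the inputs to that step are worth checking before rerunning motif.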

Column `motif` is unknown

Hi there, I ran into an error during nanodisco profile with -a all. I am using the docker installation and I redownloaded list_motifs.RDS as mentioned in #8 .

Appreciate the help!
Lynn

root@nanodisco:~$ nanodisco profile -p 15 -r reference/6plex.fasta -d analysis/Mix2_6plex_difference.RDS -w analysis/preprocessed/Mix2_wga_subsample.cov -n analysis/preprocessed/Mix2_native_subsample.cov -b Mix2_6plex_all -o analysis/binning -a all

Methylation profile are computed for all predefined common motifs (n=210,176) on long contigs only (>=100000 bp). This can take a while (>24h).
[2020-09-04 19:45:16] Read list of common motifs.
[2020-09-04 19:45:16] Prepare default metagenome annotation.
[2020-09-04 19:45:17] Load supplied current differences.
[2020-09-04 19:45:32] Load contigs coverage information.
[2020-09-04 19:45:32] Prepare subset of contigs (>=100000 bp).
[2020-09-04 19:45:33] Compute methylation features on subset of contigs.
[2020-09-04 19:45:33]     Initialize methylation feature computation.
[2020-09-04 19:45:50]     Processing motifs.
 Motifs processed (210165/210176): [======>] 100% eta:  4s (elapsed: 20:13:17)Error in { : task 136782 failed - "Column `motif` is unknown"
Calls: score.metagenome.motifs -> %dopar% -> <Anonymous>

How to choose the best chunks region for personal metagenome data

Hi,

I have run chunk_info on my personal reference metagenome and got 100813 chunks.

For the "computing current differences" command, which takes parameters to select a range of chunks, how should I choose the best range?

I have tried the following command to get the RDS file:
nanodisco difference -nj 50 -nc 50 -p 50 -f 1 -l 1000 -i analysis/preprocessed_sea9_subset -o analysis/difference_sea9_subset_1-1000 -w SEA_9_WGA -n SEA_9_NAT -r reference/seameta_9_wga_trimmed.fasta

After merging the difference files, the resulting RDS file is 1.6 MB, and it cannot be used for the discovery of methylation motifs.

I would like to know how to choose the best region for computing current differences.

Best,

Bruce
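
For context, the command in the post (-f 1 -l 1000) covers only the first 1000 of the 100813 chunks, roughly 1% of the reference. Assuming the goal is to cover the whole reference, one option is to split the chunk indices into consecutive batches and run one `nanodisco difference` call per batch. The sketch below only does the index arithmetic; the batch size of 1000 is an arbitrary assumption.

```python
def chunk_ranges(total_chunks, batch_size):
    """Yield inclusive (first, last) chunk indices covering 1..total_chunks."""
    if total_chunks < 1 or batch_size < 1:
        raise ValueError("total_chunks and batch_size must be positive")
    for first in range(1, total_chunks + 1, batch_size):
        yield first, min(first + batch_size - 1, total_chunks)


ranges = list(chunk_ranges(100813, 1000))
print(len(ranges))   # 101
print(ranges[0])     # (1, 1000)
print(ranges[-1])    # (100001, 100813)
```

Each pair maps directly onto the -f and -l values of one run, and the per-batch outputs can then be merged as usual.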

Choosing basecaller version

Hi! I am trying to do methylation binning - I see in your FAQ that my native DNA and WGA datasets should use the same basecaller and version. My datasets were generated at different times and basecalled with different versions - I can rebasecall them with the same version, but I only see a place to specify the fast5 files and not basecalled fastq files.

Is there a way to specify which basecalled fastqs I want to use, and to only rely on the fast5 for signal information?
Or, a way to make sure the fastqs used from the fast5s are the correct ones, because the same fast5 files may be basecalled twice and hold two sets of fastq data?

Thank you!

Error in { : task 1 failed - "object 'contig' not found"

Dear Alan,

preprocess worked but difference is throwing the error described in the title.

This is the command:

nanodisco difference -nj 4 -nc 1 -p 1 -f 10 -l 11 -i analysis/preprocessed_subset -o analysis/difference_subset -w b24 -n b23 -r reference/mergenome.fasta

And this is the standard output:
'''
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:2/0/100%/0.0s Error in { : task 1 failed - "object 'contig' not found"
Calls: %do% ->
Execution halted
local:1/1/100%/127.0s Error in { : task 1 failed - "object 'contig' not found"
Calls: %do% ->
Execution halted
local:0/2/100%/63.5s
'''

I am looking forward to getting your help to solve this issue.
Thank you!

Resource overutilization with nanodisco difference

Hi, I wanted to report an issue when running nanodisco difference, which appears related to #18. We have a large compute node on a SLURM cluster with 72 cores and 2TB memory. When allocating 24 cores and running nanodisco difference using Singularity as follows:

nanodisco difference \
  -nj 1 -nc 4 -p 4 -f 1 -l 4 \
  -i preprocessed \
  -o difference_subset \
  -w _WGA \
  -n sample6_NAT \
  -r sample6.fasta

At this point, the following output is generated:

[1] "2021-12-08 21:44:21 CST"
[1] "Processing chunk #1"
[1] "  Preparing 6_iAB4340006_iAB4260006_WGA input data for chunk #1"
[1] "  Preparing 6_iAB4340006_iAB4260006_NAT input data for chunk #1"
[1] "  Correcting mapping."
[1] "  Normalization."

At the Normalization step, I see four R processes running in parallel (which appears correct), but each process seems to be competing for all 24 cores in the SLURM allocation. This drives the server load up significantly and appears to slow computation down dramatically. I suspect the issue is in normalize.data.parallel, but nothing in particular stands out. We could possibly debug this using a Singularity build environment, but do you have any suggestions for a workaround?

I should add that the chunk does eventually finish and the other processes seem fine in terms of CPU utilization (memory isn't an issue); however, on most HPC systems I work on, cluster admins would kill these jobs. Any help would be greatly appreciated.

characterize

Hi,

I'm getting this error - any clarification would be appreciated. Thank you!

Rapid PCR Barcoding data as methylation free?

Hi Alan,

I've returned with another question. I know that you say that Nanodisco requires ONT-sequenced WGA DNA for the methylation-free dataset. However, would data generated with the Rapid PCR Barcoding kit (SQK-RPB004) serve the same function? I ask mainly because I'm lazy and the Rapid PCR Barcoding protocol seems easier for multiplexing. While we are on the topic, are there other kinds of methylation-free DNA that could be generated and used with Nanodisco?

Thanks for any insight you may have on this.

Mike

Query and confusions; Please help

Hi,
I am still new to Linux, GitHub, and coding. I read through the detailed Nanodisco documentation. I have already installed Singularity and nanodisco, and the installation works: I tested it with the example E. coli data.
The Q&A section of the documentation and your paper say that nanodisco needs Albacore 2.3.4, but the Guppy basecaller version is also mentioned in some places. I have Guppy 6.0.7 installed on my Linux system.

Should I use guppy 6.0.7 to basecall the data? What config file should I use for basecalling? (dna_r9.4.1_450bps_modbases_5mc_hac.cfg, dna_r9.4.1_450bps_modbases_5mc_hac_prom.cfg etc.)
If you suggest using Albacore, would you please send me a link to install the proper version? I see there are several Albacore repositories on GitHub, and I am not sure which one I should use.
I have multiplexed sequenced data for which I want to find the m5C and m6A methylation (methylation typing and fine typing).
I highly appreciate your help.
Thank you.

nanodisco motif p value

Hi,

I'm seeking help in this portion of the analysis: "motif: De novo discovery of methylation motifs from current differences file."
I'm getting an error in the statistical portion of the analysis. I think that either I have no p-values or there are no differences in methylation between my two sequences. Could this error mean that I have no methylation sites, or could I have made a mistake earlier so that the p-values were lost in the process?
Sorry if this is confusing, but I'd appreciate any insight on this issue.
Thank you

PS: I cannot confirm there are or aren't p-values in the RDS

Portion of the script from the nanodisco documents where I'm stuck: (Error is below)
nanodisco motif -p <nb_threads> -b <base_name> -d <path_difference> -o <path_output> -r <path_genome> [+ advanced parameters]
-p : Number of threads to use.
-b : Base name for outputting results (e.g. Ecoli_K12). Default is 'results'.
-d : Path to current differences file (*.RDS produced from nanodisco difference).
-o : Path to output directory. Default is current directory.
-r : Path to a reference genome (i.e. fasta).
-h : Print help.

Error:

[2021-08-30 13:16:35] Prepare output folder.
[2021-08-30 13:16:35] Load supplied current differences.
[2021-08-30 13:16:35] Detect motifs.
[2021-08-30 13:16:35] Processing statistical signal.
Error in eval(e, x, parent.frame()) : object 'p' not found
Calls: wrapper.motif.detection ... subset -> subset -> subset.data.frame -> eval -> eval
Execution halted

Tmpdir not found with difference

Hi,

I would like to test nanodisco, but I have this problem on our HPC with nanodisco difference:

parallel: Error: Tmpdir '/localscratch/lcornet/68933322' does not exist.
parallel: Error: Try 'mkdir /localscratch/lcornet/68933322'

#log
$ singularity exec nanodisco.sif nanodisco preprocess -p 20 -f ULC307-fast5/ -s ULC307 -o analysis/ULC307 -r pilon/pilon.fasta
$ singularity exec nanodisco.sif nanodisco chunk_info -r pilon/pilon.fasta
Number of chunks: 6289
$ singularity exec nanodisco.sif nanodisco difference -nj 1 -nc 20 -p 20 -f 1 -l 6289 -i analysis/ULC307 -o analysis/difference_subset -w ULC307 -n ULC307 -r pilon/pilon.fasta #To run on one node but 20 chunks, each 1 cpu

Could you help me?
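
The `parallel:` prefix suggests that the message comes from GNU parallel, which by default writes its temporary files under `$TMPDIR`. A possible workaround, sketched below under that assumption, is to make sure `TMPDIR` points at a directory that exists before launching the command. The helper name and the fallback path are hypothetical, and whether the directory is visible inside the container also depends on the Singularity bind mounts, which this sketch does not handle.

```python
import os
from pathlib import Path


def env_with_valid_tmpdir(fallback, env=None):
    """Return a copy of env whose TMPDIR points at an existing directory."""
    env = dict(os.environ if env is None else env)
    current = env.get("TMPDIR")
    if not current or not Path(current).is_dir():
        # Unset or stale TMPDIR (e.g. a scheduler scratch path that was never
        # created): create the fallback directory and use it instead.
        Path(fallback).mkdir(parents=True, exist_ok=True)
        env["TMPDIR"] = str(fallback)
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting the `singularity exec ... nanodisco difference` command from a wrapper script.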

multiplexed libraries with multiple taxa

It seems that nanodisco assumes that the fast5 files all contain sequence data from the same reference (e.g., nanodisco preprocess -r <reference_genome>). What about instances where multiple strains/species/genera/etc are multiplexed on the same nanopore run? I would think that multiplexing would be common for modified base identification, given that sequencing just one microbial genome per nanopore run is generally a waste of resources.

Is it because of a problem with the basecalling software?