Home Page: https://a-ludi.github.io/dentist/

🚧 Maintenance Update: 🚧

This software project is presently not under active maintenance. Users are advised that there won't be regular updates or bug fixes.

We welcome any interested individuals to consider taking up the role of a new maintainer for the project. Feel free to express your interest or fork the project to continue its development.

Thank you for your understanding.


DENTIST

DENTIST uses long reads to close assembly gaps at high accuracy.

Long sequencing reads allow increasing the contiguity and completeness of fragmented, short-read-based genome assemblies by closing assembly gaps, ideally at high accuracy. DENTIST is a sensitive, highly accurate and automated pipeline that closes gaps in (short-read) assemblies with long reads.

API documentation: current, v4.0.0, v3.0.0, v2.0.0

First time here? Head over to the example and make sure it works.

Table of Contents

Install

Use Conda via Snakemake (recommended)

Make sure Mamba (a frontend for Conda) is installed on your system. You can then use DENTIST like so:

# run the whole workflow locally using Conda
snakemake --configfile=snakemake.yml --use-conda -jall

# or submit it to a SLURM cluster
snakemake --configfile=snakemake.yml --use-conda --profile=slurm

The last command is explained in more detail below in the usage section.

Note: If you do not have mamba installed, you may need to pass --conda-frontend=conda to Snakemake.

Use Conda to Manually Install Binaries

Make sure Mamba (a frontend for Conda) is installed on your system. Install DENTIST and all dependencies like so:

mamba create -n dentist -c a_ludi -c bioconda dentist-core
mamba activate dentist
mamba install -c conda-forge -c bioconda snakemake

# execute the workflow
snakemake --configfile=snakemake.yml --cores=all

More details on executing DENTIST can be found in the usage section.

Use Pre-Built Binaries

Download the latest pre-built binaries from the releases section and extract the contents. The pre-built binaries are stored in a subfolder called bin. Here are the instructions for v4.0.0:

# download & extract pre-built binaries
wget https://github.com/a-ludi/dentist/releases/download/v4.0.0/dentist.v4.0.0.x86_64.tar.gz
tar -xzf dentist.v4.0.0.x86_64.tar.gz

# make binaries available to your shell
cd dentist.v4.0.0.x86_64
PATH="$PWD/bin:$PATH"

# check installation with
dentist -d
# Expected output:
# 
#daligner (part of `DALIGNER`; see https://github.com/thegenemyers/DALIGNER) [OK]
#damapper (part of `DAMAPPER`; see https://github.com/thegenemyers/DAMAPPER) [OK]
#DAScover (part of `DASCRUBBER`; see https://github.com/thegenemyers/DASCRUBBER) [OK]
#DASqv (part of `DASCRUBBER`; see https://github.com/thegenemyers/DASCRUBBER) [OK]
#DBdump (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#DBdust (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#DBrm (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#DBshow (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#DBsplit (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#fasta2DAM (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#fasta2DB (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#computeintrinsicqv (part of `daccord`; see https://gitlab.com/german.tischler/daccord) [OK]
#daccord (part of `daccord`; see https://gitlab.com/german.tischler/daccord) [OK]

The tarball additionally contains the Snakemake workflow, example config files and this README. In short, everything you need to run DENTIST.

Use a Singularity Container (discouraged)

Remark: the Singularity container may not work properly depending on your system. (see issue #30)

Make sure Singularity is installed on your system. You can then use the container like so:

# launch an interactive shell
singularity shell docker://aludi/dentist:stable

# execute a single command inside the container
singularity exec docker://aludi/dentist:stable dentist --version

# run the whole workflow on a cluster using Singularity
snakemake --configfile=snakemake.yml --use-singularity --profile=slurm

The last command is explained in more detail below in the usage section.

Build from Source

  1. Install the D package manager DUB.
  2. Install JQ 1.6.
  3. Build DENTIST using either
    dub install dentist
    or
    git clone --recurse-submodules https://github.com/a-ludi/dentist.git
    cd dentist
    dub build

Runtime Dependencies

The following software packages are required to run dentist:

  • The Dazzler Data Base (>=2020-07-27)

Manage sequences (reads and assemblies) in 2-bit encoding alongside auxiliary information such as masks or QV tracks

  • DALIGNER (=2020-01-15)

    Find significant local alignments.

  • DAMAPPER (>=2020-03-22)

    Find alignment chains, i.e. sequences of significant local alignments possibly with unaligned gaps.

  • DAMASKER (>=2020-01-15)

    Discover tandem repeats.

  • DASCRUBBER (>=2020-07-26)

    Estimate coverage and compute QVs.

  • daccord (>=v0.0.18)

    Compute reference-based consensus sequence for gap filling.

Please see their own documentation for installation instructions. Note: the packages available on Bioconda are currently outdated and should not be used; up-to-date builds are available via conda install -c a_ludi <dependency>.

Please use the exact versions specified in the Conda recipe if you experience trouble.

Usage

Before you start producing wonderful scientific results, you should skip over to the example section and try to run the small example. This will make sure your setup is working as expected.

Quick execution with Snakemake

TL;DR

wget https://github.com/a-ludi/dentist/releases/download/v4.0.0/dentist.v4.0.0.x86_64.tar.gz
tar -xzf dentist.v4.0.0.x86_64.tar.gz
cd dentist.v4.0.0.x86_64

# edit dentist.yml and snakemake.yml

# execute with CONDA:
snakemake --configfile=snakemake.yml --use-conda

# execute with SINGULARITY:
snakemake --configfile=snakemake.yml --use-singularity

# execute with pre-built binaries:
PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml

Install Snakemake version >=5.32.1 and prepare your working directory:

wget https://github.com/a-ludi/dentist/releases/download/v4.0.0/dentist.v4.0.0.x86_64.tar.gz
tar -xzf dentist.v4.0.0.x86_64.tar.gz

cp -r -t . \
    dentist.v4.0.0.x86_64/snakemake/dentist.yml \
    dentist.v4.0.0.x86_64/snakemake/Snakefile \
    dentist.v4.0.0.x86_64/snakemake/snakemake.yml \
    dentist.v4.0.0.x86_64/snakemake/envs \
    dentist.v4.0.0.x86_64/snakemake/scripts

Next, edit snakemake.yml and dentist.yml to fit your needs, and optionally test your configuration with

# see above for variants with pre-built binaries or Singularity
snakemake --configfile=snakemake.yml --use-conda --cores=1 -f -- validate_dentist_config

If no errors occur, the whole workflow can be executed using

# see above for variants with pre-built binaries or Singularity
snakemake --configfile=snakemake.yml --use-conda --cores=all

For small genomes of a few hundred Mbp this should run on a regular workstation. Use Snakemake's --cores to run independent jobs in parallel. Larger data sets may require a cluster, in which case you can use Snakemake's cloud or cluster facilities.

Executing on a Cluster

Please follow the setup steps from above except for the actual execution.

To make execution on a cluster easy, DENTIST comes with example files, found under snakemake/, that make Snakemake use SLURM via DRMAA, sbatch or srun. If your cluster does not use SLURM, please modify the profiles to suit your needs or read the documentation of Snakemake. Another good starting point is the Snakemake-Profiles project.

After you have selected an appropriate cluster profile, make it available to Snakemake, e.g.:

# choose appropriate file from `snakemake/profile-slurm.*.yml`
mkdir -p ~/.config/snakemake/slurm
cp ./snakemake/profile-slurm.submit-async.yml ~/.config/snakemake/slurm/config.yaml

Adjust the profile according to your cluster, e.g. you may need to specify accounting information. Values defined in cluster.yml can be used in the profile as demonstrated in the examples. This file is also the place to modify resource allocations and job names.

Now, you can execute the workflow like this:

snakemake --configfile=snakemake.yml --profile=slurm --use-conda

Snakemake will now start submitting jobs to your cluster until all the work is done. If something fails, you can execute the same command again to continue from the latest state of the workflow.

Manual execution

Please inspect the Snakemake workflow to get all the details. It might be useful to execute Snakemake with the -p switch, which causes Snakemake to print the shell commands. If you plan to write your own workflow management for DENTIST, please feel free to contact the maintainer!

Example

Make sure you have Snakemake 5.32.1 or later installed.

You can also use the convenient Conda package to execute the rules. Just make sure you have Mamba installed.

First of all download the test data and workflow and switch to the dentist-example directory.

wget https://github.com/a-ludi/dentist/releases/download/v4.0.0/dentist-example.tar.gz
tar -xzf dentist-example.tar.gz
cd dentist-example

Local Execution

Execute the entire workflow on your local machine using all cores:

# run the workflow
PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --cores=all

# validate the files
md5sum -c checksum.md5

Execution takes approx. 7 minutes and a maximum of 1.7GB memory on my little laptop with an Intel® Core™ i5-5200U CPU @ 2.20GHz.

Execution with Conda

Make sure Mamba (a frontend for Conda) is installed on your system. Execute the workflow without explicit installation by adding --use-conda to the call to Snakemake:

# run the workflow
snakemake --configfile=snakemake.yml --use-conda --cores=all

# validate the files
md5sum -c checksum.md5

Note: If you do not have mamba installed, you may need to pass --conda-frontend=conda to Snakemake.

Execution in Singularity Container (discouraged)

Remark: the Singularity container may not work properly depending on your system. (see issue #30)

Execute the workflow inside a convenient Singularity image by adding --use-singularity to the call to Snakemake:

# run the workflow
snakemake --configfile=snakemake.yml --use-singularity --cores=all

# validate the files
md5sum -c checksum.md5

Cluster Execution

Please follow the instructions "Executing on a Cluster" above.

Configuration

DENTIST comprises a complex pipeline with many options for tweaking. This section points out some important parameters and their effect on the result or performance.

The default parameters are rather conservative, i.e. they focus on correctness of the result while not sacrificing too much sensitivity.

We also provide a greedy sample configuration (snakemake/dentist.greedy.yml) which focuses on sensitivity but may introduce more errors. Warning: Use with care! Always validate the closed gaps (e.g. manual inspection).

In any case, the workflow creates an intermediate assembly workdir/{output_assembly}-preliminary.fasta that contains all closed gaps, i.e. before validation. It is accompanied by an AGP and a BED file. You may inspect these files for maximum sensitivity.

How to Choose DENTIST Parameters

While the list of all command-line parameters is a good reference, it does not provide an overview of the important parameters. Therefore, we provide this shorter list of important and influential parameters. Please also consider adjusting the performance parameters in the workflow configuration (snakemake/snakemake.yml).

  • --read-coverage: This is the preferred way of providing values to --max-coverage-reads, --max-improper-coverage-reads and --min-coverage-reads. See below how their values are derived from --read-coverage.

    Ideally, the user provides the haploid read coverage, which can be inferred using a histogram of the alignment coverage across the assembly. Alternatively, the average raw read coverage can be used, which is the number of base pairs in the reads divided by the number of base pairs in the assembly.

  • --ploidy: Combined with --read-coverage, this parameter is the preferred way of providing --min-coverage-reads.

    We use the Wikipedia definition of ploidy, as "the number of complete sets of chromosomes in a cell" (https://en.wikipedia.org/wiki/Ploidy)

  • --max-coverage-reads, --max-improper-coverage-reads: These parameters are used to derive a repeat mask from the ref vs. reads alignment. If the coverage of (improper) alignments is larger than the given threshold, the region is considered repetitive. Unless supplied explicitly, default values are derived from --read-coverage as follows:

    The maximum read coverage C_max is calculated from the global read coverage C (provided via --read-coverage) such that the probability of observing more than C_max alignments in a unique (non-repetitive) genomic region is very small (see paper, Methods section and Supplementary Table 2). In practice, this probability is approximated via

    C_max = floor(C / log(log(log(b * C + c) / log(a))))
    where
        a = 1.65
        b = 0.1650612
        c = 5.9354533
    

    To further increase the sensitivity, DENTIST searches for smaller repeat-induced local alignments. To this end, we define an alignment as proper if there are at most 100 bp (adjustable via --proper-alignment-allowance) of unaligned sequence on either end of the read. All other alignments, where only a smaller substring of the read aligns, are called improper. Improper alignments are often indicative of repetitive regions. Therefore, DENTIST considers genomic regions where the number of improper read alignments exceeds a threshold to be repetitive. By default, this threshold equals half the global read coverage C (see paper, Methods section). In practice, a smoothed version of max(4, C/2) is used to provide better performance at very low read coverage. The maximum improper read coverage I_max is computed as

    I_max = floor(a*x + exp(b*(c - x)))
    where
        a = 0.5
        b = 0.1875
        c = 8
    
  • --dust-{reads,ref}, --daligner-{consensus,reads-vs-reads,self}, --damapper-ref-vs-reads, --datander-ref, --daccord:
    These options allow passing parameters to the respective tools. They may have dramatic influence on the result. The default settings work well for PacBio CLR reads and should also work well with raw Nanopore data.

    In-depth discussion of each tool goes beyond the scope of this document, please refer to the respective documentations (DBdust, daligner, damapper, datander, daccord).

  • --max-insertion-error: Strong influence on quality and sensitivity. Lower values lead to lower sensitivity but higher quality. The maximum recommended value is 0.05.

  • --min-anchor-length: Higher values result in higher accuracy but lower sensitivity. In particular, large gaps cannot be closed if the value is too high. Usually, the value should be at least 500 and up to 10_000.

  • --min-reads-per-pile-up: Choosing higher values for the minimum number of reads drastically reduces sensitivity but has little effect on the quality. Small values may be chosen to get the maximum sensitivity in de novo assemblies. Make sure to thoroughly validate the results though.

  • --min-spanning-reads: Higher values give more confidence on the correctness of closed gaps but reduce sensitivity. The value must be well below the expected coverage.

  • --allow-single-reads: May be used under careful consideration in combination with --min-spanning-reads=1. This is intended for one of the following scenarios:

    1. DENTIST is meant to close as many gaps as possible in a de novo assembly. Then the closed gaps must be validated by other means afterwards.
    2. DENTIST is used not with real reads but with an independent assembly.
  • --existing-gap-bonus: If DENTIST finds evidence to join two contigs that are already consecutive in the input assembly (i.e. joined by Ns), that join will be preferred over conflicting joins (if present) by this bonus. The default value is rather conservative, i.e. the preferred join almost always wins over other joins in case of a conflict.

  • --join-policy: Choose according to your needs:

    • scaffoldGaps: Closes only gaps that are marked by Ns in the assembly. This is the default mode of operation. Use this if you do not want to alter the scaffolding of the assembly. See also --existing-gap-bonus.
    • scaffolds: Allows whole scaffolds to be joined in addition to the effects of scaffoldGaps. Use this if you have (many) scaffolds that are not yet full chromosome-scale.
    • contigs: Allows contigs to be rearranged freely. This is especially useful in de novo assemblies before applying any other scaffolding methods as it increases the contiguity thus increasing the chance that large-scale scaffolding (e.g. Bionano or Hi-C) finds proper joins.
  • --min-coverage-reads, --min-spanning-reads, --region-context: DENTIST validates closed gaps by mapping the reads to the gap-closed assembly. It requires that each gap, plus --region-context base pairs up- and downstream of it, is (1) covered by at least --min-coverage-reads reads at every position and (2) spanned by at least --min-spanning-reads reads. Increasing any of these numbers makes the validated gaps more robust but may reduce their number.

    If --min-coverage-reads is not provided, it will be derived from --read-coverage (see above) and --ploidy. Given (haploid) read coverage C and ploidy p, the minimum read coverage C_min is calculated as

      C_min = C / (2 * p)
    

    This corresponds to 50% of the long read coverage expected to be sequenced from a haploid locus (see paper, Methods section).
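The derived coverage thresholds above (C_max, I_max and C_min) can be sketched in a few lines of Python. This is a transcription of the formulas as stated here, not DENTIST's internal code; the sample coverage and ploidy values are assumptions for illustration.

```python
import math

def max_read_coverage(C, a=1.65, b=0.1650612, c=5.9354533):
    """C_max: repeat threshold derived from the global read coverage C."""
    return math.floor(C / math.log(math.log(math.log(b * C + c) / math.log(a))))

def max_improper_coverage(C, a=0.5, b=0.1875, c=8):
    """I_max: smoothed version of max(4, C/2) for improper alignments."""
    return math.floor(a * C + math.exp(b * (c - C)))

def min_read_coverage(C, ploidy):
    """C_min: 50% of the coverage expected from a haploid locus."""
    return C / (2 * ploidy)

# e.g. assuming 25x haploid read coverage of a diploid genome:
C, ploidy = 25, 2
print(max_read_coverage(C), max_improper_coverage(C), min_read_coverage(C, ploidy))
```

Note how I_max stays near 4 for very low coverage and approaches C/2 as the exponential term vanishes for larger C.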

Choosing the Read Type

The examples assume PacBio long reads, but DENTIST can be run with any kind of long reads; currently, this means either PacBio or Oxford Nanopore reads. For non-PacBio reads, reads_type in snakemake.yml must be set to anything other than PACBIO_SMRT. The recommendation is to use OXFORD_NANOPORE for Oxford Nanopore reads. These names are borrowed from the NCBI. Further details on the rationale can be found in this issue.
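For example, switching to Nanopore reads would look like this in snakemake.yml (a minimal fragment; all surrounding keys omitted):

```yml
# snakemake.yml (fragment)
reads_type: "OXFORD_NANOPORE"   # default: "PACBIO_SMRT"
```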

Cluster/Cloud Execution

Cluster job schedulers can become unresponsive or even crash if too many short-running jobs are submitted. It is therefore advisable to adjust the workflow accordingly. We tried to provide a default configuration that works as-is in most cases, but application scenarios are diverse and manual adjustments may become necessary. Here is a small guide to which config parameters influence the number of jobs and how many resources they consume.

  • threads_per_process: Sets the maximum number of threads/cores a single job may use. A single-threaded job will always allocate a single core but thread-parallel steps, e.g. the sequence alignments, will use up to threads_per_process if snakemake has been provided enough cores via --cores.
  • -s<block_size:uint>: The assembly and reads FASTA files are converted into Dazzler DBs. These DBs store the sequences in a 2-bit encoding and have additional features like tracks (similar to BED files). They are also split into blocks of <block_size> Mb. Alignments are calculated on the basis of these blocks, which enables easy distribution onto the cluster. The larger the block size, the longer the alignment jobs run and the more memory they require, but the number of jobs is also reduced. Experience shows that the block size should be between 200 Mb and 500 Mb.
  • propagate_batch_size: The repeat masks are homogenized by propagating them from the assembly to the reads and back again. Usually these jobs are very short because the propagation is parallelized over the blocks of the reads DB. To reduce the number of jobs both propagation directions are grouped together and submitted in batches of propagate_batch_size read blocks. Increasing propagate_batch_size reduces the number of submitted jobs and increases the run time per job. It has no effect on the memory requirements.
  • batch_size: In the collect step, DENTIST identifies candidates for gap closing, each consisting of a pile up of reads. From these pile ups, consensus sequences are computed and validated in the process step. Each job processes batch_size pile ups. Increasing batch_size reduces the number of submitted jobs and increases the run time per job. It has no effect on the memory requirements.
  • validation_blocks: The preliminarily closed gaps are validated by analyzing how the reads align to each closed gap. The validation is conducted in independent jobs for validation_blocks many blocks of the gap-closed assembly. Decreasing validation_blocks reduces the number of submitted jobs and increases the run time and memory requirements per job. The memory requirement is proportional to the size of the read alignment blocks.
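As a rough illustration of how block_size drives job counts, consider the back-of-the-envelope arithmetic below. The genome and read volumes are assumptions, and the naive block-pair count is only indicative; the actual workflow batches jobs and may differ.

```python
import math

# assumed inputs for illustration only
genome_mbp = 3000    # assembly size in Mbp
read_mbp = 75000     # total read bases in Mbp (~25x coverage)
block_size = 300     # -s300: 300 Mbp per Dazzler DB block

asm_blocks = math.ceil(genome_mbp / block_size)
read_blocks = math.ceil(read_mbp / block_size)

# Block-vs-block alignment work scales with the product of block counts,
# so doubling block_size roughly quarters the job count (at the cost of
# higher per-job run time and memory):
ref_vs_reads_jobs = asm_blocks * read_blocks
print(asm_blocks, read_blocks, ref_vs_reads_jobs)
```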

Troubleshooting

Regular ProtectedOutputException

Snakemake has a built-in facility to protect files from accidental overwrites. This is meant to avoid overwriting precious results that took many CPU hours to produce. If executing a rule would overwrite a protected file, Snakemake raises a ProtectedOutputException, e.g.:

ProtectedOutputException in line 1236 of /tmp/dentist-example/Snakefile:
Write-protected output files for rule collect:
workdir/pile-ups.db
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 136, in run_jobs
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 441, in run
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 230, in _run
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 155, in _run

Here, workdir/pile-ups.db is the protected file that caused the error. If you are sure of what you are doing, you can lift the protection with chmod -R +w ./workdir and execute Snakemake again. It will then overwrite the files.

No internet connection on compute nodes

If you have no internet connection on your compute nodes, or not even on the cluster head node, and want to use Singularity for execution, you will need to download the container image manually and put it in a location accessible to all jobs. Assume /path/to/dir is such a location on your cluster. Then download the container image using

# IF internet connection on head node
singularity pull --dir /path/to/dir docker://aludi/dentist:stable

# ELSE (on local machine)
singularity pull docker://aludi/dentist:stable
# copy dentist_stable.sif to cluster
scp dentist_stable.sif cluster:/path/to/dir/dentist_stable.sif

When the image is in place you will need to adjust your configuration in snakemake.yml:

dentist_container: "/path/to/dir/dentist_stable.sif"

Now, you are ready for execution.

Note: if you want to use Conda without an internet connection, you can use the pre-built binaries instead; they are the same binaries Conda would install. Be sure to adjust your PATH accordingly, e.g.:

PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --profile=slurm

Illegally formatted line from DBshow -n

This error message may appear in DENTIST's log files. It is a known bug that will be fixed in a future release. In the meantime, avoid FASTA headers that contain a literal " :: ".
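Until the bug is fixed, a quick way to screen your input for offending headers is a check like the one below (a simple sketch assuming plain FASTA with one-line headers; the function name is made up for illustration):

```python
def headers_with_separator(fasta_lines):
    """Return FASTA header lines containing the literal ' :: '."""
    return [line.rstrip() for line in fasta_lines
            if line.startswith(">") and " :: " in line]

# e.g. on an in-memory example (read your FASTA file line by line in practice):
example = [">scaffold_1 :: note\n", "ACGT\n", ">scaffold_2\n", "ACGT\n"]
print(headers_with_separator(example))
```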

Citation

Arne Ludwig, Martin Pippel, Gene Myers, Michael Hiller. DENTIST — using long reads for closing assembly gaps at high accuracy. GigaScience, Volume 11, 2022, giab100. https://doi.org/10.1093/gigascience/giab100

Maintainer

DENTIST is being developed by Arne Ludwig <[email protected]> at the Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany.

Contributing

Contributions are warmly welcome. Just create an issue or pull request on GitHub. If you submit a pull request please make sure that:

  • the code compiles on Linux using the current release of dmd,
  • your code is covered with unit tests (if feasible) and
  • dub test runs successfully.

It is recommended to install the Git hooks included in the repository to avoid premature pull requests. You can enable all shipped hooks with this command:

git config --local core.hooksPath .githooks/

If you want to enable just a subset, use ln -s .githooks/{hook} .git/hooks. If you want to audit code changes before they get executed on your machine, copy the hooks with cp .githooks/{hook} .git/hooks instead.

License

This project is licensed under MIT License (see LICENSE).

Issues

Broken symlinks and/or missing files from data test set

It seems that there are files missing from the example dataset at https://github.com/a-ludi/dentist-example/releases/download/v1.0.2-2/dentist-example.tar.gz.

Although there are mentions of this in some of the other issue threads, I don't see an open issue thread that is specifically about the missing files.

I downloaded your test data set today 9th June 2021:

wget https://github.com/a-ludi/dentist-example/releases/download/v1.0.2-2/dentist-example.tar.gz
tar -xzf dentist-example.tar.gz
cd dentist-example

However, several symbolic links point to a folder called ./dentist-example/ that is missing from the tarball:

~/dentist-example$ ll
total 111M
-rw-r--r-- 1 ubuntu djs217 2.3K Apr 28 04:40 checksum.md5
lrwxrwxrwx 1 ubuntu djs217   54 Apr 26 08:30 cluster.yml -> dentist-example/external/dentist/snakemake/cluster.yml
drwxrwxr-x 2 ubuntu djs217 4.0K Jun  9 13:39 data
-rw-r--r-- 1 ubuntu djs217 1.3K Feb 18 14:42 dentist.json
-rwxr-xr-x 1 ubuntu djs217 111M Apr 27 08:52 dentist_v1.0.2.sif
lrwxrwxrwx 1 ubuntu djs217   66 Apr 26 08:29 profile-slurm.drmaa.yml -> dentist-example/external/dentist/snakemake/profile-slurm.drmaa.yml
lrwxrwxrwx 1 ubuntu djs217   73 Apr 26 08:29 profile-slurm.submit-async.yml -> dentist-example/external/dentist/snakemake/profile-slurm.submit-async.yml
lrwxrwxrwx 1 ubuntu djs217   72 Apr 26 08:29 profile-slurm.submit-sync.yml -> dentist-example/external/dentist/snakemake/profile-slurm.submit-sync.yml
-rw-r--r-- 1 ubuntu djs217 4.5K Apr 27 08:54 README.md
lrwxrwxrwx 1 ubuntu djs217   52 Apr 26 08:29 Snakefile -> dentist-example/external/dentist/snakemake/Snakefile
-rw-r--r-- 1 ubuntu djs217 3.8K Apr 28 04:40 snakemake.yml

Many thanks,

David

Dentist parameters

Hi,
I managed to run the pipeline but now I am having a closer look at the options.
My goal is to be conservative with this: I am happy to fill gaps when there is strong evidence for it, but I don't want to mess up the assembly because it is pretty good overall, I think.

From what I gathered from the GitHub page, --read-coverage combined with the ploidy is one very important option. Do you have any recommendation about how to obtain the coverage value? Are there recommended tools for this?

I am thinking I should not play much with the other parameters as I have CLR data.

Any advice on this ?

Best
Aurélien

File locking blocks indefinitely in `writePileUps`

Hi Arne,

[...] The job seems to be stuck in “dentist collect”. The file “workdir/pile-ups.db” was created but it is empty. The node that it is running on shows that 71G of memory is used and 52G available.

Here are the last entries in the “collect.log” file.

{"thread":140737354013504,"timestamp":637255135504311334,"numPileUps":260,"numAlignmentChains":3186}
{"thread":140737354013504,"timestamp":637255135504323016,"state":"exit","function":"dentist.commands.collectPileUps.PileUpCollector.buildPileUps","timeElapsed":25742682}
{"thread":140737354013504,"timestamp":637255135504332559,"state":"enter","function":"dentist.commands.collectPileUps.PileUpCollector.writePileUps"}

Do you have any ideas? Could there be a problem with the pipeline before it hit dentist collect? Watching it run is a thing of beauty!

Regards,

Randy

Originally posted by @BradleyRan in #3 (comment)

FATAL: Unable to handle docker://aludi/dentist:v1.0.1 uri

Hey Ludi,

I hope you are ok.
I work at the Sanger in the Darwin Tree of Life project and Gene suggested I try your tool to close a few assembly gaps.
When I run the test on the command line it finishes OK, but when I change to send it to LSF I get an error concerning the version of LAsort. Could you please have a look:

[Tue Apr 13 11:06:22 2021]
Error in rule mask_dust:
    jobid: 15
    output: workdir/.assembly-test.dust.anno, workdir/.assembly-test.dust.data
    shell:
        DBdust workdir/assembly-test
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: 179164 logs/cluster/mask_dust/dam=assembly-test/jobid15_fba32544-ef79-47b9-a8e5-0b95ca02bd59.out
Error executing rule mask_dust on cluster (jobid: 15, external: 179164 logs/cluster/mask_dust/dam=assembly-test/jobid15_fba32544-ef79-47b9-a8e5-0b95ca02bd59.out, jobscript: /lustre/scratch116/vr/projects/vgp/user/mu2/dentist/dentist-example2/dentist-example/.snakemake/tmp.4lrb2f3l/snakejob.mask_dust.15.sh). For error details see the cluster log and the log files of the involved rule(s).
[Tue Apr 13 11:06:22 2021]
Error in rule tandem_alignment_block:
    jobid: 18
    output: workdir/TAN.assembly-test.1.las
    log: logs/tandem-alignment.assembly-test.1.log (check log file(s) for error message)
    shell:
            {
                cd workdir
                datander '-T8' -s126 -l500 -e0.7 assembly-test.1
                LAcheck -v assembly-test TAN.assembly-test.1.las || { echo 'Check failed. Possible solutions:
Duplicate LAs: can be fixed by LAsort from 2020-03-22 or later.
In order to ignore checks entirely you may use the environment variable SKIP_LACHECK=1. Use only if you are positive the files are in fact OK!'; (( ${SKIP_LACHECK:-0} != 0 )); }
            } &> logs/tandem-alignment.assembly-test.1.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: 179165 logs/cluster/tandem_alignment_block/dam=assembly-test.block=1/jobid18_8a60e75f-99ad-4021-bb14-0559b3bd4dc0.out
Error executing rule tandem_alignment_block on cluster (jobid: 18, external: 179165 logs/cluster/tandem_alignment_block/dam=assembly-test.block=1/jobid18_8a60e75f-99ad-4021-bb14-0559b3bd4dc0.out, jobscript: /lustre/scratch116/vr/projects/vgp/user/mu2/dentist/dentist-example2/dentist-example/.snakemake/tmp.4lrb2f3l/snakejob.tandem_alignment_block.18.sh). For error details see the cluster log and the log files of the involved rule(s).

I tried exporting the variable but got the same error.
Could you help me?
Thank you.
Marcela.

Missing half of output files from test run

Dear Arne,

I'm having some trouble getting dentist to run on our university cluster with the example dataset. I haven't used snakemake before, so please excuse if I missed something very basic. Here is what I've tried so far:

The symlinks to the .yml configuration files in the v.1.0.2-2 example folder appear broken on my system, and v.1.0.1 only has a drmaa config file. I thus took the ones from the pre-built binaries and adjusted them as follows:

  • Fixed some inconsistencies with file endings .yml / .yaml
  • Added -A PROJECT_NAME to the profile-slurm.submit-async.yml file (requirement on this cluster) and copied to $HOME/.config/snakemake/slurm/config.yaml
  • Changed partition from batch to the actual cluster partition name in the cluster.yml file
  • Changed file paths and increased max thread number to 32 in the snakemake.yml file
  • Changed read coverage and ploidy in the dentist.json file according to the one in v.1.0.1

Running the pipeline using snakemake v.6.4.0 and singularity v.3.7.1 finished in ~5.5 h without any obvious errors in the log files. However, there is no output, and md5sum -c checksum.md5 shows half of the files as missing:

gap-closed.fasta: FAILED open or read
workdir/.assembly-test.bps: OK
workdir/.assembly-test.dentist-reads.anno: OK
workdir/.assembly-test.dentist-reads.data: OK
workdir/.assembly-test.dentist-self.anno: OK
workdir/.assembly-test.dentist-self.data: OK
workdir/.assembly-test.dust.anno: OK
workdir/.assembly-test.dust.data: OK
workdir/.assembly-test.hdr: OK
workdir/.assembly-test.idx: OK
workdir/.assembly-test.tan.anno: OK
workdir/.assembly-test.tan.data: OK
md5sum: workdir/.gap-closed-preliminary.bps: No such file or directory
workdir/.gap-closed-preliminary.bps: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.dentist-self.anno: No such file or directory
workdir/.gap-closed-preliminary.dentist-self.anno: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.dentist-self.data: No such file or directory
workdir/.gap-closed-preliminary.dentist-self.data: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.dentist-weak-coverage.anno: No such file or directory
workdir/.gap-closed-preliminary.dentist-weak-coverage.anno: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.dentist-weak-coverage.data: No such file or directory
workdir/.gap-closed-preliminary.dentist-weak-coverage.data: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.dust.anno: No such file or directory
workdir/.gap-closed-preliminary.dust.anno: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.dust.data: No such file or directory
workdir/.gap-closed-preliminary.dust.data: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.hdr: No such file or directory
workdir/.gap-closed-preliminary.hdr: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.idx: No such file or directory
workdir/.gap-closed-preliminary.idx: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.tan.anno: No such file or directory
workdir/.gap-closed-preliminary.tan.anno: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.tan.data: No such file or directory
workdir/.gap-closed-preliminary.tan.data: FAILED open or read
workdir/.reads.bps: OK
workdir/.reads.idx: OK
workdir/assembly-test.assembly-test.las: OK
workdir/assembly-test.dam: OK
workdir/assembly-test.reads.las: OK
md5sum: workdir/gap-closed-preliminary.dam: No such file or directory
workdir/gap-closed-preliminary.dam: FAILED open or read
md5sum: workdir/gap-closed-preliminary.fasta: No such file or directory
workdir/gap-closed-preliminary.fasta: FAILED open or read
md5sum: workdir/gap-closed-preliminary.gap-closed-preliminary.las: No such file or directory
workdir/gap-closed-preliminary.gap-closed-preliminary.las: FAILED open or read
md5sum: workdir/gap-closed-preliminary.reads.las: No such file or directory
workdir/gap-closed-preliminary.reads.las: FAILED open or read
workdir/reads.db: OK
md5sum: WARNING: 16 listed files could not be read

Here some system information:

Distributor ID:	Scientific
Description:	Scientific Linux release 7.9 (Nitrogen)
Release:	7.9
Codename:	Nitrogen

And here the slurm submission script:


#SBATCH -J gorteria_dentist
#SBATCH -A PROJECT_NAME
#SBATCH --output=dentist_%A.out
#SBATCH --error=dentist_%A.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=6:00:00
#SBATCH --mail-type=ALL
#SBATCH -p PARTITION


. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load use.own
module load dentist
module load mambaforge
source activate /PATH/mambaforge/envs/snakemake

snakemake --configfile=snakemake.yaml --use-singularity --profile=slurm

Thank you very much for your help!
Roman
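As a side note, when scanning a long checksum report like the one above, filtering out the passing entries makes the failures easier to spot. A small convenience sketch (assuming `checksum.md5` is the checksum file shipped with the example, in the current directory):

```shell
# Show only entries that did not verify. md5sum prints ": OK" for
# passing files, which we filter out; `|| true` keeps the pipeline
# alive under bash strict mode, since md5sum -c exits non-zero when
# anything fails.
{ md5sum -c checksum.md5 || true; } 2>&1 | grep -v ': OK$'
```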

MissingOutputException error

Hi,
I am using HiFi data to fill the gaps in a scaffolded assembly and I get the error below. I changed the latency from 5 s to 15 s.
I am running it from a non-singularity source.
MissingOutputException in line 984 of /scratchdata/shripathi/dentist.v2.0.0.x86_64/Snakefile:
Job Missing files after 15 seconds:
workdir/assembly.assembly.las
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 13 completed successfully, but some output files are missing. 13
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!
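For context, `--latency-wait` only changes how long Snakemake polls for output files before declaring them missing, which papers over NFS visibility delays but not over outputs that were never written. The mechanism amounts to a wait loop like this runnable toy (assumptions: `demo.las` is a placeholder file name, and the 1 s background delay stands in for filesystem latency):

```shell
# Simulate a job whose output becomes visible only after a delay.
( sleep 1; touch demo.las ) &

# Poll for the output until a deadline, as --latency-wait=15 would.
deadline=$(( SECONDS + 15 ))
until [ -e demo.las ] || (( SECONDS >= deadline )); do
    sleep 0.2
done

if [ -e demo.las ]; then
    echo "output appeared within the wait window"
else
    echo "output still missing after 15 s"
fi
```

If raising the wait further (e.g. `snakemake --latency-wait=60 ...`) does not help, the file was most likely never produced and the rule's own log is the place to look.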

Hello! Can I ask for your help troubleshooting? I keep getting this error at the self-alignment block; this is one of the errors. I have a 3 Gb genome I'd like to gap-fill with 33x coverage of HiFi reads.

Error in rule self_alignment_block:
    jobid: 114
    input: dentist.yml, workdir/sp_buf_purged_scaffolded_chrlevel_draft1.dam, workdir/.sp_buf_purged_scaffolded_chrlevel_draft1.bps, workdir/.sp_buf_purged_scaffolded_chrlevel_draft1.hdr, workdir/.sp_buf_purged_scaffolded_chrlevel_draft1.idx, workdir/.assembly.sp_buf_purged_scaffolded_chrlevel_draft1, workdir/.sp_buf_purged_scaffolded_chrlevel_draft1.dust.anno, workdir/.sp_buf_purged_scaffolded_chrlevel_draft1.dust.data, workdir/.sp_buf_purged_scaffolded_chrlevel_draft1.tan.anno, workdir/.sp_buf_purged_scaffolded_chrlevel_draft1.tan.data
    output: workdir/sp_buf_purged_scaffolded_chrlevel_draft1.10.sp_buf_purged_scaffolded_chrlevel_draft1.12.las, workdir/sp_buf_purged_scaffolded_chrlevel_draft1.12.sp_buf_purged_scaffolded_chrlevel_draft1.10.las
    log: logs/self-alignment.sp_buf_purged_scaffolded_chrlevel_draft1.10.12.log (check log file(s) for error message)
    conda-env: path/to/folder/.snakemake/conda/e844e04141fb5a79087f06209dc3fe6c_
    shell:
        
            {
                cd workdir
                daligner -I '-T8' -l500 -e0.7 -mdust -mtan sp_buf_purged_scaffolded_chrlevel_draft1.10 sp_buf_purged_scaffolded_chrlevel_draft1.12
            } &> logs/self-alignment.sp_buf_purged_scaffolded_chrlevel_draft1.10.12.log
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: 15340668

Error executing rule self_alignment_block on cluster (jobid: 114, external: 15340668, jobscript: path/to/folder/.snakemake/tmp.2nk_qpv5/snakejob.self_alignment_block.114.sh). For error details see the cluster log and the log files of the involved rule(s).

Unfortunately, the log file contains nothing at all besides this:

(base) [user@l01 logs]$ cat self-alignment.sp_buf_purged_scaffolded_chrlevel_draft1.log
  Merging 144 files totaling 212,461,282 records

Thank you very much!

Getting started

Hi,

I am trying to use dentist for the first time but am having some trouble getting started. I am running dentist using singularity and have snakemake version 6.0.0 installed.

I downloaded the dentist.json and snakemake.yml files and edited them to include the relevant paths and also some options mentioned in the README (see below).

I first tried to validate the config files using the recommended command.

snakemake --configfile=snakemake.yml --use-singularity --cores=32 -f -- validate_dentist_config

INFO:    Convert SIF file to sandbox...
INFO:    Cleaning up image...
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 32
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       validate_dentist_config
    1
Select jobs to execute...

[Mon Mar  8 15:52:21 2021]
localrule validate_dentist_config:
    input: dentist.json
    jobid: 0

INFO:    Using cached SIF image
INFO:    Convert SIF file to sandbox...
INFO:    Cleaning up image...
Job counts:
    count   jobs
    1       validate_dentist_config
    1
[Mon Mar  8 15:52:31 2021]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /scratch/amackintosh/DENTIST_02/.snakemake/log/2021-03-08T155213.682050.snakemake.log

All seemed to work fine, so I then tried to run it.

snakemake --configfile=snakemake.yml --use-singularity --cores=32

INFO:    Convert SIF file to sandbox...
INFO:    Cleaning up image...
Building DAG of jobs...
MissingInputException in line 1091 of /scratch/amackintosh/DENTIST_02/Snakefile:
Missing input files for rule ref_vs_reads_alignment_block:
/scratch/amackintosh/DENTIST_02/.brenthis_ino.SP_BI_364.v1_1.contigs.dentist-self.data
/scratch/amackintosh/DENTIST_02/.brenthis_ino.SP_BI_364.v1_1.contigs.dust.anno
/scratch/amackintosh/DENTIST_02/.brenthis_ino.SP_BI_364.v1_1.contigs.dentist-self.anno
/scratch/amackintosh/DENTIST_02/.brenthis_ino.SP_BI_364.v1_1.contigs.tan.data
/scratch/amackintosh/DENTIST_02/.brenthis_ino.SP_BI_364.v1_1.contigs.tan.anno
scratch/amackintosh/DENTIST_02/.brenthis_ino.SP_BI_364.v1_1.contigs.dust.data

I am not used to using snakemake but I assume the missing input files are because a preceding process could not be executed. Is it possible that the problem lies in how I filled out the json and yaml files? The part of the json I edited the most looks like this (below); could any of these options be causing problems?

	"// This is a comment and will be ignored": [
	"You must set at least either `ploidy` and `read-coverage`",
	"or `max-coverage-reads` and `min-coverage-reads`."
	],
	"__default__": {
		"read-coverage": 66.9,
		"min-reads-per-pile-up": 3,
		"min-spanning-reads": 3,
		"join-policy": "contigs",
		"ploidy": 2,
		"max-coverage-self": 3,
		"verbose": 2,

Any help would be really appreciated.

Best,

Alex

example failing due to computeintrinsicqv

Hi! I am running DENTIST v3.0.0 on our cluster and the md5 check fails on the example. I think there is again an issue with computeintrinsicqv. Can you spot what's going on?

Here's the first exitStatus != 0 from process.1.log:

{"thread":47412916968384,
"logLevel":"diagnostic","
state":"post",
"command":
["computeintrinsicqv","-d19","/tmp/slurm_schradel.11559699/dentist-processPileUps-F2nX6W/pileup-52b-53f.db","/tmp/slurm_schradel.11559699/dentist-processPileUps-F2nX6W/pileup-52b-53f.pileup-52b-53f-filtered-chained-filtered.las"],
"output":["[V] processing /tmp/slurm_schradel.11559699/dentist-processPileUps-F2nX6W/pileup-52b-53f.pileup-52b-53f-filtered-chained-filtered.las",
"DatabaseFile::getTrimmedBlockInterval(): invalid block id 11559699","","/home/s/schradel/software/dentist.v3.0.0.x86_64/bin/computeintrinsicqv(+0x637f9)[0x563c2f02c7f9]","/home/s/schradel/software/dentist.v3.0.0.x86_64/bin/computeintrinsicqv(+0x5b42c)[0x563c2f02442c]","/home/s/schradel/software/dentist.v3.0.0.x86_64/bin/computeintrinsicqv(+0x16f24)[0x563c2efdff24]","/home/s/schradel/software/dentist.v3.0.0.x86_64/bin/computeintrinsicqv(+0x11ffd)[0x563c2efdaffd]","/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b465d0e6555]","/home/s/schradel/software/dentist.v3.0.0.x86_64/bin/computeintrinsicqv(+0x15479)[0x563c2efde479]",""],
"exitStatus":1,
"timestamp":637889081198764946,
"action":"execute",
"type":"command"}

Here's the output of the md5check:

gap-closed.closed-gaps.bed: FAILED
gap-closed.fasta: FAILED
workdir/.reference.bps: OK
workdir/.reference.dentist-reads-H.anno: FAILED
workdir/.reference.dentist-reads-H.data: FAILED
workdir/.reference.dentist-reads.anno: OK
workdir/.reference.dentist-reads.data: OK
workdir/.reference.dentist-self-H.anno: FAILED
workdir/.reference.dentist-self-H.data: FAILED
workdir/.reference.dentist-self.anno: OK
workdir/.reference.dentist-self.data: OK
workdir/.reference.dust.anno: OK
workdir/.reference.dust.data: OK
workdir/.reference.hdr: OK
workdir/.reference.idx: OK
workdir/.reference.tan-H.anno: FAILED
workdir/.reference.tan-H.data: FAILED
workdir/.reference.tan.anno: OK
workdir/.reference.tan.data: OK
workdir/.gap-closed-preliminary.bps: FAILED
workdir/.gap-closed-preliminary.closed-gaps.anno: FAILED
workdir/.gap-closed-preliminary.closed-gaps.data: FAILED
workdir/.gap-closed-preliminary.dentist-self.anno: FAILED
workdir/.gap-closed-preliminary.dentist-self.data: FAILED
workdir/.gap-closed-preliminary.dentist-weak-coverage.anno: FAILED
workdir/.gap-closed-preliminary.dentist-weak-coverage.data: OK
workdir/.gap-closed-preliminary.dust.anno: FAILED
workdir/.gap-closed-preliminary.dust.data: FAILED
workdir/.gap-closed-preliminary.hdr: OK
workdir/.gap-closed-preliminary.idx: FAILED
workdir/.gap-closed-preliminary.tan.anno: FAILED
workdir/.gap-closed-preliminary.tan.data: FAILED
workdir/.reads.bps: OK
workdir/.reads.dentist-reads-10B.anno: OK
workdir/.reads.dentist-reads-10B.data: OK
workdir/.reads.dentist-reads-11B.anno: OK
workdir/.reads.dentist-reads-11B.data: OK
workdir/.reads.dentist-reads-12B.anno: OK
workdir/.reads.dentist-reads-12B.data: OK
workdir/.reads.dentist-reads-1B.anno: OK
workdir/.reads.dentist-reads-1B.data: OK
workdir/.reads.dentist-reads-2B.anno: OK
workdir/.reads.dentist-reads-2B.data: OK
workdir/.reads.dentist-reads-3B.anno: OK
workdir/.reads.dentist-reads-3B.data: OK
workdir/.reads.dentist-reads-4B.anno: OK
workdir/.reads.dentist-reads-4B.data: OK
workdir/.reads.dentist-reads-5B.anno: OK
workdir/.reads.dentist-reads-5B.data: OK
workdir/.reads.dentist-reads-6B.anno: OK
workdir/.reads.dentist-reads-6B.data: OK
workdir/.reads.dentist-reads-7B.anno: OK
workdir/.reads.dentist-reads-7B.data: OK
workdir/.reads.dentist-reads-8B.anno: OK
workdir/.reads.dentist-reads-8B.data: OK
workdir/.reads.dentist-reads-9B.anno: OK
workdir/.reads.dentist-reads-9B.data: OK
workdir/.reads.dentist-self-10B.anno: OK
workdir/.reads.dentist-self-10B.data: OK
workdir/.reads.dentist-self-11B.anno: OK
workdir/.reads.dentist-self-11B.data: OK
workdir/.reads.dentist-self-12B.anno: OK
workdir/.reads.dentist-self-12B.data: OK
workdir/.reads.dentist-self-1B.anno: OK
workdir/.reads.dentist-self-1B.data: OK
workdir/.reads.dentist-self-2B.anno: OK
workdir/.reads.dentist-self-2B.data: OK
workdir/.reads.dentist-self-3B.anno: OK
workdir/.reads.dentist-self-3B.data: OK
workdir/.reads.dentist-self-4B.anno: OK
workdir/.reads.dentist-self-4B.data: OK
workdir/.reads.dentist-self-5B.anno: OK
workdir/.reads.dentist-self-5B.data: OK
workdir/.reads.dentist-self-6B.anno: OK
workdir/.reads.dentist-self-6B.data: OK
workdir/.reads.dentist-self-7B.anno: OK
workdir/.reads.dentist-self-7B.data: OK
workdir/.reads.dentist-self-8B.anno: OK
workdir/.reads.dentist-self-8B.data: OK
workdir/.reads.dentist-self-9B.anno: OK
workdir/.reads.dentist-self-9B.data: OK
workdir/.reads.idx: OK
workdir/.reads.tan-10B.anno: OK
workdir/.reads.tan-10B.data: OK
workdir/.reads.tan-11B.anno: OK
workdir/.reads.tan-11B.data: OK
workdir/.reads.tan-12B.anno: OK
workdir/.reads.tan-12B.data: OK
workdir/.reads.tan-1B.anno: OK
workdir/.reads.tan-1B.data: OK
workdir/.reads.tan-2B.anno: OK
workdir/.reads.tan-2B.data: OK
workdir/.reads.tan-3B.anno: OK
workdir/.reads.tan-3B.data: OK
workdir/.reads.tan-4B.anno: OK
workdir/.reads.tan-4B.data: OK
workdir/.reads.tan-5B.anno: OK
workdir/.reads.tan-5B.data: OK
workdir/.reads.tan-6B.anno: OK
workdir/.reads.tan-6B.data: OK
workdir/.reads.tan-7B.anno: OK
workdir/.reads.tan-7B.data: OK
workdir/.reads.tan-8B.anno: OK
workdir/.reads.tan-8B.data: OK
workdir/.reads.tan-9B.anno: OK
workdir/.reads.tan-9B.data: OK
workdir/reference.dam: OK
workdir/gap-closed-preliminary.dam: FAILED
workdir/gap-closed-preliminary.fasta: FAILED
workdir/reads.db: OK
workdir/validation-report.json: OK
reference.fasta: OK
reads.fasta: OK
md5sum: WARNING: 21 computed checksums did NOT match

I am running this on centos

LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.9.2009 (Core)
Release:	7.9.2009
Codename:	Core

Questions about Input Reads

Hello, I have two questions regarding input reads:

1-Is it recommended to do error correction on the PB long reads before plugging them into dentist?

2-Is there a maximum coverage recommended for the input reads? Your example is 25x, so I was just wondering if there is an upper limit.

Thanks!

Math error in Snakefile

Hi,
I'm trying to use Dentist to gapclose some test data using randomly generated fasta sequence and low error long reads.

The provided dataset is working well on our SLURM cluster, but i can't work with my own data, having trouble with a math error :

Updating job make_merge_config.
InputFunctionException in line 1279 of /work2/project/seqoccin/assemblies/gap_closing/tests/manual_tests/dentist/Snakefile:
Error:
ValueError: math domain error
Wildcards:

Traceback:
File "/work2/project/seqoccin/assemblies/gap_closing/tests/manual_tests/dentist/Snakefile", line 1282, in
File "/work2/project/seqoccin/assemblies/gap_closing/tests/manual_tests/dentist/Snakefile", line 614, in insertions_batches

I realised that line 614 in the Snakefile is:
num_digits = int(ceil(log10(num_batches)))

num_batches = 0, and log10(0) stops the execution of the pipeline. Any solution to solve this? (And why is num_batches 0?)

Thanks !!
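For what it's worth, the failing expression can be guarded so that an empty batch list falls back to one digit instead of raising. A sketch of the arithmetic only (the underlying question of why num_batches ends up 0 still needs answering):

```shell
num_batches=0

# log10(0) is undefined, so compute the zero-padding width only for a
# positive batch count and fall back to 1 digit otherwise. The python3
# one-liner mirrors the Snakefile's ceil(log10(...)) expression.
if (( num_batches > 0 )); then
    num_digits=$(python3 -c "from math import ceil, log10; print(max(int(ceil(log10($num_batches))), 1))")
else
    num_digits=1
fi
echo "num_digits=$num_digits"   # → num_digits=1
```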

Missing Snakefile from example test dataset

I believe this is a distinct issue from #21 as it pertains to the Snakefile rather than the .yml files that you helpfully signposted. I am using the https://github.com/a-ludi/dentist-example/releases/download/v1.0.2-2/dentist-example.tar.gz example dataset.

When I execute
(base) ubuntu@bio-xanthomonas:~/dentist-example$ snakemake --configfile=snakemake.yml --use-singularity --cores=all
I get this error message:
Error: no Snakefile found, tried Snakefile, snakefile, workflow/Snakefile, workflow/snakefile.

So, I guessed the obvious thing to do was to download this Snakefile: https://github.com/a-ludi/dentist/blob/develop/snakemake/Snakefile.

However, when I try again to execute the above command line, with this Snakefile in the current directory, I get:

IndentationError in line 113 of <tokenize>:
unindent does not match any outer indentation level (<tokenize>, line 113)
 File "/usr/lib/python3.8/tokenize.py", line 512, in _tokenize

I suspect that this might be because I am not using the correct Snakefile?

Rule `ref_vs_reads_alignment_block` keeps failing

I'd like this ticket to be reopened, please. The error is still there with Dentist 1.0.1

Error in rule ref_vs_reads_alignment_block:
    jobid: 977
    output: workdir/scaffolds_FINAL.non-hifi.1kb.128.las, workdir/non-hifi.1kb.128.scaffolds_FINAL.las
    log: logs/ref-vs-reads-alignment.128.log (check log file(s) for error message)
    shell:
        
            {
                cd workdir
                damapper -C '-T8' -e0.7 -mdust -mdentist-self -mtan scaffolds_FINAL non-hifi.1kb.128
                LAcheck -v scaffolds_FINAL non-hifi.1kb scaffolds_FINAL.non-hifi.1kb.128.las || { echo 'Check failed. Possible solutions:

Duplicate LAs: can be fixed by LAsort from 2020-03-22 or later.

In order to ignore checks entirely you may use the environment variable SKIP_LACHECK=1. Use only if you are positive the files are in fact OK!'; (( ${SKIP_LACHECK:-0} != 0 )); }
                LAcheck -v non-hifi.1kb scaffolds_FINAL non-hifi.1kb.128.scaffolds_FINAL.las || { echo 'Check failed. Possible solutions:

Duplicate LAs: can be fixed by LAsort from 2020-03-22 or later.

In order to ignore checks entirely you may use the environment variable SKIP_LACHECK=1. Use only if you are positive the files are in fact OK!'; (( ${SKIP_LACHECK:-0} != 0 )); }
            } &> logs/ref-vs-reads-alignment.128.log
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: 208494 logs/cluster/ref_vs_reads_alignment_block/block_reads=128/jobid977_e17a6776-b2a4-4570-aba0-d97bd422ba29.out

Error executing rule ref_vs_reads_alignment_block on cluster (jobid: 977, external: 208494 logs/cluster/ref_vs_reads_alignment_block/block_reads=128/jobid977_e17a6776-b2a4-4570-aba0-d97bd422ba29.out, jobscript: /lustre/scratch116/tol/teams/team308/users/mm49/tmp/non-hifi-reads2/.snakemake/tmp.4pilq3ef/snakejob.ref_vs_reads_alignment_block.977.sh). For error details see the cluster log and the log files of the involved rule(s).

Snakemake retries the jobs a few times, but they keep on failing for the same reason, and at some point snakemake gives up and quits.

The image is v1.0.1:

$ singularity inspect dentist_v1.0.1.sif 
org.label-schema.build-arch: amd64
org.label-schema.build-date: Thursday_22_April_2021_11:19:9_UTC
org.label-schema.schema-version: 1.0
org.label-schema.usage.singularity.deffile.bootstrap: docker
org.label-schema.usage.singularity.deffile.from: aludi/dentist:v1.0.1
org.label-schema.usage.singularity.version: 3.7.2

Originally posted by @muffato in #15 (comment)

Error in rule validate_regions_block

Hi,
I used the latest version of DENTIST to fill gaps in a genome using ONT reads. The genome size is about 300 Mb, and the draft genome has 546 contigs. The ONT read coverage is about 10X.
I used the following command:
snakemake --configfile=snakemake.yml --use-conda --cores=all
And got the following error message which I do not understand. Can anyone help? Thanks a lot in advance!

Error in rule validate_regions_block:
    jobid: 33
    input: workdir/gap-closed-preliminary.dam, workdir/.gap-closed-preliminary.bps, workdir/.gap-closed-preliminary.hdr, workdir/.gap-closed-preliminary.idx, workdir/reads.dam, workdir/.reads.bps, workdir/.reads.hdr, workdir/.reads.idx, workdir/.gap-closed-preliminary.closed-gaps.anno, workdir/.gap-closed-preliminary.closed-gaps.data, workdir/gap-closed-preliminary.1.reads.las
    output: workdir/validation-report.1.json, workdir/.gap-closed-preliminary.1.dentist-weak-coverage.anno, workdir/.gap-closed-preliminary.1.dentist-weak-coverage.data
    log: logs/validate-regions-block.1.log (check log file(s) for error details)
    conda-env: /storage/zhenyingLab/houyan/software_big/dentist.v4.0.0.x86_64/dentist-example/.snakemake/conda/b8ca5a6181a223c6ff65c49b8c435efe_
    shell:
        dentist validate-regions --config=dentist.yml --threads=1 --weak-coverage-mask=1.dentist-weak-coverage workdir/gap-closed-preliminary.dam workdir/reads.dam workdir/gap-closed-preliminary.1.reads.las closed-gaps > workdir/validation-report.1.json 2> logs/validate-regions-block.1.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

And I checked the file logs/validate-regions-block.1.log

{"executableVersion":"v4.0.0","gitCommit":"4cc3373284c0cbb2718a8d88d4db615b5f3c49d2","refDb":"workdir/gap-closed-preliminary.dam","readsDb":"workdir/reads.dam","readsAlignmentFile":"workdir/gap-closed-preliminary.1.reads.las","regions":"closed-gaps","configFile":"dentist.yml","help":false,"minCoverageReads":2,"coverageBoundsReads":[2,4294967295],"minSpanningReads":3,"reportAll":false,"ploidy":2,"properAlignmentAllowance":100,"quiet":false,"readCoverage":10,"regionContext":1000,"revertOptionNames":[],"weakCoverageWindow":500,"tracePointDistance":100,"numThreads":1,"verbosity":2,"weakCoverageMask":"1.dentist-weak-coverage"}
{"thread":139971147566400,"logLevel":"diagnostic","timestamp":638084339232475094,"function":"dentist.commands.validateRegions.RegionsValidator.run","state":"enter"}
{"thread":139971147566400,"logLevel":"diagnostic","timestamp":638084339232475484,"function":"dentist.commands.validateRegions.RegionsValidator.readInputs","state":"enter"}
{"thread":139971147566400,"logLevel":"diagnostic","state":"pre","command":["DBshow","-n","workdir/gap-closed-preliminary.dam"],"timestamp":638084339232475708,"action":"execute","type":"command"}
{"thread":139971147566400,"logLevel":"diagnostic","state":"post","command":["DBshow","-n","workdir/gap-closed-preliminary.dam"],"output":[">NW_023496800.1\tscaffold-1 :: Contig 0[0,1966]",">NW_023496800.1\tscaffold-1 :: Contig 1[2020,12298]",">NW_023496800.1\tscaffold-1 :: Contig 2[12352,17346]",">NW_023496800.1\tscaffold-1 :: Contig 3[17400,20027]",">NW_023496800.1\tscaffold-1 :: Contig 4[33786,35797]",">NW_023496800.1\tscaffold-1 :: Contig 5[35819,37076]",">NW_023496800.1\tscaffold-1 :: Contig 6[43260,44464]",">NW_023496800.1\tscaffold-1 :: Contig 7[44848,47029]",">NW_023496800.1\tscaffold-1 :: Contig 8[58668,62672]",">NW_023496800.1\tscaffold-1 :: Contig 9[90095,91767]",">NW_023496800.1\tscaffold-1 :: Contig 10[92382,93419]",">NW_023496800.1\tscaffold-1 :: Contig 11[100951,102012]",">NW_023496800.1\tscaffold-1 :: Contig 12[116195,118594]",">NW_023496800.1\tscaffold-1 :: Contig 13[120683,122409]",">NW_023496800.1\tscaffold-1 :: Contig 14[126485,127839]",">NW_023496800.1\tscaffold-1 :: Contig 15[135782,136895]",">NW_023496800.1\tscaffold-1 :: Contig 16[137681,138707]",">NW_023496800.1\tscaffold-1 :: Contig 17[138751,140574]",">NW_023496800.1\tscaffold-1 :: Contig 18[144047,145291]",">NW_023496800.1\ts"],"exitStatus":0,"timestamp":638084339232987997,"action":"execute","type":"command"}
{"thread":139971147566400,"logLevel":"diagnostic","state":"pre","command":["DBdump","workdir/gap-closed-preliminary.dam"],"timestamp":638084339234412205,"action":"execute","type":"pipe"}
{"thread":139971147566400,"logLevel":"diagnostic","state":"pre","command":["DBdump","-h","workdir/gap-closed-preliminary.dam"],"timestamp":638084339234592626,"action":"execute","type":"pipe"}
{"thread":139971147566400,"logLevel":"diagnostic","state":"pre","command":["DBdump","workdir/reads.dam"],"timestamp":638084339238570890,"action":"execute","type":"pipe"}
{"thread":139971147566400,"logLevel":"diagnostic","state":"pre","command":["DBdump","-h","workdir/reads.dam"],"timestamp":638084339239372035,"action":"execute","type":"pipe"}
{"thread":139971147566400,"logLevel":"diagnostic","state":"pre","command":["DBdump","workdir/gap-closed-preliminary.dam"],"timestamp":638084339295504008,"action":"execute","type":"pipe"}
{"thread":139971147566400,"logLevel":"diagnostic","state":"pre","command":["DBdump","workdir/gap-closed-preliminary.dam"],"timestamp":638084339295554201,"action":"execute","type":"pipe"}
{"thread":139971147566400,"logLevel":"info","numRegions":0,"contigABounds":[10,29154],"numContigIds":0,"timestamp":638084339295636137,"numReadIds":0}
{"thread":139971147566400,"logLevel":"diagnostic","timestamp":638084339295636589,"function":"dentist.commands.validateRegions.RegionsValidator.readInputs","state":"exit","timeElapsed":63160979}
{"thread":139971147566400,"logLevel":"diagnostic","timestamp":638084339295636834,"function":"dentist.commands.validateRegions.RegionsValidator.validateRegions","state":"enter"}
{"thread":139971147566400,"logLevel":"diagnostic","timestamp":638084339295639196,"function":"dentist.commands.validateRegions.RegionsValidator.validateRegions","state":"exit","timeElapsed":2282}
{"thread":139971147566400,"logLevel":"diagnostic","timestamp":638084339295639563,"function":"dentist.commands.validateRegions.RegionsValidator.run","state":"exit","timeElapsed":63164067}
Error: object.Exception@/home/alu/.local/share/mambaforge/envs/dentist/conda-bld/dentist-core_1663094479349/work/.dlang/dmd-2.100.2/linux/bin64/../../src/phobos/std/parallelism.d(1636): workUnitSize must be > 0.
----------------
??:? pure @safe noreturn std.exception.bailOut!(Exception).bailOut(immutable(char)[], ulong, scope const(char)[]) [0x12c8a82]
??:? pure @safe bool std.exception.enforce!().enforce!(bool).enforce(bool, lazy const(char)[], immutable(char)[], ulong) [0x12c89fc]
??:? pure @safe std.parallelism.ParallelForeach!(std.range.Zip!(dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], uint[2][], uint[][]).Zip).ParallelForeach std.parallelism.TaskPool.parallel!(std.range.Zip!(dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], uint[2][], uint[][]).Zip).parallel(std.range.Zip!(dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], uint[2][], uint[][]).Zip, ulong) [0x1065468]
??:? pure @safe std.parallelism.ParallelForeach!(std.range.Zip!(dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], uint[2][], uint[][]).Zip).ParallelForeach std.parallelism.TaskPool.parallel!(std.range.Zip!(dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], uint[2][], uint[][]).Zip).parallel(std.range.Zip!(dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], uint[2][], uint[][]).Zip) [0x1065417]
??:? @safe std.parallelism.ParallelForeach!(std.range.Zip!(dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], uint[2][], uint[][]).Zip).ParallelForeach std.parallelism.parallel!(std.range.Zip!(dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], uint[2][], uint[][]).Zip).parallel(std.range.Zip!(dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[], uint[2][], uint[][]).Zip) [0x1064dd7]
??:? void dentist.commands.validateRegions.RegionsValidator.validateRegions() [0x10584c8]
??:? void dentist.commands.validateRegions.RegionsValidator.run() [0x1057712]
??:? void dentist.commands.validateRegions.execute(in dentist.commandline.OptionsFor!(16).OptionsFor) [0x105768e]
??:? dentist.commandline.ReturnCode dentist.commandline.runCommand!(16).runCommand(in immutable(char)[][]) [0x10132ed]
??:? dentist.commandline.ReturnCode dentist.commandline.run(in immutable(char)[][]) [0xf1a5b9]
??:? _Dmain [0xd94f07]

Edit 2023-01-05: fixed formatting for better readability.

Pacbio header line format error

Hi,
I am getting a pacbio fasta header format error and I was wondering what format it is looking for? Here is a link to the terminal output.

The pacbio fasta headers look like this: >pacbio_SRR6282347.1.1 1 length=6524

There is a second error message I am not sure about either. The log file shows a segmentation fault (core dump):

/bin/bash: line 5: 208846 Segmentation fault      (core dumped) datander '-T70' -s126 -l500 -e0.7 Ajap_genome.2
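If the header format is indeed the problem, fasta2DB (DAZZ_DB) traditionally expects PacBio-style headers of the form `>movie/well/start_end`. A hypothetical rewrite in that direction is sketched below; the movie name `reads` and the 0-based range are made up for illustration, and this assumes the error really refers to that convention:

```shell
# Sample input mimicking the headers from the report.
cat > reads.fasta <<'EOF'
>pacbio_SRR6282347.1.1 1 length=8
ACGTACGT
EOF

# Rewrite each header ">... length=N" into ">reads/<n>/0_N", numbering
# the reads consecutively; sequence lines pass through unchanged.
awk '/^>/ { n++; len = $0; sub(/.*length=/, "", len)
            printf(">reads/%d/0_%s\n", n, len); next }
     { print }' reads.fasta > reads.renamed.fasta

head -1 reads.renamed.fasta   # → >reads/1/0_8
```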

Conda-install-error

When I run "snakemake --configfile=snakemake.yml --use-conda" or "snakemake --configfile=snakemake.yml --use-conda --profile=slurm",
it tells me:
SyntaxError in line 920 of /public/home/GL_lixn/biosoft/dentist.v3.0.0.x86_64/Snakefile:
Unexpected keyword container in rule definition (Snakefile, line 920)

No joined contigs with greedy configuration

Hi,

First, thanks for developing Dentist!

I have an issue very similar to @oushujun in #22: basically I am trying to scaffold contigs with another assembly, and I am using the greedy configuration file for that.

After running Dentist, no contigs are joined, and the gap-closed.closed-gaps.bed file is empty. I ran the lost-gaps.py script, which gives the following report:

"In this run of DENTIST 4837 potentially closable gaps were not closed. More details:

Hint: use DBshow -n workdir/[REFERENCE].dam | cat -n to translate contig numbers to FASTA
coordinates.

  • lost 4 in collect phase
    • lost 0 gap(s) because of insufficient number of spanning reads (--min-spanning-reads=1)
    • lost 4 gap(s) because a scaffolding conflict was detected
      • conflicting gap closings: 1890-5809 (1 reads), 1890-28348 (1 reads)
      • conflicting gap closings: 2063-6912 (1 reads), 2063-15318 (1 reads)
      • conflicting gap closings: 5927-21838 (1 reads), 6196-21838 (1 reads)
      • conflicting gap closings: 4971-27615 (1 reads), 23980-27615 (1 reads)
  • lost 4833 in process phase
    • skipped 1389 read pile ups because of errors
      • consensus failed (1274 times)
      • other (115 times)
    • skipped 3444 read pile ups because of --only=spanning
  • lost 0 in output phase
    • skipped 0 insertion(s) because of --max-insertion-error=0.1
    • skipped 0 insertion(s) because of --join-policy=contigs
    • skipped 0 extension(s) because of --min-extension-length=100"

Looking into the process phase logs, I found 1273 errors reading "consensus alignment is invalid" and 230 errors "process DASqv returned with non-zero exit code 1: DASqv: Average coverage is too low (< 4X), cannot infer qv's\n".

Should I change some parameters in the configuration file?

Best regards
Yann

LAsort fails in daligner2.0

Hi Arne,

The LAsort and LAmerge calls seem to have a problem when I use them from the daligner that I installed from git; notice the “.N@” that gets appended. The LAsort and LAmerge from conda seem to work fine. However, the LAcheck from conda fails the “ref_vs_reads_alignment_block” rule because it tries to append “.las” to the second argument. The LAcheck from the daligner that I installed from git does work fine, though.

daligner2.0: Command Failed:
LAsort /mnt/shared_tmp/376067.1.work.q/daligner.7314/scaffolds_FINAL.1.scaffolds_FINAL.7.N@

So, the furthest that we’ve progressed is with local damapper and LAcheck. The other prerequisites are from conda.

Originally posted by @BradleyRan in #3 (comment)

rule mask_self fails on Docker container

I'm trying to use dentist on a docker container, but the example always fails during the mask_self step. From my mac, I'm starting the container using docker run -it --rm=true --platform linux/x86_64 centos:7 /bin/bash. I'm then running the following:

yum update -y -q
yum install -y wget

wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
sh Mambaforge-Linux-x86_64.sh -b -p /opt/mamba3
export PATH=/opt/mamba3/bin:$PATH
rm Mambaforge-Linux-x86_64.sh

mamba install -c conda-forge -c bioconda -y snakemake

wget https://github.com/a-ludi/dentist/releases/download/v3.0.0/dentist-example.tar.gz
tar -xzf dentist-example.tar.gz
cd dentist-example

snakemake --configfile=snakemake.yml --use-conda --cores=1

This causes the following error:

Error in rule mask_self:
    jobid: 11
    output: workdir/.reference.dentist-self.anno, workdir/.reference.dentist-self.data
    log: logs/mask-self.reference.log (check log file(s) for error message)
    conda-env: /dentist-example/.snakemake/conda/850bc5c09e81d3d6b875839f8fe0ed70
    shell:
        dentist mask --config=dentist.yml  workdir/reference.dam workdir/reference.reference.las dentist-self 2> logs/mask-self.reference.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

The contents of logs/mask-self.reference.log:

{"executableVersion":"v3.0.0","refDb":"workdir/reference.dam","readsDb":"","dbAlignmentFile":"workdir/reference.reference.las","repeatMask":"dentist-self","configFile":"dentist.yml","debugRepeatMasks":false,"help":false,"maxCoverageReads":4294967295,"coverageBoundsReads":[0,0],"maxCoverageSelf":3,"coverageBoundsSelf":[0,3],"maxImproperCoverageReads":4294967295,"improperCoverageBoundsReads":[0,0],"properAlignmentAllowance":100,"quiet":false,"readCoverage":20,"revertOptionNames":[],"tracePointDistance":100,"verbosity":2}
{"thread":274939902976,"logLevel":"diagnostic","timestamp":637783965479794038,"function":"dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.run","state":"enter"}
{"thread":274939902976,"logLevel":"diagnostic","timestamp":637783965479801084,"function":"dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.readInputs","state":"enter"}
{"thread":274939902976,"logLevel":"diagnostic","state":"pre","command":["DBdump","workdir/reference.dam"],"timestamp":637783965479844193,"action":"execute","type":"pipe"}
{"thread":274939902976,"logLevel":"diagnostic","timestamp":637783965480006401,"function":"dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.readInputs","state":"exit","timeElapsed":203652}
{"thread":274939902976,"logLevel":"diagnostic","timestamp":637783965480009609,"function":"dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.run","state":"exit","timeElapsed":208842}
core.exception.AssertError@source/dentist/util/process.d(215): Attempting to fetch the front of an empty LinesPipe
----------------
??:? _d_assert_msg [0x40010b3e06]
??:? uint dentist.dazzler.numDbRecords(in immutable(char)[]) [0x4000dbef18]
??:? dentist.util.process.LinesPipe!(dentist.util.process.ProcessInfo, 0).LinesPipe dentist.dazzler.dbdump!(ulong[]).dbdump(in immutable(char)[], ulong[], in immutable(char)[][]) [0x4000dcbdad]
??:? uint[] dentist.dazzler.LocalAlignmentReader.contigLengths(immutable(char)[]) [0x4000dbfff3]
??:? dentist.dazzler.LocalAlignmentReader dentist.dazzler.LocalAlignmentReader.__ctor(const(immutable(char)[]), immutable(char)[], immutable(char)[], dentist.dazzler.BufferMode, dentist.common.alignments.base.TracePoint[]) [0x4000dbfc21]
??:? void dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.readInputs() [0x4000c73d52]
??:? void dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.run() [0x4000c73b09]
??:? dentist.commandline.ReturnCode dentist.commandline.runCommand!(2).runCommand(in immutable(char)[][]) [0x4000bbd656]
??:? dentist.commandline.ReturnCode dentist.commandline.run(in immutable(char)[][]) [0x4000b2850d]
??:? _Dmain [0x40009bee67]

Strangely, I can run the same series of commands on a CentOS 7 cluster with no issue. The contents of logs/mask-self.reference.log for the successful run are

{"executableVersion":"v3.0.0","refDb":"workdir/reference.dam","readsDb":"","dbAlignmentFile":"workdir/reference.reference.las","repeatMask":"dentist-self","configFile":"dentist.yml","debugRepeatMasks":false,"help":false,"maxCoverageReads":4294967295,"coverageBoundsReads":[0,0],"maxCoverageSelf":3,"coverageBoundsSelf":[0,3],"maxImproperCoverageReads":4294967295,"improperCoverageBoundsReads":[0,0],"properAlignmentAllowance":100,"quiet":false,"readCoverage":20,"revertOptionNames":[],"tracePointDistance":100,"verbosity":2}
{"thread":22377401060544,"logLevel":"diagnostic","timestamp":637783913924743264,"function":"dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.run","state":"enter"}
{"thread":22377401060544,"logLevel":"diagnostic","timestamp":637783913924743707,"function":"dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.readInputs","state":"enter"}
{"thread":22377401060544,"logLevel":"diagnostic","state":"pre","command":["DBdump","workdir/reference.dam"],"timestamp":637783913924766164,"action":"execute","type":"pipe"}
{"thread":22377401060544,"logLevel":"diagnostic","state":"pre","command":["DBdump","-h","workdir/reference.dam"],"timestamp":637783913924790405,"action":"execute","type":"pipe"}
{"thread":22377401060544,"logLevel":"diagnostic","state":"pre","command":["DBdump","workdir/reference.dam"],"timestamp":637783913924830211,"action":"execute","type":"pipe"}
{"thread":22377401060544,"logLevel":"diagnostic","state":"pre","command":["DBdump","-h","workdir/reference.dam"],"timestamp":637783913924851842,"action":"execute","type":"pipe"}
{"thread":22377401060544,"logLevel":"diagnostic","timestamp":637783913924889160,"function":"dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.readInputs","state":"exit","timeElapsed":145216}
{"thread":22377401060544,"logLevel":"diagnostic","timestamp":637783913924889516,"function":"dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.assessRepeatStructure","state":"enter"}
{"thread":22377401060544,"logLevel":"diagnostic","state":"pre","command":["DBdump","workdir/reference.dam"],"timestamp":637783913924890069,"action":"execute","type":"pipe"}
{"thread":22377401060544,"logLevel":"diagnostic","state":"pre","command":["DBdump","-r","-h","workdir/reference.dam"],"timestamp":637783913924911577,"action":"execute","type":"pipe"}
{"alignmentType":"self","thread":22377401060544,"logLevel":"diagnostic","timestamp":637783913925023014,"repetitiveRegions":null,"numRepetitiveRegions":185}
{"thread":22377401060544,"logLevel":"diagnostic","timestamp":637783913925023427,"function":"dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.assessRepeatStructure","state":"exit","timeElapsed":133711}
{"thread":22377401060544,"logLevel":"diagnostic","timestamp":637783913925023646,"function":"dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.writeRepeatMask","state":"enter"}
{"thread":22377401060544,"logLevel":"diagnostic","state":"pre","command":["DBdump","workdir/reference.dam"],"timestamp":637783913925025791,"action":"execute","type":"pipe"}
{"thread":22377401060544,"logLevel":"diagnostic","state":"pre","command":["DBdump","workdir/reference.dam"],"timestamp":637783913925047965,"action":"execute","type":"pipe"}
{"thread":22377401060544,"logLevel":"diagnostic","timestamp":637783913925069691,"function":"dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.writeRepeatMask","state":"exit","timeElapsed":45797}
{"thread":22377401060544,"logLevel":"diagnostic","timestamp":637783913925069958,"function":"dentist.commands.maskRepetitiveRegions.RepeatMaskAssessor.run","state":"exit","timeElapsed":326260}

I've also tried this on the condaforge/mambaforge:4.11.0-0 container, but I get the same error.

Any insights are greatly appreciated!

Can gaps be filled with just extension reads?

Hello,

Can Dentist be configured to fill gaps with just extension reads (the purple and orange ones in fig) ? (e.g. by setting "min-spanning-reads" to 0)
[Screenshot: Supplementary Figure 1 from the DENTIST paper.]

Thanks,
Tim

Syntax issue in snakemake

Hi,

Apologies that I'm pasting screenshots here; I'm unable to log in through my workstation.

The following are the versions that I'm using:
Snakemake: 3.13.3
dentist v1.0.0-beta.1 (commit 9e049a9)

Issue:

[screenshot of the error]

The second screenshot shows the location of the same error in the Snakefile:

[screenshot]

I'd appreciate any help regarding the same

Damapper command fails

Hi,
thank you for dentist, it is an intriguing tool.

However, it is a bit annoying for me. I keep getting errors like the one below, without any useful information.

damapper: Command Failed:
              LAmerge  -a reference.reads.439 /cluster/work/users/olekto/tmp/damapper.81025/[email protected]

Or

LAmerge: Did not write all records to reference.reads.426 (8035)

damapper: Command Failed:
              LAmerge  -a reference.reads.426 /tmp/damapper.69457/[email protected]

I wondered if the tmp dir was too small, so I set it to a shared folder that can hold multiple terabytes; I doubt that is the issue anymore.

Some partitions go through fine; however, it is only a handful before one throws an error.

Is there a way to get more useful debugging information, so that I can see how the commands fail and address the issue?

Thank you.

Sincerely,
Ole

Snakefile calls to LAcheck in shell sections fail

In Snakefile, some of the shell command sections use "cd workdir" but then include "workdir" in the path arguments to LAcheck.

example output:
cd workdir
datander -l500 -e0.980100 '-T1' scaffolds_FINAL
if (( ${SKIP_LACHECK:-0} == 0 ))
then
LAcheck -v scaffolds_FINAL workdir/TAN.scaffolds_FINAL.las || echo Try setting the environment variable SKIP_LACHECK=1 if the error is Duplicate LAs; otherwise rerun this block
else
echo "Skipping LAcheck due to user request"
fi
} &> logs/tandem-alignment.log
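The mismatch can be reproduced in miniature. This sketch (an illustration with made-up temp paths, not project code) shows that after changing into workdir, a path that still carries the workdir/ prefix no longer resolves:

```python
import os
import pathlib
import tempfile

# Minimal reproduction of the path bug: create workdir/ with a .las
# file in it, cd into workdir (as the shell block does), and test both
# the prefixed and the relative path.
root = pathlib.Path(tempfile.mkdtemp())
(root / "workdir").mkdir()
(root / "workdir" / "TAN.scaffolds_FINAL.las").touch()

os.chdir(root / "workdir")
print(os.path.exists("workdir/TAN.scaffolds_FINAL.las"))  # False: double "workdir" prefix
print(os.path.exists("TAN.scaffolds_FINAL.las"))          # True: path relative to cwd
```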

md5checksum shows example dataset analysis fails

Hi, I've been trying to use dentist on the provided example dataset, but a number of the md5 checksums fail after it finishes running, with no other errors that I can find.

I installed snakemake v6.0.0 and singularity v3.6.3 through conda and ran through the example dataset as follows:

wget https://bds.mpi-cbg.de/hillerlab/DENTIST/dentist-example.v1.0.1.tar.gz
tar -xzf ./dentist-example.v1.0.1.tar.gz
cd dentist-example

# run the workflow
SKIP_LACHECK=1 snakemake --configfile=snakemake.yaml --use-singularity --cores=4 

# validate the files
md5sum -c checksum.md5

but the checksum output was as follows:

gap-closed.fasta: FAILED
workdir/.assembly-test.bps: OK
workdir/.assembly-test.dentist-reads.anno: OK
workdir/.assembly-test.dentist-reads.data: OK
workdir/.assembly-test.dentist-self.anno: OK
workdir/.assembly-test.dentist-self.data: OK
workdir/.assembly-test.dust.anno: OK
workdir/.assembly-test.dust.data: OK
workdir/.assembly-test.hdr: OK
workdir/.assembly-test.idx: OK
workdir/.assembly-test.tan.anno: OK
workdir/.assembly-test.tan.data: OK
workdir/.gap-closed-preliminary.bps: FAILED
workdir/.gap-closed-preliminary.dentist-self.anno: FAILED
workdir/.gap-closed-preliminary.dentist-self.data: FAILED
workdir/.gap-closed-preliminary.dentist-weak-coverage.anno: FAILED
workdir/.gap-closed-preliminary.dentist-weak-coverage.data: FAILED
workdir/.gap-closed-preliminary.dust.anno: FAILED
workdir/.gap-closed-preliminary.dust.data: FAILED
workdir/.gap-closed-preliminary.hdr: OK
workdir/.gap-closed-preliminary.idx: FAILED
workdir/.gap-closed-preliminary.tan.anno: FAILED
workdir/.gap-closed-preliminary.tan.data: FAILED
workdir/.reads.bps: OK
workdir/.reads.idx: OK
workdir/assembly-test.assembly-test.las: OK
workdir/assembly-test.dam: OK
workdir/assembly-test.reads.las: OK
workdir/gap-closed-preliminary.dam: FAILED
workdir/gap-closed-preliminary.fasta: FAILED
workdir/gap-closed-preliminary.gap-closed-preliminary.las: FAILED
workdir/gap-closed-preliminary.reads.las: FAILED
workdir/reads.db: OK
md5sum: WARNING: 15 computed checksums did NOT match

any advice on how to get the example dataset running would be greatly appreciated,
Thanks,
Rishi

md5checksum differences between nodes

Hi Arne,

I hope you are well. First of all, congrats on the very exciting tool; I'm really looking forward to trying it out on my ultra-long Nanopore data!

However, I'm having a bit of trouble getting dentist set up. For starters, the example dataset only gets the md5 checksums right when I run it on a specific high memory node and not in any of our other computing nodes. In addition, this only works when used together with your singularity image and not with the bundled binaries in the bin folder of the example dataset.

Also, when I run it through slurm, there's yet another problem: it reproducibly stops right after the collect checkpoint and I need to run it again to finish the rest of the process. I'm not sure whether this might be due to the first issue, since I haven't been able to queue it to that specific node, but I'm in the process of checking.

Any ideas about where the problem might lie?

Thanks!!

Successful node (high memory)

[diego.terrones@clip-m1-0 dentist-example]$ md5sum -c checksum.md5
gap-closed.closed-gaps.bed: OK
gap-closed.fasta: OK
workdir/.assembly-test.bps: OK
workdir/.assembly-test.dentist-reads-H.anno: OK
workdir/.assembly-test.dentist-reads-H.data: OK
workdir/.assembly-test.dentist-reads.anno: OK
workdir/.assembly-test.dentist-reads.data: OK
workdir/.assembly-test.dentist-self-H.anno: OK
workdir/.assembly-test.dentist-self-H.data: OK
workdir/.assembly-test.dentist-self.anno: OK
workdir/.assembly-test.dentist-self.data: OK
workdir/.assembly-test.dust.anno: OK
workdir/.assembly-test.dust.data: OK
workdir/.assembly-test.hdr: OK
workdir/.assembly-test.idx: OK
workdir/.assembly-test.tan-H.anno: OK
workdir/.assembly-test.tan-H.data: OK
workdir/.assembly-test.tan.anno: OK
workdir/.assembly-test.tan.data: OK
workdir/.gap-closed-preliminary.bps: OK
workdir/.gap-closed-preliminary.closed-gaps.anno: OK
workdir/.gap-closed-preliminary.closed-gaps.data: OK
workdir/.gap-closed-preliminary.dentist-self.anno: OK
workdir/.gap-closed-preliminary.dentist-self.data: OK
workdir/.gap-closed-preliminary.dentist-weak-coverage.anno: OK
workdir/.gap-closed-preliminary.dentist-weak-coverage.data: OK
workdir/.gap-closed-preliminary.dust.anno: OK
workdir/.gap-closed-preliminary.dust.data: OK
workdir/.gap-closed-preliminary.hdr: OK
workdir/.gap-closed-preliminary.idx: OK
workdir/.gap-closed-preliminary.tan.anno: OK
workdir/.gap-closed-preliminary.tan.data: OK
workdir/.reads.bps: OK
workdir/.reads.dentist-reads-10B.anno: OK
workdir/.reads.dentist-reads-10B.data: OK
workdir/.reads.dentist-reads-11B.anno: OK
workdir/.reads.dentist-reads-11B.data: OK
workdir/.reads.dentist-reads-12B.anno: OK
workdir/.reads.dentist-reads-12B.data: OK
workdir/.reads.dentist-reads-1B.anno: OK
workdir/.reads.dentist-reads-1B.data: OK
workdir/.reads.dentist-reads-2B.anno: OK
workdir/.reads.dentist-reads-2B.data: OK
workdir/.reads.dentist-reads-3B.anno: OK
workdir/.reads.dentist-reads-3B.data: OK
workdir/.reads.dentist-reads-4B.anno: OK
workdir/.reads.dentist-reads-4B.data: OK
workdir/.reads.dentist-reads-5B.anno: OK
workdir/.reads.dentist-reads-5B.data: OK
workdir/.reads.dentist-reads-6B.anno: OK
workdir/.reads.dentist-reads-6B.data: OK
workdir/.reads.dentist-reads-7B.anno: OK
workdir/.reads.dentist-reads-7B.data: OK
workdir/.reads.dentist-reads-8B.anno: OK
workdir/.reads.dentist-reads-8B.data: OK
workdir/.reads.dentist-reads-9B.anno: OK
workdir/.reads.dentist-reads-9B.data: OK
workdir/.reads.dentist-self-10B.anno: OK
workdir/.reads.dentist-self-10B.data: OK
workdir/.reads.dentist-self-11B.anno: OK
workdir/.reads.dentist-self-11B.data: OK
workdir/.reads.dentist-self-12B.anno: OK
workdir/.reads.dentist-self-12B.data: OK
workdir/.reads.dentist-self-1B.anno: OK
workdir/.reads.dentist-self-1B.data: OK
workdir/.reads.dentist-self-2B.anno: OK
workdir/.reads.dentist-self-2B.data: OK
workdir/.reads.dentist-self-3B.anno: OK
workdir/.reads.dentist-self-3B.data: OK
workdir/.reads.dentist-self-4B.anno: OK
workdir/.reads.dentist-self-4B.data: OK
workdir/.reads.dentist-self-5B.anno: OK
workdir/.reads.dentist-self-5B.data: OK
workdir/.reads.dentist-self-6B.anno: OK
workdir/.reads.dentist-self-6B.data: OK
workdir/.reads.dentist-self-7B.anno: OK
workdir/.reads.dentist-self-7B.data: OK
workdir/.reads.dentist-self-8B.anno: OK
workdir/.reads.dentist-self-8B.data: OK
workdir/.reads.dentist-self-9B.anno: OK
workdir/.reads.dentist-self-9B.data: OK
workdir/.reads.idx: OK
workdir/.reads.tan-10B.anno: OK
workdir/.reads.tan-10B.data: OK
workdir/.reads.tan-11B.anno: OK
workdir/.reads.tan-11B.data: OK
workdir/.reads.tan-12B.anno: OK
workdir/.reads.tan-12B.data: OK
workdir/.reads.tan-1B.anno: OK
workdir/.reads.tan-1B.data: OK
workdir/.reads.tan-2B.anno: OK
workdir/.reads.tan-2B.data: OK
workdir/.reads.tan-3B.anno: OK
workdir/.reads.tan-3B.data: OK
workdir/.reads.tan-4B.anno: OK
workdir/.reads.tan-4B.data: OK
workdir/.reads.tan-5B.anno: OK
workdir/.reads.tan-5B.data: OK
workdir/.reads.tan-6B.anno: OK
workdir/.reads.tan-6B.data: OK
workdir/.reads.tan-7B.anno: OK
workdir/.reads.tan-7B.data: OK
workdir/.reads.tan-8B.anno: OK
workdir/.reads.tan-8B.data: OK
workdir/.reads.tan-9B.anno: OK
workdir/.reads.tan-9B.data: OK
workdir/assembly-test.dam: OK
workdir/gap-closed-preliminary.dam: OK
workdir/gap-closed-preliminary.fasta: OK
workdir/reads.db: OK
workdir/validation-report.json: OK

Config

[diego.terrones@clip-m1-0 dentist-example]$ lsb_release -a
LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.9.2009 (Core)
Release:	7.9.2009
Codename:	Core
[diego.terrones@clip-m1-0 dentist-example]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           1.9T        943G        938G         89M         55G        991G
Swap:            0B          0B          0B

All other nodes

[diego.terrones@clip-c2-2 dentist-example]$ md5sum -c checksum.md5
gap-closed.closed-gaps.bed: FAILED
gap-closed.fasta: FAILED
workdir/.assembly-test.bps: OK
workdir/.assembly-test.dentist-reads-H.anno: OK
workdir/.assembly-test.dentist-reads-H.data: OK
workdir/.assembly-test.dentist-reads.anno: OK
workdir/.assembly-test.dentist-reads.data: OK
workdir/.assembly-test.dentist-self-H.anno: OK
workdir/.assembly-test.dentist-self-H.data: OK
workdir/.assembly-test.dentist-self.anno: OK
workdir/.assembly-test.dentist-self.data: OK
workdir/.assembly-test.dust.anno: OK
workdir/.assembly-test.dust.data: OK
workdir/.assembly-test.hdr: OK
workdir/.assembly-test.idx: OK
workdir/.assembly-test.tan-H.anno: OK
workdir/.assembly-test.tan-H.data: OK
workdir/.assembly-test.tan.anno: OK
workdir/.assembly-test.tan.data: OK
workdir/.gap-closed-preliminary.bps: FAILED
workdir/.gap-closed-preliminary.closed-gaps.anno: FAILED
workdir/.gap-closed-preliminary.closed-gaps.data: FAILED
workdir/.gap-closed-preliminary.dentist-self.anno: FAILED
workdir/.gap-closed-preliminary.dentist-self.data: FAILED
workdir/.gap-closed-preliminary.dentist-weak-coverage.anno: FAILED
workdir/.gap-closed-preliminary.dentist-weak-coverage.data: OK
workdir/.gap-closed-preliminary.dust.anno: FAILED
workdir/.gap-closed-preliminary.dust.data: FAILED
workdir/.gap-closed-preliminary.hdr: OK
workdir/.gap-closed-preliminary.idx: FAILED
workdir/.gap-closed-preliminary.tan.anno: FAILED
workdir/.gap-closed-preliminary.tan.data: FAILED
workdir/.reads.bps: OK
workdir/.reads.dentist-reads-10B.anno: OK
workdir/.reads.dentist-reads-10B.data: OK
workdir/.reads.dentist-reads-11B.anno: OK
workdir/.reads.dentist-reads-11B.data: OK
workdir/.reads.dentist-reads-12B.anno: OK
workdir/.reads.dentist-reads-12B.data: OK
workdir/.reads.dentist-reads-1B.anno: OK
workdir/.reads.dentist-reads-1B.data: OK
workdir/.reads.dentist-reads-2B.anno: OK
workdir/.reads.dentist-reads-2B.data: OK
workdir/.reads.dentist-reads-3B.anno: OK
workdir/.reads.dentist-reads-3B.data: OK
workdir/.reads.dentist-reads-4B.anno: OK
workdir/.reads.dentist-reads-4B.data: OK
workdir/.reads.dentist-reads-5B.anno: OK
workdir/.reads.dentist-reads-5B.data: OK
workdir/.reads.dentist-reads-6B.anno: OK
workdir/.reads.dentist-reads-6B.data: OK
workdir/.reads.dentist-reads-7B.anno: OK
workdir/.reads.dentist-reads-7B.data: OK
workdir/.reads.dentist-reads-8B.anno: OK
workdir/.reads.dentist-reads-8B.data: OK
workdir/.reads.dentist-reads-9B.anno: OK
workdir/.reads.dentist-reads-9B.data: OK
workdir/.reads.dentist-self-10B.anno: OK
workdir/.reads.dentist-self-10B.data: OK
workdir/.reads.dentist-self-11B.anno: OK
workdir/.reads.dentist-self-11B.data: OK
workdir/.reads.dentist-self-12B.anno: OK
workdir/.reads.dentist-self-12B.data: OK
workdir/.reads.dentist-self-1B.anno: OK
workdir/.reads.dentist-self-1B.data: OK
workdir/.reads.dentist-self-2B.anno: OK
workdir/.reads.dentist-self-2B.data: OK
workdir/.reads.dentist-self-3B.anno: OK
workdir/.reads.dentist-self-3B.data: OK
workdir/.reads.dentist-self-4B.anno: OK
workdir/.reads.dentist-self-4B.data: OK
workdir/.reads.dentist-self-5B.anno: OK
workdir/.reads.dentist-self-5B.data: OK
workdir/.reads.dentist-self-6B.anno: OK
workdir/.reads.dentist-self-6B.data: OK
workdir/.reads.dentist-self-7B.anno: OK
workdir/.reads.dentist-self-7B.data: OK
workdir/.reads.dentist-self-8B.anno: OK
workdir/.reads.dentist-self-8B.data: OK
workdir/.reads.dentist-self-9B.anno: OK
workdir/.reads.dentist-self-9B.data: OK
workdir/.reads.idx: OK
workdir/.reads.tan-10B.anno: OK
workdir/.reads.tan-10B.data: OK
workdir/.reads.tan-11B.anno: OK
workdir/.reads.tan-11B.data: OK
workdir/.reads.tan-12B.anno: OK
workdir/.reads.tan-12B.data: OK
workdir/.reads.tan-1B.anno: OK
workdir/.reads.tan-1B.data: OK
workdir/.reads.tan-2B.anno: OK
workdir/.reads.tan-2B.data: OK
workdir/.reads.tan-3B.anno: OK
workdir/.reads.tan-3B.data: OK
workdir/.reads.tan-4B.anno: OK
workdir/.reads.tan-4B.data: OK
workdir/.reads.tan-5B.anno: OK
workdir/.reads.tan-5B.data: OK
workdir/.reads.tan-6B.anno: OK
workdir/.reads.tan-6B.data: OK
workdir/.reads.tan-7B.anno: OK
workdir/.reads.tan-7B.data: OK
workdir/.reads.tan-8B.anno: OK
workdir/.reads.tan-8B.data: OK
workdir/.reads.tan-9B.anno: OK
workdir/.reads.tan-9B.data: OK
workdir/assembly-test.dam: OK
workdir/gap-closed-preliminary.dam: FAILED
workdir/gap-closed-preliminary.fasta: FAILED
workdir/reads.db: OK
workdir/validation-report.json: OK
md5sum: WARNING: 15 computed checksums did NOT match

Could not create pipe to check startup of child (Too many open files)

Hi,

I am using ONT data and I got the error in the title from the pileupCollector (in collect.log).
I tried to increase the number of allowed open files with ulimit -n 100000 and I set batch_size: 100 in the snakemake.yml file, but this did not solve the problem. Any advice on how to solve this?
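One thing worth ruling out (an editorial suggestion, not from the thread): limits set with ulimit in an interactive shell are not always inherited by scheduler jobs, so it can help to print and raise the limit from inside the job itself. A minimal sketch:

```python
import resource

# Soft/hard limits on open file descriptors for this process. A
# `ulimit -n` issued in an interactive shell does not necessarily
# reach jobs submitted through a scheduler, so check inside the job.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft} hard={hard}")

# Raise the soft limit up to the hard limit (always permitted for an
# unprivileged process); raising the hard limit itself needs root.
if soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```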

Thanks a lot!

Nanopore reads

Heya,

I was just wondering if this could be utilised on Nanopore assemblies, and if so, what parameter I should specify for READ_TYPE in the snakemake.yml file.

Many thanks,
Ann

`subprocess.run` got an unexpected keyword argument 'text'

I am trying to run DENTIST installed with conda on a Linux machine (mamba and snakemake also installed in the same env as DENTIST).
My working directory contains 5 files and 2 directories:

short_read_assembly_contig.fa
PacBio_reads.fastq.gz 
Snakefile
snakemake.yml 
dentist.yml
envs/
scripts/

I have not modified the dentist.yml file, and the only modifications I made to snakemake.yml are file names.

When I try to run it with the following command - snakemake --configfile=snakemake.yml --use-conda --cores=16, I get this error:

SyntaxError in line 930 of /ANALYSIS1/SRS_genome_asm/Snakefile:
Unexpected keyword container in rule definition (Snakefile, line 930)

What can be causing this?

Setting dentist parameters

When setting the parameters below, do they need to be included in the dentist.json config file? And if so, in which section?

--max-insertion-error
--min-anchor-length
--min-reads-per-pile-up
--min-spanning-reads
--allow-single-reads
--join-policy

'Wildcards' object has no attribute 'memory'

Hello,

I have been attempting to get dentist to complete on a cluster managed by slurm. Dentist fails with this message:

WorkflowError in line 562 of /lustre/project/gbru/gbru_X/RawData/dentist/Snakefile:
'Wildcards' object has no attribute 'memory'
  File "/software/7/apps/snakemake/5.25.0/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 111, in run_jobs
  File "/software/7/apps/snakemake/5.25.0/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 920, in run

I am worried that dentist/snakemake is not "seeing" the cluster.yml file which lists memory for the workflow. How can I fix this so that I can run dentist to completion?

My stderr, script, config.yaml, cluster.yml and snakemake.yml files are attached.
stderr.txt
Dentist_Script.txt
config.yaml.txt
cluster.txt
snakemake.txt

no gap-filling in "scaffold" mode?

Hi again!

I have now run dentist successfully a few times and tested the different join policies, using as input the raw contig assembly or a scaffold assembly, scaffolded with LRscaf.

I get the best N50 running dentist with join-policy: scaffolds on the already scaffolded assembly (13.5 Mb). However, the final gap-closed.fasta contains almost as many Ns (186183) as the input assembly reference.fasta (186185).

When running with join-policy: contigs, neither the reference.fasta nor the gap-closed.fasta contain any Ns.

Is it intended that gaps are not closed in join-policy: scaffolds mode? Do I have to run dentist a second time with join-policy: scaffoldGaps to actually close the gaps in the 13.5 Mb assembly?

As far as I can tell, all dentist runs finished without any errors.

Here are some stats of the different assemblies:

join-policy: scaffolds

file              format  type  num_seqs      sum_len  min_len      avg_len     max_len     Q1      Q2         Q3  sum_gap         N50  Q20(%)  Q30(%)
gap-closed.fasta  FASTA   DNA        117  340,348,369    1,843  2,908,960.4  28,227,288  9,395  77,974  2,243,295        0  13,556,940       0       0
reference.fasta   FASTA   DNA        122  340,337,486      940  2,789,651.5  28,227,288  8,609  89,797  2,243,295        0  12,859,076       0       0

join-policy: contigs

file              format  type  num_seqs      sum_len  min_len      avg_len     max_len     Q1      Q2         Q3  sum_gap         N50  Q20(%)  Q30(%)
gap-closed.fasta  FASTA   DNA        161  303,401,106    1,069  1,884,478.9  28,227,288  7,516  31,549    988,561        0  10,239,100       0       0
reference.fasta   FASTA   DNA        171  303,403,113      520  1,774,287.2  25,232,339  7,314  31,549  971,039.5        0   9,931,962       0       0

Failed to run with test data set: TypeError in line 45 of Snakefile

(base) ubuntu@bio-xanthomonas:~/dentist-example$ snakemake --configfile=snakemake.yml --use-singularity --cores=all
Pre-fetching singularity image...
TypeError in line 45 of /home/djs217/dentist-example/Snakefile:
__init__() got an unexpected keyword argument 'is_containerized'
  File "/home/djs217/dentist-example/Snakefile", line 745, in <module>
  File "/home/djs217/dentist-example/Snakefile", line 45, in prefetch_singularity_image

(base) ubuntu@bio-xanthomonas:~/dentist-example$ snakemake --version
5.32.1

dentist collect v1.0.0-beta.3 requires DAScover from DASCRUBBER which failed to compile.

dentist collect v1.0.0-beta.3 requires DAScover from DASCRUBBER. DAScover.c failed to compile.

DAScover.c:119:27: error: ‘DB_CSS’ undeclared (first use in this function)
if ((Reads[j].flags & DB_CSS) == 0)

(snakemake) [randy.bradley@ceres dentist-test]$ cat logs/collect.log
Error: missing external tools:

Check your PATH and/or install the required software.

DALIGNER version

Hi,

I'm getting a lot of LAcheck errors when I try running dentist (via singularity).

Your Snakefile reports the following message in the logs

Duplicate LAs: can be fixed by LAsort from 2020-03-22 or later.

But your dependencies section in the README has DALIGNER (=2020-01-15), i.e. git commit c2b47da6b3c94ed248a6be395c5b96a4e63b3f63 (which is used in your docker recipe)

Is dentist tied to DALIGNER version 2020-01-15 or can the bug-fixed version from 2020-03-22 be used?

Rerun after stop due to time limit

Hello,

I ran the snakemake pipeline on a slurm cluster in a single job using sbatch.
It stopped due to the time limit and now, when I try to rerun it, it just lists a bunch of jobs and stops again without a clear explanation.
Any idea what is going on?

Here is my script

snakemake --configfile=snakemake.yml --use-conda --cores=all --rerun-incomplete --unlock
snakemake --configfile=snakemake.yml --use-conda --cores=all --rerun-incomplete

Adding haplotigs back for gap filling?

Dear Arne,

Thank you again for your help with setting up dentist on our cluster. I have one more conceptual question before giving it a try which may be better asked in a separate thread:

I am trying to use dentist to fill gaps in a PacBio assembly after Hi-C scaffolding. The sequenced plant was highly heterozygous, and the initial assembly contained both contigs and haplotigs, which represent alternate haplotypes at heterozygous loci. I purged all the haplotigs before scaffolding, so the current assembly is a haploid representation, while the raw PacBio reads to be mapped for gap filling are obviously from the diploid genome.

I am thus wondering whether this could lead to incorrect gap filling if reads from both haplotypes map to the same assembly gap. Would it be best to add the haplotigs again for gap filling?

I assume that the haplotigs could be removed again after gap filling as long as the option to merge contigs is not enabled?

Best,
Roman

Singularity issue

Hi,

first time using singularity. Is this supposed to happen, or am I supposed to do something else?

$ singularity --debug shell docker://aludi/dentist:latest
DEBUG   [U=2001,P=20764]   persistentPreRun()            Singularity version: 3.6.3
DEBUG   [U=2001,P=20764]   persistentPreRun()            Parsing configuration file /ceph/users/dlaetsch/.conda/envs/singularity/etc/singularity/singularity.conf
DEBUG   [U=2001,P=20764]   handleConfDir()               /ceph/users/dlaetsch/.singularity already exists. Not creating.
DEBUG   [U=2001,P=20764]   getCacheParentDir()           environment variable SINGULARITY_CACHEDIR not set, using default image cache
DEBUG   [U=2001,P=20764]   parseURI()                    Parsing docker://aludi/dentist:latest into reference
FATAL   [U=2001,P=20764]   replaceURIWithImage()         Unable to handle docker://aludi/dentist:latest uri: failed to get checksum for docker://aludi/dentist:latest: Error reading manifest latest in docker.io/aludi/dentist: manifest unknown: manifest unknown

cheers,

dom

Example fails with Colon expected after keyword global_conda. (Snakefile, line 103)

I am trying to run the example on a fresh Ubuntu 22.04 install. I ran the following:

# install Dentist
mamba create -n dentist -c a_ludi -c bioconda dentist-core
mamba activate dentist
mamba install -c conda-forge -c bioconda snakemake

# get example data
wget https://github.com/a-ludi/dentist/releases/download/v4.0.0/dentist-example.tar.gz
tar -xzf dentist-example.tar.gz
cd dentist-example

# run the workflow
PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --cores=all

I get the error:

Colon expected after keyword global_conda. (Snakefile, line 103)

Any suggestions?

Error in rule `collect`

Hi,

Following #16, I tried the pipeline with SKIP_LACHEK=1, and now I get this error:

Error in rule collect:
    jobid: 5
    output: workdir/pile-ups.db
    log: logs/collect.log (check log file(s) for error message)
    shell:
        dentist collect --config=dentist.json  --threads=4 --auxiliary-threads=2 --mask=dentist-self-H,tan-H,dentist-reads-H workdir/scaffolds_FINAL.dam workdir/non-hifi.1kb.db workdir/scaffolds_FINAL.non-hifi.1kb.las workdir/pile-ups.db 2> logs/collect.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: 370318 logs/cluster/collect/unique/jobid5_4e09f197-d4f2-4b34-83cf-bac0967aa03c.out

Error executing rule collect on cluster (jobid: 5, external: 370318 logs/cluster/collect/unique/jobid5_4e09f197-d4f2-4b34-83cf-bac0967aa03c.out, jobscript: /lustre/scratch116/tol/teams/team308/users/mm49/tmp/non-hifi-reads2/.snakemake/tmp.vs0le48c/snakejob.collect.5.sh). For error details see the cluster log and the log files of the involved rule(s).

logs/collect.log is empty. The cluster log logs/cluster/collect/unique/jobid5_4e09f197-d4f2-4b34-83cf-bac0967aa03c.out seems to contain the output of a Snakemake pipeline that has this error:

Error in rule propagate_mask_back_to_reference_block:
    jobid: 946
    output: workdir/.scaffolds_FINAL.dentist-self-H-257B.anno, workdir/.scaffolds_FINAL.dentist-self-H-257B.data
    log: logs/propagate-mask-back-to-reference-block.dentist-self.257.log (check log file(s) for error message)
    shell:
        dentist propagate-mask --config=dentist.json  -m dentist-self-257B workdir/non-hifi.1kb.db workdir/scaffolds_FINAL.dam workdir/non-hifi.1kb.257.scaffolds_FINAL.las dentist-self-H-257B 2> logs/propagate-mask-back-to-reference-block.dentist-self.257.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

And logs/propagate-mask-back-to-reference-block.dentist-self.257.log has this error:

core.exception.AssertError@/etc/../usr/include/dmd/phobos/std/range/primitives.d(2434): Attempting to fetch the front of an empty array of FlatLocalAlignment
----------------
??:? _d_assert_msg [0x55a59fc89657]
??:? dentist.common.alignments.base.FlatLocalAlignment[] std.algorithm.mutation.copy!(std.algorithm.iteration.ChunkByChunkImpl!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda1, dentist.dazzler.LocalAlignmentReader).ChunkByChunkImpl, dentist.common.alignments.base.FlatLocalAlignment[]).copy(std.algorithm.iteration.ChunkByChunkImpl!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda1, dentist.dazzler.LocalAlignmentReader).ChunkByChunkImpl, dentist.common.alignments.base.FlatLocalAlignment[]) [0x55a59f7ae2e8]
??:? dentist.common.alignments.base.FlatLocalAlignment[] dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().bufferChunks!(std.algorithm.iteration.ChunkByChunkImpl!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda1, dentist.dazzler.LocalAlignmentReader).ChunkByChunkImpl).bufferChunks(std.algorithm.iteration.ChunkByChunkImpl!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda1, dentist.dazzler.LocalAlignmentReader).ChunkByChunkImpl, ulong) [0x55a59f8e93a0]
??:? dentist.common.alignments.base.FlatLocalAlignment[] dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda2!(std.algorithm.iteration.ChunkByChunkImpl!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda1, dentist.dazzler.LocalAlignmentReader).ChunkByChunkImpl).__lambda2(std.algorithm.iteration.ChunkByChunkImpl!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda1, dentist.dazzler.LocalAlignmentReader).ChunkByChunkImpl) [0x55a59f8e932d]
??:? @property dentist.common.alignments.base.FlatLocalAlignment[] std.algorithm.iteration.MapResult!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda2, std.algorithm.iteration.ChunkByImpl!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda1, dentist.dazzler.LocalAlignmentReader).ChunkByImpl).MapResult.front() [0x55a59f8e94c8]
??:? @property dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[] std.algorithm.iteration.MapResult!(dentist.commands.propagateMask.MaskPropagator.run().__lambda1, std.algorithm.iteration.MapResult!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda2, std.algorithm.iteration.ChunkByImpl!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda1, dentist.dazzler.LocalAlignmentReader).ChunkByImpl).MapResult).MapResult.front() [0x55a59f8e97be]
??:? dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[][] std.algorithm.mutation.copy!(std.algorithm.iteration.MapResult!(dentist.commands.propagateMask.MaskPropagator.run().__lambda1, std.algorithm.iteration.MapResult!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda2, std.algorithm.iteration.ChunkByImpl!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda1, dentist.dazzler.LocalAlignmentReader).ChunkByImpl).MapResult).MapResult, dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[][]).copy(std.algorithm.iteration.MapResult!(dentist.commands.propagateMask.MaskPropagator.run().__lambda1, std.algorithm.iteration.MapResult!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda2, std.algorithm.iteration.ChunkByImpl!(dentist.commands.propagateMask.MaskPropagator.getLocalAlignmentsByContig().__lambda1, dentist.dazzler.LocalAlignmentReader).ChunkByImpl).MapResult).MapResult, dentist.util.region.Region!(ulong, ulong, "contigId", 0uL).Region.TaggedInterval[][]) [0x55a59f7ae725]
??:? void dentist.commands.propagateMask.MaskPropagator.run() [0x55a59f8e78ec]
??:? dentist.commandline.ReturnCode dentist.commandline.runCommand!(3).runCommand(in immutable(char)[][]) [0x55a59f816cbf]
??:? dentist.commandline.ReturnCode dentist.commandline.run(in immutable(char)[][]) [0x55a59f7e2e98]
??:? _Dmain [0x55a59f673704]

Full log: propagate-mask-back-to-reference-block.dentist-self.257.log
