smortezah / smashpp

Find and visualize rearrangements in DNA sequences

License: GNU General Public License v3.0

C++ 99.85% CMake 0.09% Shell 0.03% Dockerfile 0.02%

high-throughput-sequencing alignment-free genome-compression dna-sequences data-visualization genome-rearrangment

smashpp's Issues

Segmentation fault during reference-free compression

Hi, I'm trying to run smash++ on a Mac (10.15.7, installed using homebrew), on two fairly small, fairly fragmented eukaryotic genomes (around 24MB each, 287 contigs and 501 contigs respectively, N50 180.5kb for the first one).

The test run I did using the ref and tar files in example/ worked fine, but with my two genomes I get a segmentation fault during the Ref-free compression of segment 4 (or in one case, which I haven't been able to replicate, segment 5). This happens regardless of whether I specify any options, or different values of -m or -l. The output files in all cases include .seq files for both genomes, and a series of sequentially labelled files named 0.genome1.fasta.genome2.fasta.s[number]

Is this likely to be a problem with the command/parameters I'm using, or a property of my data, or of my system? Thanks!

Recommendation for eukaryotic species

Hi,

Are there any recommendations for eukaryotic species?

I am currently comparing two highly similar eukaryotic genome sequences, but I get no synteny and no rearrangements at all.

wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget http://ftp.ensembl.org/pub/current_fasta/pan_troglodytes/dna/Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa.gz
gunzip Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa.gz
smashpp -n 32 -m 5000 -f 10000 -fs L -r Homo_sapiens.GRCh38.dna.primary_assembly.fa -t Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa

The results are empty, but I would expect to see some differences between human and chimp.

====[ PREPARE DATA ]==================================
[+] Homo_sapiens.GRCh38.dna.primary_assembly.fa (FASTA) -> Homo_sapiens.GRCh38.dna.primary_assembly.seq (seq) finished.
[+] Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa (FASTA) -> Pan_troglodytes.Pan_tro_3.0.dna.toplevel.seq (seq) finished.

====[ REGULAR MODE ]==================================
[+] Creating model of Homo_sapiens.GRCh38.dna.primary_assembly.fa done.
[+] Filtering Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa done => 0 segments

====[ INVERTED MODE ]=================================
[+] Creating model of Homo_sapiens.GRCh38.dna.primary_assembly.fa done.
[+] Filtering Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa done => 0 segments

Thank you in anticipation

Best regards

Kristian

Paired-end sequences

I am working with paired-end (PE) sequence pairs, and I want to find the rearrangements. How can I process such data with your tool?
Thanks a lot!

WSL installation

Trying to install in WSL on a Win10 64bit professional OS
dougrhoads@ARSC-GP50LH2:$ git clone https://github.com/smortezah/smashpp.git
dougrhoads@ARSC-GP50LH2:$ cd smashpp
dougrhoads@ARSC-GP50LH2:/smashpp$ ./install.sh
-bash: ./install.sh: Permission denied
(base) dougrhoads@ARSC-GP50LH2:/smashpp$ sudo ./install.sh
[sudo] password for dougrhoads:
sudo: ./install.sh: command not found

Any suggestions?

[Question] interpreting smashpp output files

Hey @smortezah ,

I'm having trouble understanding smashpp's output. It produces an output file of the form:

{
    "watermark": "##SMASH++",
    "parameters": "<-r refrence.fa -t target.fa>",
    "reference": "refrence.fa",
    "reference_size": "1653432",
    "target": "target.fa",
    "target_size": "1016",
    "positions": [
        {
            "reference_begin": "321300",
            "reference_end": "322200",
            "reference_relative_redundancy": "0.7155",
            "reference_redundancy": "2.0000",
            "target_begin": "1016",
            "target_end": "180",
            "target_relative_redundancy": "0.6232",
            "target_redundancy": "2.0000",
            "inverted": "T",
        }
    ]
}

which corresponds to a target sequence of length 1000 like this:

>target
ATGAATCCAAATCAAATACTTGAAAATTTAAAAAAAGAATTAAGTGAAAACGAATACGAAAATTATATCGCTATCTTAAAATTTAACGAAAAACAAAGCAAAGCAGATTTTCTAGTCTTTAACGCTCCTAATGAGCTTTTAGCCAAATTCATACAAACAAAATACGGTAAAAAAATTTCACATTTTTATGAAGTACAAAGCGGAAATAAAGCGAGCGTTTTGATACAAGCACAAAGCCAAAAACATAGTAGCAAAAGCACTAAAATCGATATCGCTCACATCAAGGCGCAAAGTACGATTTTAAATCCTTCTTTTACTTTTGAAAGCTTTGTAGTGGGGGATTCTAACAAATACGCTTATGGAGCTTGTAAAGCTATCTCACAAAAAGACAAACTGGGAAAACTTTATAATCCTATCTTTATCTATGGGCCTACAGGGCTTGGAAAAACGCACTTGCTTCAAGCTGTGGGAAATGCAAGTTTGGAAATGGGAAAAAAAGTGATTTATGCTACGAGTGAAAATTTTATCAATGATTTTACTTCAAATTTAAAAAATGGCTCTTTAGATAAATTTCACGAAAAATATAGAAATTGTGATGTTTTACTCATAGATGATGTGCAATTTTTAGGAAAAACAGACAAAATTCAAGAAGAGTTTTTCTTTATATTTAATGAGATTAAAAATAATGATGGGCAAATTATCATGACAAGTGATAACCCACCCAATATGCTAAAAGGTATCACCGAACGCTTAAAAAGTCGTTTTGCTCATGGTATCATAGCAGATATCACTCCACCTCAACTGGATACAAAAATAGCCATCATACGCAAAAAATGCGAATTTAATGATATCAATCTTTCTAATGATATCATTAATTATATCGCCACTTCTTTAGGGGATAATATAAGAGAAATAGAAGGTATCATCATAAGCTTAAATGCTTATGCTAACATACTTGGACAAGAAATCACCCTTGAGCTAGCAAAAAGCGTGATGAAAG

Could you help me understand what the fields "target_begin" and "target_end" point to?

It confuses me that target_begin is greater than the length of the sequence.

How can I convert these fields into actual coordinates on the target sequence?

What I currently do is:

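from Bio import SeqIO

# seq_dict is one entry of the "positions" list shown above.
# target_seq and reference_seq are SeqRecord objects loaded with SeqIO
# (the loading code is not part of this excerpt).
target_start = int(seq_dict["target_begin"])
target_end = int(seq_dict["target_end"])

reference_start = int(seq_dict["reference_begin"])
reference_end = int(seq_dict["reference_end"])

# For inverted matches the begin is larger than the end, so swap them.
if target_start > target_end:
    target_start, target_end = target_end, target_start

inverted = seq_dict["inverted"] == "T"

assert target_start < target_end

# Slice both records using the reported positions as 0-based offsets.
target_subseq = target_seq[target_start:target_end]
reference_subseq = reference_seq[reference_start:reference_end]

# Reverse-complement the target slice when the match is inverted.
if inverted:
    target_subseq = target_subseq.reverse_complement(id=target_seq.id, name=target_seq.name, description=target_seq.description)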

But the resulting subsequences don't seem to make sense.
It seems that the headers of both target.fasta and refrence.fasta shift the positions in the sequences (if I change the headers, I get different output).

Can you help me recover the actual sequences that smashpp found?

thank you!
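
The suspicion above, that the FASTA header shifts the reported positions, can be checked by counting how many characters of a file are header, line breaks, and sequence. This is a minimal sketch under assumptions: `fasta_byte_layout` is a hypothetical helper and not part of smashpp, and whether smashpp includes header or newline characters in its sizes is not confirmed anywhere in this thread.

```python
def fasta_byte_layout(text):
    """Count header, newline, and sequence characters in FASTA text."""
    header = newline = sequence = 0
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        # Whatever rstrip removed is the line terminator.
        newline += len(line) - len(body)
        if body.startswith(">"):
            header += len(body)
        else:
            sequence += len(body)
    return {"header": header, "newline": newline, "sequence": sequence}


# Tiny made-up record, only to show the shape of the result.
layout = fasta_byte_layout(">target\nACGT\nAC\n")
print(layout)
```

If `sequence` matches the length you expect while the reported "target_size" is larger, the difference may correspond to the header and line breaks, which would support the offset hypothesis.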

Errors generated during Installation on M1 MacBookPro

Hi, I keep getting the following message when I try to install:

2 errors generated.
make[2]: *** [CMakeFiles/smashpp.dir/main.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/smashpp.dir/all] Error 2
make: *** [all] Error 2

Any suggestions on how to resolve this?

About the multiple threads

Hi Morteza!
I really like smash++. The figures are amazing.
I have a question about the -tr parameter. When I set it to a value greater than 1, such as 10 or 20, or leave it at the default (4), the software still seems to use a single thread. My server has 30 cores.

Song

Installation quits: clang not recognizing flags

Trying to install smashpp on MacOSX Mojave 10.14.6 (18G2022). Git (2.26.0) and homebrew (2.2.10) installed already.

clang --version returns:-
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Here's the output from running "sudo sh install.sh" or "sudo bash install.sh". The ./ method doesn't work for some reason.
-- The C compiler identification is AppleClang 10.0.1.10010046
-- The CXX compiler identification is AppleClang 10.0.1.10010046
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++
-- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)
-- Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES)
-- Could NOT find OpenMP (missing: OpenMP_C_FOUND OpenMP_CXX_FOUND)
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/sukrit/smashpp/build
Scanning dependencies of target smashpp
[ 14%] Building CXX object CMakeFiles/smashpp.dir/fcm.cpp.o
[ 14%] Building CXX object CMakeFiles/smashpp.dir/logtbl8.cpp.o
[ 35%] Building CXX object CMakeFiles/smashpp.dir/tbl32.cpp.o
[ 42%] Building CXX object CMakeFiles/smashpp.dir/cmls4.cpp.o
[ 42%] Building CXX object CMakeFiles/smashpp.dir/tbl64.cpp.o
[ 42%] Building CXX object CMakeFiles/smashpp.dir/par.cpp.o
[ 50%] Building CXX object CMakeFiles/smashpp.dir/filter.cpp.o
[ 57%] Building CXX object CMakeFiles/smashpp.dir/segment.cpp.o
clang: error: unknown argument: '-fbranch-target-load-optimize'
clang: clang: error: error: unknown argument: '-fbranch-target-load-optimize'unknown argument: '-fbranch-target-load-optimize'

clang: error: unknown argument: '-fbranch-target-load-optimize'
clang: error: unknown argument: '-fbranch-target-load-optimize'
clang: error: unknown argument: '-fbranch-target-load-optimize'
clang: error: unknown argument: '-fbranch-target-load-optimize'
make[2]: *** [CMakeFiles/smashpp.dir/cmls4.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [CMakeFiles/smashpp.dir/fcm.cpp.o] Error 1
make[2]: *** [CMakeFiles/smashpp.dir/filter.cpp.o] Error 1
make[2]: *** [CMakeFiles/smashpp.dir/logtbl8.cpp.o] Error 1
make[2]: *** [CMakeFiles/smashpp.dir/tbl64.cpp.o] Error 1
make[2]: *** [CMakeFiles/smashpp.dir/tbl32.cpp.o] Error 1
make[2]: *** [CMakeFiles/smashpp.dir/par.cpp.o] Error 1
clang: error: unknown argument: '-fbranch-target-load-optimize'
make[2]: *** [CMakeFiles/smashpp.dir/segment.cpp.o] Error 1
make[1]: *** [CMakeFiles/smashpp.dir/all] Error 2
make: *** [all] Error 2

Can Smash++ take reverse complement into account?

I was using Smash++ to identify rearrangements between a reference and a target genome. Unknown to me, the target sequence was actually on the - strand in the 3'-5' direction. It was telling me my entire sequence was inverted so I tried the reverse complement and got the anticipated result. Would there be any way that Smash++ could take reverse complement into account?
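
The manual workaround described above, reverse-complementing the target before running the comparison, can be sketched in a few lines. This is an illustration only: `reverse_complement` is a hypothetical helper, not a smashpp feature, and it handles only A, C, G, T, and N, leaving any other character unchanged.

```python
# Complement table for unambiguous bases plus N; other characters pass through.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGAATCC"))  # GGATTCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check before writing the result back to a FASTA file.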

Input files are deleted

I just installed smashpp via conda and executed:

smashpp -r ../2020-05-22_annotation/ontOnly_flye/01-Assembly/foo.fasta -t ../2020-05-22_annotation/hybrid_flye_pilon/01-Assembly/fii.fasta -n 6

that reported

(smashpp) ➜     
====[ PREPARE DATA ]==================================
[+] foo.fasta (FASTA) -> foo.seq (seq) finished.
[+] fii.fasta (FASTA) -> fii.seq (seq) finished.

====[ REGULAR MODE ]==================================
[+] Creating model of foo.fasta done.
[+] Filtering fii.fasta done => 11 segments
[+] Ref-free compression of all segments done.         
[+] Repeating above process for all segments done.

====[ INVERTED MODE ]=================================
[+] Creating model of foo.fasta done.
[+] Filtering fii.fasta done => 0 segments

Error: the file
"../2020-05-22_annotation/ontOnly_flye/01-Assembly/foo.fasta"
cannot be opened or is empty.

and then deleted my input files foo.fasta and fii.fasta!

It works when I run the same command directly on the two files by first copying them to the folder where I execute smashpp.

That's a dangerous bug!
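
Until the cause is known, the workaround mentioned above (copying the inputs next to where smashpp runs) can be scripted so the originals are never passed to the tool. A minimal sketch, assuming only the `-r` and `-t` flags shown in the command above; `stage_inputs` and the directory names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def stage_inputs(ref, tar, workdir):
    """Copy both inputs into workdir and return a smashpp command using the copies."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    staged = [shutil.copy2(p, workdir / Path(p).name) for p in (ref, tar)]
    # The command uses bare file names, so it must be run from inside workdir.
    return ["smashpp", "-r", Path(staged[0]).name, "-t", Path(staged[1]).name]


# Demonstration with throwaway files; nothing is executed.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "elsewhere"
    src.mkdir()
    (src / "foo.fasta").write_text(">a\nACGT\n")
    (src / "fii.fasta").write_text(">b\nACGA\n")
    cmd = stage_inputs(src / "foo.fasta", src / "fii.fasta", Path(tmp) / "run")
    originals_intact = (src / "foo.fasta").exists() and (src / "fii.fasta").exists()
print(cmd)
```

The command could then be run with `subprocess.run(cmd, cwd=workdir)`. Note that two inputs with the same base name would overwrite each other in the working directory.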

parametrization for analyzing/visualizing fusion/fission in mammalian genomes

Hi,

I would like to visualize the synteny and rearrangements between a few (mammalian) genomes.

While the 'classic' circos-type layout with alignments using mummer works, I think smash/smashpp has benefits and I would like to give it a whirl.

Running smashpp with the default settings does not produce any output. Do you have suggestions regarding which parameters to tweak? The assemblies I am using are chromosome-level (but the short unplaced scaffolds have not been filtered out).

Thanks!

conda installed smashpp crashes

This is using the latest conda with python 3.9

lp0105:~$ smashpp
Illegal instruction (core dumped)

in dmesg:
[26008400.796227] traps: smashpp[613] trap invalid opcode ip:55e5e5e0c9e3 sp:7ffc9a45b5c0 error:0 in smashpp[55e5e5e00000+9a000]

Not sure what the issue is.

Mac OSX M11 Installation Bug

During installation, my computer threw the error "clang: error: the clang compiler does not support '-march=native'".

When I removed "-march=native" from the CMakeLists.txt file, this fixed the error.
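
The manual edit described above can also be scripted. The sketch below removes a compiler flag from CMake text when it appears as a whole token. It is an assumption-laden illustration: the real layout of the project's CMakeLists.txt is not shown in this thread, so the example line is made up, and `strip_flags` is a hypothetical helper.

```python
import re


def strip_flags(cmake_text, flags):
    """Return cmake_text with each whole-token occurrence of the given flags removed."""
    for flag in flags:
        # Match the flag (and any spaces before it) only when it is delimited
        # by whitespace, a double quote, or a parenthesis on both sides.
        pattern = r'[ \t]*(?<![^\s"(])' + re.escape(flag) + r'(?![^\s")])'
        cmake_text = re.sub(pattern, "", cmake_text)
    return cmake_text


# Hypothetical flags line, not copied from the real CMakeLists.txt.
example = 'set(CMAKE_CXX_FLAGS "-O3 -march=native -Wall")\n'
print(strip_flags(example, ["-march=native"]))
```

Dropping `-march=native` means the compiler no longer tunes the build for the local CPU, so the binary may run somewhat slower.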

How to optimize for my genome of interest???

Hi,
I really enjoyed reading your paper and I thought smashpp would be really useful for some of my investigations. I tested your program on viral sequences of interest, and I feel I am not getting the results I expected. Briefly, it seems that smash is not recognizing many parts of the two viral genomes being compared, and it is not producing a visualization connecting all the viral blocks. I am probably not setting the program parameters correctly. I wanted to ask how to optimize the smashpp parameters for my viral sequences of interest. I ran smashpp with default settings and could not figure out how to tweak the model parameters. I also ran MAUVE and got what I expected (image attached). Could you please help me get similar results with smashpp? Example files are attached. Your help is much appreciated.

info:

comparing reference HBV genome (gtC.fasta) to integrated HBV DNA in human (seq_sample1.fasta)
divergence could be up to 15-20% for viral portion
the integrated HBV DNA fasta file (seq_sample1.fasta) has part viral and part human

Sincerely, Ondrej
SMASH_example.zip

No output for small bacterial genomes

Hi!

I am trying to visualize rearrangements for two small bacterial genomes (~ 1Mbp). But even while tweaking the parameters a little bit (-m, -l, ...) I don't get any output:

(smashpp) ➜  smashpp smashpp -r 14-2711_R47.fasta -t Chlamydia_psittaci_6BC.fasta
====[ PREPARE DATA ]==================================
[+] 14-2711_R47.fasta (FASTA) -> 14-2711_R47.seq (seq) finished.
[+] Chlamydia_psittaci_6BC.fasta (FASTA) -> Chlamydia_psittaci_6BC.seq (seq) finished.

====[ REGULAR MODE ]==================================
[+] Creating model of 14-2711_R47.fasta done.
[+] Filtering Chlamydia_psittaci_6BC.fasta done => 0 segments

====[ INVERTED MODE ]=================================
[+] Creating model of 14-2711_R47.fasta done.
[+] Filtering Chlamydia_psittaci_6BC.fasta done => 0 segments

Total time: 3 sec.

So it seems smashpp does not find any fragments/rearrangements between the two?

This is what the MAUVE output for the same sequences looks like:

Maybe I am not using the parameters correctly?

No output for two WGS analyzed...

Hi,
I am currently getting no output for two human WGS 100x samples analyzed with smash++.
Can you tell me whether there is an error, or whether there are additional parameters I should use?
Thanks

This is how I executed it:

1. Since I have paired-end Illumina FASTQ data, I have two files per sample instead of the single file smash++ needs as input. I therefore merge the paired ends into a single file with seqtk mergepe (from bwa.kit):

/projects/bin/bwa.kit/seqtk mergepe sample_R1.fastq.gz sample_R2.fastq.gz > PD-7107_S4.fastq

2. Launching smash++ with default parameters from within a container:

singularity run /projects/bin/SMASH++_container.sif /sw/smashpp/smashpp \
-r /projects/DATA/references/genome/Homo_sapiens_assembly38.fasta \
-t /projects/PD-7107_S4.fastq -n 20

3. Launching smash++ with parameters:

singularity run /projects/bin/SMASH++_container.sif /sw/smashpp/smashpp \
-r /projects/DATA/references/genome/Homo_sapiens_assembly38.fasta \
-t /projects/PD-7107_B.fastq -l 0 -m 1000 -n 20

First LOG (2.):

====[ PREPARE DATA ]==================================
[+] Homo_sapiens_assembly38.fasta (FASTA) -> Homo_sapiens_assembly38.seq (seq) finished.
[+] PD-7107_S4.fastq (FASTQ) -> PD-7107_S4.seq (seq) finished.

====[ REGULAR MODE ]==================================
[+] Creating model of Homo_sapiens_assembly38.fasta done.
[+] Filtering PD-7107_S4.fastq done => 8543 segments
[+] Ref-free compression of segment 14 ...
[+] Ref-free compression of segment 28 ...
...
[+] Ref-free compression of segment 8536 ...
[+] Ref-free compression of all segments done.
[+] Repeating above process for segment 12 ...
[+] Repeating above process for segment 25 ...
...
[+] Repeating above process for segment 8534 ...
[+] Repeating above process for all segments done.

====[ INVERTED MODE ]=================================
[+] Creating model of Homo_sapiens_assembly38.fasta done.
[+] Filtering PD-7107_S4.fastq done => 0 segments

Total time: 192:10:27 hour:min:sec.

First OUTPUT FILE:

[fmusacchia@me smash]$ more Homo_sapiens_assembly38.fasta.PD-7107_S4.fastq.pos
##SMASH++
##PARAM=<-r /projects//DATA/references/genome/Homo_sapiens_assembly38.fasta -t /projects/PD-7107_S4.fastq -n 20>
##INFO=<Ref=Homo_sapiens_assembly38.fasta,RefSize=3249912778,Tar=PD-7107_S4.fastq,TarSize=1031281362254>
#RBeg REnd RRelRdn RRdn TBeg TEnd TRelRdn TRdn Inv

As you can see it took 192 hours but gave no results.

Second LOG (3.)

====[ PREPARE DATA ]==================================
[+] Homo_sapiens_assembly38.fasta (FASTA) -> Homo_sapiens_assembly38.seq (seq) finished.
[+] PD-7107_B.fastq (FASTQ) -> PD-7107_B.seq (seq) finished.

====[ REGULAR MODE ]==================================
[+] Creating model of Homo_sapiens_assembly38.fasta done.
[+] Filtering PD-7107_B.fastq done => 1057 segments
[+] Ref-free compression of segment 14 ...
[+] Ref-free compression of segment 28 ...
...
[+] Ref-free compression of segment 1051 ...
[+] Ref-free compression of all segments done.
[+] Repeating above process for segment 12 ...
...
[+] Repeating above process for segment 1055 ...
[+] Repeating above process for all segments done.

====[ INVERTED MODE ]=================================
[+] Creating model of Homo_sapiens_assembly38.fasta done.
[+] Filtering PD-7107_B.fastq done => 1360 segments
[+] Ref-free compression of segment 14 ...
...
[+] Ref-free compression of segment 1348 ...
[+] Ref-free compression of all segments done.
[+] Repeating above process for segment 12 ...
[+] Repeating above process for segment 25 ...
...
[+] Repeating above process for segment 1359 ...
[+] Repeating above process for all segments done.

Total time: 110:38:21 hour:min:sec.

Second OUTPUT:

[fmusacchia@me smash]$ more Homo_sapiens_assembly38.fasta.PD-7107_B.fastq.pos
##SMASH++
##PARAM=<-r /projects//DATA/references/genome/Homo_sapiens_assembly38.fasta -t /projects/PD-7107_B.fastq -l 0 -m 1000 -n 20>
##INFO=<Ref=Homo_sapiens_assembly38.fasta,RefSize=3249912778,Tar=PD-7107_B.fastq,TarSize=935069756484>
#RBeg REnd RRelRdn RRdn TBeg TEnd TRelRdn TRdn Inv
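
Both .pos files above contain only metadata and the column header, with no data rows. A small parser makes that easy to check programmatically. This is a sketch under assumptions: `parse_pos` is a hypothetical helper, and it assumes data rows are whitespace-separated values in the order given by the `#RBeg ...` header line, which cannot be verified here because no data row appears in this thread.

```python
def parse_pos(text):
    """Parse .pos text into (metadata lines, list of row dicts with string values)."""
    meta, columns, rows = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])          # e.g. "SMASH++", "PARAM=<...>"
        elif line.startswith("#"):
            columns = line[1:].split()     # column names from the header line
        else:
            rows.append(dict(zip(columns, line.split())))
    return meta, rows


# Header-only sample with shortened, made-up file names.
sample = """##SMASH++
##PARAM=<-r ref.fa -t tar.fa>
#RBeg REnd RRelRdn RRdn TBeg TEnd TRelRdn TRdn Inv
"""
meta, rows = parse_pos(sample)
print(len(rows))  # 0: header only, no rows reported
```

An empty `rows` list corresponds to the situation in both outputs above: the run finished, but no regions were written.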

Can this deal with multiple genomes?

Dear authors,
Thanks for developing this useful program. Can this software deal with multiple genomes? We want to find the rearrangements across 10 genomes. How should I do that?
Thanks again and looking forward to your reply.
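
Every command in this thread compares one reference (`-r`) with one target (`-t`), so one possible approach to 10 genomes is to run all pairs. The sketch below only builds the command lists and does not run anything. Treating the problem as pairwise runs is an assumption rather than a recommendation from the authors, and `pairwise_commands` and the file names are hypothetical.

```python
from itertools import combinations


def pairwise_commands(genomes):
    """Build one smashpp command per unordered pair of genome files."""
    return [["smashpp", "-r", ref, "-t", tar] for ref, tar in combinations(genomes, 2)]


genomes = [f"genome{i}.fa" for i in range(1, 11)]  # hypothetical file names
commands = pairwise_commands(genomes)
print(len(commands))  # 45
```

If the roles of reference and target are not interchangeable for your analysis, `itertools.permutations(genomes, 2)` would give both directions (90 runs for 10 genomes).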
