ust's Introduction

UST

UST is a bioinformatics tool for constructing a spectrum-preserving string set (SPSS) representation from sets of k-mers.

Note: This software has been subsumed by ESSCompress. To use UST, download ESSCompress and follow the UST instructions in the README.

Requirements

GCC >= 4.8 or a C++11 capable compiler

Quick start

To install, compile from source:

git clone https://github.com/medvedevgroup/UST
cd UST
make

After compiling, use

./ust -i [unitigs.fa] -k [kmer_size]

e.g.

./ust -i examples/k11.unitigs.fa -k 11

The important parameters are:

  • -k [int] : The k-mer size that was used to generate the input, i.e. the length of the nodes of the node-centric de Bruijn graph.
  • -i [input-file] : Unitigs file produced by BCALM2 in FASTA format.
  • -a [0 or 1] : Default is 0. A value of 1 tells UST to preserve abundance. Use this option when the input file was generated with the -all-abundance-counts option of BCALM2.

The output is a FASTA file with the extension "ust.fa", written to the working folder; it is the SPSS representation of the input. If the program is run with the option -a 1, an additional count file with the extension "ust.counts" is also generated.

Detailed Usage

To build an SPSS representation for your k-mer set, you must first run BCALM2 on your set of k-mers. BCALM2 constructs a set of unitigs. Those unitigs are then fed as input to ust, which outputs a FASTA file with the SPSS representation. Note that the -k parameter to ust must match the -kmer-size used when running BCALM2.
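As a sanity check on the pipeline above, the following Python sketch compares the canonical k-mer set of the BCALM2 unitigs with that of the UST output; the two sets should be identical if the output preserves the spectrum of the input. The function names (`read_fasta`, `canonical`, `kmer_set`, `same_spectrum`) are illustrative and not part of UST, and the sketch assumes the sequences contain only A, C, G and T. It checks set equality only, not that each k-mer occurs exactly once.

```python
from typing import Iterable, Iterator, Set

# Translation table for complementing DNA (assumes an A/C/G/T alphabet only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield each sequence from an iterable of FASTA lines."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header closes the previous record, if any.
            if chunks:
                yield "".join(chunks)
                chunks = []
        elif line:
            chunks.append(line.upper())
    if chunks:
        yield "".join(chunks)


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_set(sequences: Iterable[str], k: int) -> Set[str]:
    """Collect the canonical k-mers of all sequences."""
    return {
        canonical(seq[i:i + k])
        for seq in sequences
        for i in range(len(seq) - k + 1)
    }


def same_spectrum(unitig_lines: Iterable[str], spss_lines: Iterable[str], k: int) -> bool:
    """True if both FASTA inputs contain exactly the same canonical k-mers."""
    return kmer_set(read_fasta(unitig_lines), k) == kmer_set(read_fasta(spss_lines), k)
```

To use it on real files, pass two open file handles (for example the BCALM2 unitigs file and mykmers.ust.fa) together with the same k that was given to both tools.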

If you would like to store the data on disk in compressed form (like UST-Compress in our paper), you can then install and run MFCompress on the output of UST as follows: MFCompressC mykmers.ust.fa

If you would like to build a membership data structure based on UST, then

  • Install bwtdisk and dbgfm.
  • Change the two variables "DBGFM_DIRECTORY" and "BWTDISK_DIRECTORY" in the script ust-fm.sh to point to the locations where dbgfm and bwtdisk are installed. Alternatively, you can add the path to both tools in your environment PATH variable and then modify the script accordingly.
  • Run ust-fm.sh as follows: ust-fm.sh mykmers.ust.fa

Citation

If using UST in your research, please cite

@inproceedings{RahmanMedvedevRECOMB20,
  author    = {Amatur Rahman and Paul Medvedev},
  title     = {Representation of $k$-mer sets using spectrum-preserving string sets},
  booktitle = {Research in Computational Molecular Biology - 24th Annual International Conference, {RECOMB} 2020, Padua, Italy, May 10-13, 2020, Proceedings},
  series    = {Lecture Notes in Computer Science},
  volume    = {12074},
  pages     = {152--168},
  publisher = {Springer},
  year      = {2020}
}

Note that the general notion of an SPSS was independently introduced under the name of simplitigs. Therefore, if citing this general notion, please also cite the simplitigs paper.

ust's People

Contributors

amatur, pashadag

ust's Issues

Counts seem incorrect

Hi!
I'm a computer engineering student, and my master's thesis is essentially about improving UST (see here if interested).

I wrote a simple C++ program that extracts the canonical k-mers from the simplitigs and pairs each one with its count, reading the counts sequentially from the UST output files.
I then sorted the resulting k-mer list and compared it with the one computed by Jellyfish-2.

The k-mers are the same, but the counts differ. Can you confirm this?

How to reproduce

Extract kmers and counts from ust output files:

  • g++ kmers-extractor.cpp -o kmers-extractor
  • ./kmers-extractor <kmer-size> <ust-fasta> <ust-counts>
  • sort ust-kmers.txt -o ust-kmers-sorted.txt
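
For readers without the attachment, the extraction step above can be approximated by the Python sketch below. It is not the attached kmers-extractor.cpp. It assumes that each FASTA sequence sits on a single line and that the counts file holds one integer per k-mer, in the order the k-mers occur from left to right across the sequences; that layout is an assumption drawn from the description above, not from UST documentation. The name `pair_kmers_with_counts` is hypothetical.

```python
from typing import Iterable, List, Tuple

# Translation table for complementing DNA (assumes an A/C/G/T alphabet only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def pair_kmers_with_counts(
    fasta_lines: Iterable[str], count_lines: Iterable[str], k: int
) -> List[Tuple[str, int]]:
    """Pair every k-mer of the FASTA input with the next count, then sort.

    ASSUMPTION: counts are whitespace-separated integers, one per k-mer,
    listed in the same order as the k-mers appear in the FASTA file.
    """
    counts = (int(token) for line in count_lines for token in line.split())
    pairs = []
    for line in fasta_lines:
        seq = line.strip().upper()
        if not seq or seq.startswith(">"):
            continue  # skip blank lines and FASTA headers
        for i in range(len(seq) - k + 1):
            count = next(counts, None)
            if count is None:
                raise ValueError("counts input has fewer entries than k-mers")
            pairs.append((canonical(seq[i:i + k]), count))
    if next(counts, None) is not None:
        raise ValueError("counts input has more entries than k-mers")
    return sorted(pairs)
```

Raising an error when the number of counts and k-mers disagree is a deliberate choice here: a silent misalignment would shift every later count onto the wrong k-mer.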

Extract kmers and counts from the starting sequence (not the BCALM2 output):

  • jellyfish-linux count -m <kmer-size> -C -s 100M -L 2 <starting-fasta>
  • jellyfish-linux dump -c mer_counts.jf > kmers.txt
  • sort kmers.txt -o kmers-sorted.txt

Compare the two files:

  • cmp kmers-sorted.txt ust-kmers-sorted.txt

kmers-extractor is attached.

Note that kmers with abundance 1 are ignored.
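
Since cmp only reports the first differing byte, a per-k-mer comparison may be more informative. The Python sketch below assumes both files use the two-column "kmer count" layout that jellyfish dump -c produces, and that ust-kmers-sorted.txt was written in the same layout (an assumption, since the extractor's output format is not shown). The names `load_table` and `diff_counts` are hypothetical, and `min_count` mirrors the note above about ignoring k-mers with abundance 1.

```python
from typing import Dict, Iterable, List, Tuple


def load_table(lines: Iterable[str], min_count: int = 1) -> Dict[str, int]:
    """Parse 'kmer count' lines into a dict, dropping counts below min_count."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields[0], int(fields[1])
        if count >= min_count:
            table[kmer] = count
    return table


def diff_counts(
    expected: Dict[str, int], observed: Dict[str, int]
) -> Tuple[List[str], List[str], Dict[str, Tuple[int, int]]]:
    """Return k-mers missing on either side and k-mers whose counts disagree."""
    only_expected = sorted(expected.keys() - observed.keys())
    only_observed = sorted(observed.keys() - expected.keys())
    mismatched = {
        kmer: (expected[kmer], observed[kmer])
        for kmer in sorted(expected.keys() & observed.keys())
        if expected[kmer] != observed[kmer]
    }
    return only_expected, only_observed, mismatched
```

Calling `diff_counts(load_table(jellyfish_lines, 2), load_table(ust_lines, 2))` then lists each k-mer whose count differs as an (expected, observed) pair, which makes it easier to see whether the discrepancy is systematic.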
