maxibor / adrsm

Ancient DNA Read Simulator for Metagenomic

License: MIT License

Python 5.59% HTML 92.83% Jupyter Notebook 1.34% Shell 0.02% Makefile 0.22%
genomics sequencing dna reads simulation adna metagenomics benchmark

adrsm's Introduction

Hi there 👋

I am Maxime, a postdoctoral researcher in bioinformatics for metagenomics and ancient DNA, based at the Max Planck Institute for Evolutionary Anthropology in Germany and the Leibniz Institute for Natural Product Research and Infection Biology, Hans Knöll Institute.

adrsm's People

Contributors

jfy133, maxibor

Forkers

jfy133

adrsm's Issues

Add option for running offline

Some clusters may have computing nodes that are offline.

@maxibor says that adrsm currently uses the JGI API to convert a TAXID to a species name.

Therefore errors such as

requests.exceptions.ConnectionError: HTTPConnectionPool(host='taxonomy.jgi-psf.org', port=80): Max retries exceeded with url: /tax/pt_name/GCA_013267415 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2b5d41b17668>: Failed to establish a new connection: [Errno 110] Connection timed out',))

can occur if no external internet access is available.

An option to run the tool offline (e.g. using the ete3 toolkit) would be helpful for the above use case.
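One way such an offline mode could behave is sketched below in Python. It does not reflect adrsm's actual code: `resolve_name`, `load_local_names`, the `offline` flag and the two-column TSV of identifiers and species names are all hypothetical names chosen for illustration. The idea is simply to try the online lookup first and fall back to a local table when the network is unreachable. The sketch relies on the fact that the `ConnectionError` raised by `requests` derives from `OSError`.

```python
import csv


def load_local_names(tsv_path):
    """Read a two-column TSV (identifier, species name) into a dict.

    The file layout is an assumption made for this sketch.
    """
    with open(tsv_path, newline="") as handle:
        rows = csv.reader(handle, delimiter="\t")
        return {row[0]: row[1] for row in rows if len(row) >= 2}


def resolve_name(identifier, online_lookup, local_names, offline=False):
    """Return a species name, using the online service unless it is unavailable.

    online_lookup is any callable that takes an identifier and returns a
    name (for example a thin wrapper around the JGI request).
    """
    if not offline:
        try:
            return online_lookup(identifier)
        except OSError:
            # Network failures (including requests' ConnectionError, which
            # derives from OSError) land here; fall through to the local table.
            pass
    try:
        return local_names[identifier]
    except KeyError:
        raise SystemExit(
            f"No species name found for {identifier!r}: the online lookup is "
            "unavailable and the local table has no entry for it."
        ) from None
```

With `offline=True` the online service is never contacted, which is the behaviour a computing node without internet access would need. A local taxonomy database such as the one ete3 maintains could take the place of the TSV table.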

Offer option to limit number of output reads

Another factor that could be considered when generating in silico aDNA data is sequencing effort.

Currently, adrsm generates sequencing reads that are equivalent to a 'true sample', i.e. one that contains the original full genomes at the requested genomic depth.

However, sequencing experiments themselves have a fixed limit, set by the capacity of the machine (or by the amount of DNA actually loaded into it).

I would like to request an option for adrsm to output a user-defined number of reads, for example only 10 million pairs in the output files. To simulate sequencing, these reads should be sampled randomly from the initially sheared reads, so that each genome's chance of being sampled reflects its coverage.
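A minimal Python sketch of the requested subsampling follows, assuming the sheared read pairs are available as an iterable. The function name `subsample_reads` and its parameters are hypothetical, not part of adrsm. It uses reservoir sampling, so the full pool never has to be held in memory. Because every pair is equally likely to be kept, a genome simulated at higher coverage contributes proportionally more pairs to the sample.

```python
import random


def subsample_reads(read_pairs, n_reads, seed=None):
    """Keep a uniform random sample of at most n_reads items from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for index, pair in enumerate(read_pairs):
        if index < n_reads:
            # Fill the reservoir with the first n_reads pairs.
            reservoir.append(pair)
        else:
            # Keep the new pair with probability n_reads / (index + 1),
            # overwriting a uniformly chosen slot.
            slot = rng.randint(0, index)
            if slot < n_reads:
                reservoir[slot] = pair
    return reservoir
```

If the pool holds fewer pairs than requested, all of them are returned unchanged. Passing a `seed` makes the sample reproducible between runs.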

Provide informative error when an IUPAC code is hit and the tool fails

Some of the 'genomes' I downloaded and tried to input were actually unfinished assemblies:

e.g. GCA_001057035.1_ASM105703v1_genomic.fna

These contained symbols such as W/R/S in the sequence itself.

I would suggest that, when this is encountered, adrsm reports a clear error (rather than a traceback) stating that such symbols are not accepted and asking the user to fix the input.

You could also suggest the following command to clean it up:

sed -i '/^[^>]/s/[RYWSMKHBVD]/N/g' *.fna
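The check itself could look something like the Python sketch below. This is not adrsm's implementation: `find_ambiguity_codes` and `check_fasta` are hypothetical names. It scans the sequence lines of a FASTA file for the same ten ambiguity codes as the sed command, and exits with a readable message instead of a traceback. Unlike the sed command, it also catches lowercase symbols.

```python
import sys

# IUPAC nucleotide ambiguity codes other than N.
AMBIGUITY_CODES = set("RYSWKMBDHV")


def find_ambiguity_codes(fasta_lines):
    """Return (line_number, symbols) for each sequence line with ambiguity codes."""
    hits = []
    for line_number, line in enumerate(fasta_lines, start=1):
        if line.startswith(">"):
            continue  # header lines may legitimately contain these letters
        found = sorted(set(line.strip().upper()) & AMBIGUITY_CODES)
        if found:
            hits.append((line_number, found))
    return hits


def check_fasta(path):
    """Exit with an informative message if the FASTA file has ambiguity codes."""
    with open(path) as handle:
        hits = find_ambiguity_codes(handle)
    if hits:
        line_number, found = hits[0]
        sys.exit(
            f"{path}: IUPAC ambiguity code(s) {', '.join(found)} found on line "
            f"{line_number}. Only A, C, G, T and N are accepted; please replace "
            "the ambiguous bases (for example with N) and try again."
        )
```

Calling `check_fasta` on each input genome before simulation starts would surface the problem early, before any reads are generated.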
