gtdblite's People

Contributors

askars

gtdblite's Issues

IMG metadata

It would be good to incorporate all metadata available at IMG into the GTDB. This can be downloaded from the IMG website as a CSV file. Downloading it is a bit of a pain, but it is possible.

16S top hit

It would be extremely helpful to have 16S BLAST hits for all genomes in the GTDB. For the top hit, this should include the alignment length, % identity, e-value, taxonomy string, database identifier, and NCBI accession number. It would also be nice to have the same information for the second-best hit. Perhaps a homology search against the latest SILVA database, with Phil's Greengenes taxonomy mapped onto it, would be best.
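As a rough sketch of how the top two hits per genome could be pulled out, the following assumes the search was run with BLAST's standard 12-column tabular output (`-outfmt 6`). The function name `top_hits` and the sequence identifiers in the usage note are hypothetical. The taxonomy string and NCBI accession are not part of that output and would need a separate lookup keyed on the subject identifier.

```python
import csv
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def top_hits(blast_lines, n=2):
    """Return the n best hits per query sequence, ranked by bitscore.

    blast_lines is any iterable of tab-separated lines, e.g. an open file.
    """
    hits = defaultdict(list)
    for row in csv.reader(blast_lines, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(BLAST_FIELDS, row))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits[hit["qseqid"]].append(hit)
    # Highest bitscore first, then keep only the first n hits.
    return {
        query: sorted(query_hits, key=lambda h: h["bitscore"], reverse=True)[:n]
        for query, query_hits in hits.items()
    }
```

Calling `top_hits(open("hits.tsv"))` on a results file (the file name is a placeholder) would give, per query, a list whose first element is the top hit and whose second is the second-best hit, each carrying the alignment length, % identity, and e-value requested above.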

Improved marker gene selection for phylogenetic inference

When a genome has multiple copies of a marker gene to be used for phylogenetic inference, one is currently selected (best e-value?). In practice, the selected copy may not be the correct marker gene for the genome and could be contamination. It would be better to simply ignore any marker gene that is identified more than once in a genome. This is the approach currently taken by CheckM.
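The proposed rule can be sketched as a simple filter. This is only an illustration of the idea, not how gtdblite or CheckM implement it; the function name and the `(genome_id, marker_id, gene_id)` tuple layout are assumptions made for the example.

```python
from collections import Counter


def single_copy_markers(hits):
    """Keep only markers found exactly once in a genome.

    hits: iterable of (genome_id, marker_id, gene_id) tuples (hypothetical layout).
    Returns {(genome_id, marker_id): gene_id} for single-copy markers only.
    """
    hits = list(hits)
    # Count how many genes matched each marker within each genome.
    copies = Counter((genome, marker) for genome, marker, _ in hits)
    return {
        (genome, marker): gene
        for genome, marker, gene in hits
        if copies[(genome, marker)] == 1
    }
```

A marker that is dropped this way is treated as missing data for that genome in the alignment, rather than risking the inclusion of a contaminating copy.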

Default privacy setting should be public

It would likely be best if the default privacy setting for creating marker lists or adding genomes was public. In general, we are looking to share our data across ACE and build synergy between different projects. Only in some rare cases would I expect a user to need/want to make their data private.

Export genomes as fasta files

It would be useful to allow users to export a list of genomes as fasta files. Alternatively, we can just give users read access to the directory of population genomes. Either way, people need some way to access these things! :)

Better error reporting when making trees

When Prodigal isn't in your path you get a very vague error message which, due to the parallel nature of the code, points to an uninformative part of the source (see below). A check for the required external dependencies before starting the calculation would help a lot.

src/gtdblite.py trees create --output ~/test_genome_tree --all_genomes --marker_set_ids 3 --no_tree
24377 genomes contain 975080 uncalculated markers.
These markers need to be calculated in order to build the tree. More markers means more waiting. Continue using 1 threads? (y/N): y
Breaking calculation into 49 chunks of up to 500 genomes.
Calculating chunk 1 of 49....
Prodigal complete for 102 of 500 genomes (chunk 1 of 49),
Prodigal complete for 189 of 500 genomes (chunk 1 of 49),
Prodigal complete for 276 of 500 genomes (chunk 1 of 49),
Prodigal complete for 363 of 500 genomes (chunk 1 of 49),
Prodigal complete for 449 of 500 genomes (chunk 1 of 49),
Prodigal complete for 500 of 500 genomes (chunk 1 of 49),
Exception caught. Dumping info.
Traceback (most recent call last):
  File "src/gtdblite.py", line 664, in <module>
    result = args.func(db, args)
  File "src/gtdblite.py", line 128, in CreateTreeData
    return db.MakeTreeData(marker_id_list, genome_id_list, args.out_dir, "gtdblite", args.profile, profile_config_dict, not(args.no_tree))
  File "/export/data1/sw/GTDBLite/src/gtdblite/GenomeDatabase.py", line 1550, in MakeTreeData
    prodigal_dir = async_result.get()
  File "/opt/qiime/1.8.0/python-2.7.3-release/lib/python2.7/multiprocessing/pool.py", line 528, in get
    raise self._value
OSError: [Errno 2] No such file or directory
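
A minimal sketch of such an up-front dependency check follows. It assumes Python 3.3 or later, where `shutil.which` is available; the traceback above comes from Python 2.7, which would need a different lookup such as `distutils.spawn.find_executable`. The function names and the default list of programs are hypothetical.

```python
import shutil
import sys


def missing_executables(names):
    """Return the subset of program names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def check_dependencies(names=("prodigal",)):
    """Exit with a clear message if any required external program is missing."""
    missing = missing_executables(names)
    if missing:
        sys.exit("Required program(s) not found on PATH: " + ", ".join(missing))
```

Calling `check_dependencies()` before any worker processes are started would replace the bare `OSError: [Errno 2]` with a message that names the missing program.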

Dereplicate common species

A number of species are represented by an excessive number of genomes (e.g. C. difficile, E. coli). It would be beneficial to dereplicate these species for the purposes of inferring a genome tree. This would reduce the time required to infer the tree and help with visualizing the tree. Some care needs to be taken, as any removed taxa may need to be updated based on changes made to the taxonomy. Also, we need to make sure to only remove taxa that will not be of interest to users. As a start, I suggest simply dereplicating well established species and only genomes from IMG.
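
One simple way to approach this, sketched below, is to cap the number of IMG genomes kept per species and record which genomes were set aside so their taxonomy can still be updated later. The record keys (`id`, `species`, `source`), the function name, and the cap of 10 are all assumptions for illustration. Deciding which species count as well established is not addressed here.

```python
from collections import defaultdict


def dereplicate(genomes, max_per_species=10, source="IMG"):
    """Cap the number of genomes per species for a single source.

    genomes: list of dicts with hypothetical keys 'id', 'species', 'source'.
    Genomes from other sources, or without a species name, are always kept.
    Returns (kept, removed) so removed genomes can still be tracked.
    """
    kept, removed = [], []
    seen = defaultdict(int)  # species name -> genomes kept so far
    for genome in genomes:
        if genome["source"] != source or not genome["species"]:
            kept.append(genome)
        elif seen[genome["species"]] < max_per_species:
            seen[genome["species"]] += 1
            kept.append(genome)
        else:
            removed.append(genome)
    return kept, removed
```

This keeps the first genomes encountered for each species. A more careful version might instead rank genomes by quality before applying the cap.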
