GithubHelp home page GithubHelp logo

get_assemblies's Introduction

get_assemblies

Who might want this software?

This software is intended to facilitate the use of the NCBI assembly database.

The intended audience is scientific researchers, computational biologists, and bioinformaticians who need sequence data for comparative genomics projects.

Anyone using NCBI sequence data (for reference genomes/transcriptomes) can benefit as well.

PhD/Masters students and undergraduates are especially encouraged to submit issues if they are having trouble using this software.

This software is written in python. I'm happy to help those who are having trouble with software installs.

Windows users, please install WSL to make use of this software. Using a Linux distribution will make your life as a computational researcher significantly easier.

Installation

get_assemblies is distributed on PyPI as a universal wheel and is available on Linux/macOS and supports Python 3.7+. This software will work on Windows using WSL.

get_assemblies depends on NCBI Entrez Direct which NO LONGER requires Perl. Perl is installed by default on most *nix systems. If edirect is not currently installed, please run get_assemblies --dledirect to install.

$ python3 -m pip install -U get-assemblies

Dependencies

Python modules:

  1. rich

See requirements.txt for more info.

External programs:

  1. NCBI Entrez Direct

You can install external programs using the get_assemblies --dledirect command. These will be installed to ${HOME}/edirect unless otherwise specified.

Just tell me how to run it

$ get_assemblies organism 'Pseudomonas fluorescens'

This will find all genomes tagged as 'Pseudomonas fluorescens' in NCBI's database. By default, this command will only check to see how many genomes fall into this category.

To download metadata for the genomes:

$ get_assemblies organism 'Pseudomonas fluorescens' --function metadata

Check out the metadata.tab file that is created after running this command. Generally you will want to select a subset from your search. One way to do this is to select the lines that include the genomes of interest, and then saving the assembly accessions to a file. You can either delete the lines that you don't want, or use grep to pull out the lines that you want to keep. Then you can use cut -f 15 > accs.txt to get the assembly accesions in a file.

$ cat accs.txt | get_assemblies assembly_ids - --function genomes -o fna

This will download the nucleotide fasta files for your genomes of interest.

Overview

This tool was written to make accessing genomic data from NCBI easier. The output files are renamed such that each assembly has a Genus species strain in the filename to make it easy to find the genomes that you're interested in. You won't have to spend time renaming the files by hand.

This software is effectively a wrapper for the NCBI edirect tools that makes getting genome files easier. If you are interested in starting a comparative genomics project, this is the tool for you.

The software supports four types of input:

  1. organism input, either taxonomy rank names (e.g. Genus species, Family) or taxids
  2. assembly ids, either accessions or uids
  3. nuccore ids (e.g. individual contig/chromosome names)
  4. json input (e.g. the intermediate files - docsums - produced by this script)

Five file type outputs are supported:

  1. Nucleotide genome sequence (fna)
  2. Nucleotide coding sequence (ffn)
  3. Amino acid coding sequence (faa)
  4. General feature format (i.e. tab-delimited features) (gff)
  5. GenBank format (gbk)

The program will attempt to find a unique prefix per genome assembly. This prefix will be in the resulting filename. A metadata file that contains much of the relevant information per genome will also be included. This file can be included as a supplementary table for a manuscript in a comparative genomics project.

If you need to make phylogenetic trees with these data, check out my other python package, automlsa2.

More Examples

$ get_assemblies organism 'Mycobacterium'
2020-10-15 22:49:53,257 - INFO - Found 7522 genomes to download.
2020-10-15 22:49:53,257 - INFO - Expect 37610MB to 52654MB of data.
$ get_assemblies organism --type ID 167539 --function genomes -o gbk
2020-10-15 23:10:13,822 - INFO - Found 1 genomes to download.
2020-10-15 23:10:13,822 - INFO - Expect 5MB to 7MB of data pending the chosen file types for download.
chunk: 1it [00:01,  1.21s/it]
docsums: 100%|██████████████████████████████| 1/1 [00:00<00:00, 5146.39it/s]
2020-10-15 23:10:16,262 - INFO - Downloading 1 files.
100% [##################################################]           1M / 1M]
2020-10-15 23:10:18,044 - INFO - P_marinus_CCMP1375_SS120.gbk successfully downloaded.
download: 100%|███████████████████████████████| 1/1 [00:01<00:00,  1.78s/it]
$ ls
docsums0.json       metadata.tab
get_assemblies.log  P_marinus_CCMP1375_SS120.gbk
$ echo GCA_000269645.2 | get_assemblies assembly_ids -
2020-10-15 23:18:04,107 - INFO - Found 1 genomes to download.
2020-10-15 23:18:04,107 - INFO - Expect 5MB to 7MB of data pending the chosen file types for download.

Usage

$ get_assemblies -h
usage: get_assemblies [-h] [--debug] [--version] [--dledirect [DLEDIRECT]] {organism,assembly_ids,nuccore_ids,json_input} ...

Downloads assemblies & annotations from NCBI.

positional arguments:
  {organism,assembly_ids,nuccore_ids,json_input}
                        Choose from this list of input types.
    organism            Valid NCBI organism or taxids.
    assembly_ids        Valid NCBI assembly IDs.
    nuccore_ids         Valid NCBI nucleotide accessions.
    json_input          Valid NCBI JSON docsums.

optional arguments:

-h, --help show this help message and exit
--debug Turn on debugging messages.
--version show program's version number and exit
--dledirect <[DLEDIRECT]>
 Download edirect to given location. [~/edirect]

Organism

$ get_assemblies organism -h
usage: get_assemblies organism [-h] [--type {text,ID}] [--function {check,metadata,genomes} [{check,metadata,genomes} ...]]
                               [--annotation] [--metadata_append] [--typestrain] [--keepmulti] [--force]
                               [-f {abbr,full,strain}] [-o {fna,ffn,gff,gbk,faa,all} [{fna,ffn,gff,gbk,faa,all} ...]]
                               [--edirect EDIRECT] [--debug]
                               query

positional arguments:
  query                 Valid NCBI organism text term or ID

optional arguments:

-h, --help show this help message and exit
--type <{text,ID}>
 Input is text term (default) or ID
--function <{check,metadata,genomes} [{check,metadata,genomes} ...]>
 check counts, download metadata, or genomes. [check]
--annotation Require annotation? False by default, True if gbk/faa/ffn requested
--metadata_append
 Append to metadata, not overwrite.
--typestrain Only download type strains.
--keepmulti By default, genomes from large multi-isolatestudies are removed.
--force Force download attempt of low-quality genomes.
-f <{abbr,full,strain}, --outformat {abbr,full,strain}>
 Output file prefix. [full]
-o <{fna,ffn,gff,gbk,faa,all} [{fna,ffn,gff,gbk,faa,all} ...]>
 Output file types.
--edirect EDIRECT
 Path to edirect directory.
--debug Turn on debugging messages.

Assembly IDs

$ get_assemblies assembly_ids -h
usage: get_assemblies assembly_ids [-h] [--type {acc,uid}]
                                   [--function {check,metadata,genomes} [{check,metadata,genomes} ...]] [--annotation]
                                   [--metadata_append] [--typestrain] [--keepmulti] [--force] [-f {abbr,full,strain}]
                                   [-o {fna,ffn,gff,gbk,faa,all} [{fna,ffn,gff,gbk,faa,all} ...]] [--edirect EDIRECT]
                                   [--debug]
                                   infile

positional arguments:
  infile                Input file with NCBI assembly IDs; "-" for stdin

optional arguments:

-h, --help show this help message and exit
--type <{acc,uid}>
 Input is Accession (default) or ID
--function <{check,metadata,genomes} [{check,metadata,genomes} ...]>
 check counts, download metadata, or genomes. [check]
--annotation Require annotation? False by default, True if gbk/faa/ffn requested
--metadata_append
 Append to metadata, not overwrite.
--typestrain Only download type strains.
--keepmulti By default, genomes from large multi-isolatestudies are removed.
--force Force download attempt of low-quality genomes.
-f <{abbr,full,strain}, --outformat {abbr,full,strain}>
 Output file prefix. [full]
-o <{fna,ffn,gff,gbk,faa,all} [{fna,ffn,gff,gbk,faa,all} ...]>
 Output file types.
--edirect EDIRECT
 Path to edirect directory.
--debug Turn on debugging messages.

Nucleotide IDs

$ get_assemblies nuccore_ids -h
usage: get_assemblies nuccore_ids [-h] [--function {check,metadata,genomes} [{check,metadata,genomes} ...]] [--annotation]
                                  [--metadata_append] [--typestrain] [--keepmulti] [--force] [-f {abbr,full,strain}]
                                  [-o {fna,ffn,gff,gbk,faa,all} [{fna,ffn,gff,gbk,faa,all} ...]] [--edirect EDIRECT] [--debug]
                                  infile

positional arguments:
  infile                Input file with NCBI nuccore IDs; "-" for stdin

optional arguments:

-h, --help show this help message and exit
--function <{check,metadata,genomes} [{check,metadata,genomes} ...]>
 check counts, download metadata, or genomes. [check]
--annotation Require annotation? False by default, True if gbk/faa/ffn requested
--metadata_append
 Append to metadata, not overwrite.
--typestrain Only download type strains.
--keepmulti By default, genomes from large multi-isolatestudies are removed.
--force Force download attempt of low-quality genomes.
-f <{abbr,full,strain}, --outformat {abbr,full,strain}>
 Output file prefix. [full]
-o <{fna,ffn,gff,gbk,faa,all} [{fna,ffn,gff,gbk,faa,all} ...]>
 Output file types.
--edirect EDIRECT
 Path to edirect directory.
--debug Turn on debugging messages.

NCBI JSON Docsum input

$ get_assemblies json_input -h
usage: get_assemblies json_input [-h] [--function {metadata,genomes} [{metadata,genomes} ...]] [--annotation]
                                 [--metadata_append] [--typestrain] [--keepmulti] [--force] [-f {abbr,full,strain}]
                                 [-o {fna,ffn,gff,gbk,faa,all} [{fna,ffn,gff,gbk,faa,all} ...]] [--edirect EDIRECT] [--debug]
                                 jsonfile [jsonfile ...]

positional arguments:
  jsonfile              Input JSON file with docsums; "-" for stdin

optional arguments:

-h, --help show this help message and exit
--function <{metadata,genomes} [{metadata,genomes} ...]>
 Download metadata and/or genomes. [metadata]
--annotation Require annotation? False by default, True if gbk/faa/ffn requested
--metadata_append
 Append to metadata, not overwrite.
--typestrain Only download type strains.
--keepmulti By default, genomes from large multi-isolatestudies are removed.
--force Force download attempt of low-quality genomes.
-f <{abbr,full,strain}, --outformat {abbr,full,strain}>
 Output file prefix. [full]
-o <{fna,ffn,gff,gbk,faa,all} [{fna,ffn,gff,gbk,faa,all} ...]>
 Output file types.
--edirect EDIRECT
 Path to edirect directory.
--debug Turn on debugging messages.

Bugs

Viruses are currently not handled well, if at all. Look elsewhere to download those.

Contributing

Feel free to submit bug reports or pull requests so we can improve this software. Undoubtedly there will be some erroneous prefixes generated out there, and I'd like to fix them.

Author Contact

Ed Davis

Acknowledgments

Special thanks for helping me test the software and get the python code packaged:

Also, thanks to these groups for supporting me through my scientific career:

License

get_assemblies is distributed under the terms listed in the LICENSE file. The software is free for non-commercial use.

Copyrights

Copyright (c) 2020 Oregon State University

All Rights Reserved.

get_assemblies's People

Contributors

davised avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

vikash84

get_assemblies's Issues

mixed genomes

Hi @davised

So I tried downloading some genomes classified as "Agrobacterium rhizogenes". However, I see that other genomes are also downloaded, including those of Bacillus, Enterococcus, Leptospira, Salmonella, Staphylococcus.

I mean it would be easy to depurate these genomes from the collection. I assume this error originates in the database, right? Or is it because of the get_assemblies package?

I ran the following command:
cat metadata.tab | get_assemblies assembly_ids - --function genomes -o fna

It seems that all the 96 genomes were downloaded however I got the following error:
`Traceback (most recent call last):
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2449, in retrfile
self.ftp.cwd(file)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 625, in cwd
return self.voidcmd(cmd)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 286, in voidcmd
return self.voidresp()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 259, in voidresp
resp = self.getresp()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 254, in getresp
raise error_perm(resp)
ftplib.error_perm: 550 GCA_001367915.1_10493_1: No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1572, in ftp_open
fp, retrlen = fw.retrfile(file, type)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2451, in retrfile
raise URLError('ftp error: %r' % reason) from reason
urllib.error.URLError: <urlopen error ftp error: error_perm('550 GCA_001367915.1_10493_1: No such file or directory')>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/main.py", line 121, in dl_gzip
copy_url(pbar, task_id, uri, filename)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/main.py", line 142, in copy_url
with urlopen(uri) as response:
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 519, in open
response = self._open(req, data)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1583, in ftp_open
raise exc.with_traceback(sys.exc_info()[2])
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1572, in ftp_open
fp, retrlen = fw.retrfile(file, type)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2451, in retrfile
raise URLError('ftp error: %r' % reason) from reason
urllib.error.URLError: <urlopen error ftp error: URLError("ftp error: error_perm('550 GCA_001367915.1_10493_1: No such file or directory')")>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1565, in ftp_open
fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1586, in connect_ftp
return ftpwrapper(user, passwd, host, port, dirs, timeout,
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2407, in init
self.init()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2419, in init
self.ftp.cwd(_target)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 625, in cwd
return self.voidcmd(cmd)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 286, in voidcmd
return self.voidresp()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 259, in voidresp
resp = self.getresp()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 254, in getresp
raise error_perm(resp)
ftplib.error_perm: 550 genomes/all/GCA/001/367/915/GCF_001367915.1_10493_1_6: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/bin/get_assemblies", line 8, in
sys.exit(main())
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/main.py", line 1101, in main
download_genomes(args.o, dl_mapping, args.threads)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/main.py", line 1058, in download_genomes
output = future.result()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/main.py", line 126, in dl_gzip
copy_url(uri, task_id, uri, filename)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/main.py", line 142, in copy_url
with urlopen(uri) as response:
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 519, in open
response = self._open(req, data)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1583, in ftp_open
raise exc.with_traceback(sys.exc_info()[2])
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1565, in ftp_open
fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1586, in connect_ftp
return ftpwrapper(user, passwd, host, port, dirs, timeout,
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2407, in init
self.init()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2419, in init
self.ftp.cwd(_target)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 625, in cwd
return self.voidcmd(cmd)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 286, in voidcmd
return self.voidresp()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 259, in voidresp
resp = self.getresp()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 254, in getresp
raise error_perm(resp)
urllib.error.URLError: <urlopen error ftp error: error_perm('550 genomes/all/GCA/001/367/915/GCF_001367915.1_10493_1_6: No such file or directory')>
`

Is the error showing at the end important? I am attaching the log file as well.

Cheers,
Pablo
get_assemblies.log

metadata information?

Hi @davised

I am interested in using this package for getting assemblies. However, I was wondering if there is a way to also obtain the metadata for each assembly? By metadata I mean the assembly statistics and some of the information contained in the general report (BioSample, BioProject, Submitter...).

Cheers,
Pablo

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.