guyleonard / get_jgi_genomes

A quick and easy way to download the genomes/predicted proteins of taxa available in JGI's Genome Portal.

License: GNU General Public License v3.0

Perl 100.00%
fungi jgi genomes genome protein-sequences mycocosm phytozome algae archaeplastida metazome

get_jgi_genomes's Introduction

Get JGI Genomes!

Download files from the genomes contained within JGI's various -zomes and -cosms.

Usage

Log in to JGI with your username and password (-u and -p) to generate the 'cookie' file that your downloads require. The cookie is valid for your current session only and expires daily.

Then choose a portal (-f, -a, -P <version> or -m <version>) to download files from its available genome projects. You can also choose the type of data to download (-A, -C, -g or -t); the default is amino acid (protein/peptide) sequences.

You can generate a list of all the available genomes with the '-l' option; in that case no downloads occur.

Usage:
  get_jgi_genomes [-u <username> -p <password>] | [-c <cookies>] [-f | -a | -P 12 | -m 3] (-i) (-l) (-A) (-C) (-g) (-t) (-q)

Required:
	-u <username>
	-p <password>
or
	-c <cookie file>
Portal Choice:
	-f Mycocosm aka fungi
	-a Phycocosm aka algae
	-P <version> PhytozomeV aka plants
	-m <version> MetazomeV aka metazoans
Portal File Options:
	-A get assembly
	-C get CDS
	-g get GFF
	-t get transcripts
JGI Taxa ID:
	-i <id> JGI ID of Genome Project
Other:
	-l list only, no downloads
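
For scripted use, the options above can be assembled into an argument list before calling the tool. The following is a minimal Python sketch, not part of get_jgi_genomes itself: the function name build_command and the dictionary keys ("fungi", "cds", and so on) are hypothetical labels chosen for illustration, and only the flags come from the usage text above.

```python
from typing import List, Optional

# Flags copied from the usage text; the dictionary keys are hypothetical labels.
PORTAL_FLAGS = {"fungi": "-f", "algae": "-a", "plants": "-P", "metazoans": "-m"}
VERSIONED_PORTALS = {"plants", "metazoans"}  # -P and -m take a version number
DATA_FLAGS = {"assembly": "-A", "cds": "-C", "gff": "-g", "transcripts": "-t"}


def build_command(
    cookie: str,
    portal: str,
    version: Optional[int] = None,
    data: Optional[str] = None,
    taxon_id: Optional[str] = None,
    list_only: bool = False,
    script: str = "./bin/get_jgi_genomes",
) -> List[str]:
    """Return an argument list for one get_jgi_genomes run (nothing is executed)."""
    cmd = [script, "-c", cookie, PORTAL_FLAGS[portal]]
    if portal in VERSIONED_PORTALS:
        if version is None:
            raise ValueError(f"portal '{portal}' needs a version number")
        cmd.append(str(version))
    if taxon_id is not None:
        cmd += ["-i", taxon_id]          # restrict to one genome project
    if data is not None:
        cmd.append(DATA_FLAGS[data])     # omitted: the default protein download
    if list_only:
        cmd.append("-l")                 # list only, no downloads
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True). Building a list rather than a shell string avoids quoting problems with unusual characters in file names or IDs.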

Notes

Phycocosm

As of writing (July 2020) the Phycocosm portal lists 77 genomes, but not all of them appear in the XML for that portal. Only about 37 are available there; the others, mostly from Archaeplastida, are available from Phytozome.

Phytozome

Currently versions 9 to 12 work with this script (point releases, e.g. 12.1, do not seem to work, so please use whole integers only). The newer 'phytozome-next', or V13, is available at "https://phytozome-next.jgi.doe.gov/". Currently, I see no way of adding access to it to the script. There is some form of limited CLI download, but it looks like you need an active connection in your browser to generate the download link, and you also have to select files via the clunky search interface (e.g. how do you select only the predicted proteins? It looks like you have to select them manually for each taxon).
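
If you wrap the script in your own tooling, the version rule above (whole integers from 9 to 12) can be checked before anything is run. This is a small Python sketch under that assumption; the function name is hypothetical and the script itself may validate versions differently.

```python
def is_supported_phytozome_version(value: str) -> bool:
    """True for the whole-integer versions 9 to 12 described in the notes."""
    text = value.strip()
    # Reject point releases such as "12.1" and anything that is not plain digits.
    if not (text.isascii() and text.isdigit()):
        return False
    return 9 <= int(text) <= 12
```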

Metazome

Metazome does not seem to be maintained and occasionally has file download issues. It is generally very slow, but version '3' seems to download. It also looks like it is being ported to the new style of interface used by phytozome-next.

Other

XML files are automatically refreshed after 10 days, or if you delete the file and re-run your commands.
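
The 10-day rule can be illustrated with a file-age check. This Python sketch assumes the age is judged from the file's modification time, which the notes do not state, so treat it as one plausible reading rather than the script's actual logic; the function name and parameters are hypothetical.

```python
import os
import time
from typing import Optional

REFRESH_AFTER_DAYS = 10  # taken from the note above
SECONDS_PER_DAY = 86400


def needs_refresh(path: str, now: Optional[float] = None,
                  max_age_days: int = REFRESH_AFTER_DAYS) -> bool:
    """True if the XML file is missing or older than max_age_days."""
    if not os.path.exists(path):
        return True  # a deleted file is fetched again on the next run
    current = time.time() if now is None else now
    age_seconds = current - os.path.getmtime(path)  # assumption: mtime marks the download
    return age_seconds > max_age_days * SECONDS_PER_DAY
```

For example, needs_refresh("phycocosm_files.xml") would report whether a cached Phycocosm XML file is due for a new download under this reading.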

Output is in your local folder within a directory named after the portal and the type of data you requested.
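
After a run it can be useful to confirm that files actually arrived in the output directory. The Python sketch below lists the regular files in a portal/data directory; the only directory names this page confirms are 'fungi' and 'pep' (quoted in an issue report further down), so any other names you pass in are assumptions to check against your own output.

```python
import os
from typing import List


def list_downloads(portal_dir: str, data_dir: str) -> List[str]:
    """Return the sorted names of regular files in <portal_dir>/<data_dir>."""
    target = os.path.join(portal_dir, data_dir)
    if not os.path.isdir(target):
        return []  # nothing has been written yet
    return sorted(
        name for name in os.listdir(target)
        if os.path.isfile(os.path.join(target, name))
    )
```

An empty result from list_downloads("fungi", "pep") after a run with '-l' is expected, because that option only lists files.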

Examples

To log in (quote the password so the shell does not expand characters such as '$'):

./bin/get_jgi_genomes -u <username> -p 'y0uR_P@$$W0r4'

To generate a list of all protein files from Mycocosm after you have logged in (list only, no files are downloaded):

./bin/get_jgi_genomes -c signon.cookie -f -l
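
An issue report further down this page quotes one line of the resulting fungi/pep_list.txt: a genome name, a portal ID and a download URL. The Python sketch below splits such a line, assuming the last two whitespace-separated fields are always the ID and the URL; the real separator (possibly tabs) is not shown on this page, and the names ListEntry and parse_list_line are hypothetical.

```python
from typing import NamedTuple


class ListEntry(NamedTuple):
    name: str       # e.g. "Aaosphaeria arxii CBS 175.79 v1.0"
    portal_id: str  # e.g. "Aaoar1"
    url: str        # download URL


def parse_list_line(line: str) -> ListEntry:
    """Split one list line into name, portal ID and URL."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"unexpected list line: {line!r}")
    # Assumption: the ID and URL contain no spaces and come last on the line.
    *name_parts, portal_id, url = fields
    return ListEntry(" ".join(name_parts), portal_id, url)
```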

To download all CDS files from Phycocosm after you have logged in:

./bin/get_jgi_genomes -c signon.cookie -a -C

To download all assembly files from Phytozome V12 after you have logged in:

./bin/get_jgi_genomes -c signon.cookie -P 12 -A

To download proteins of 'Boleraceacapitata' from Phytozome V12 after you have logged in:

./bin/get_jgi_genomes -c signon.cookie -P 12 -i Boleraceacapitata

To download transcripts of 'Trire2' from Mycocosm after you have logged in:

./bin/get_jgi_genomes -c signon.cookie -f -i Trire2 -t

Other Genome Download Tools

Broken?

If the downloads of the XML or AA files no longer work, it probably means JGI has changed something in the layout of their XML files. Let me know and I will try to update the script, or feel free to send a pull request with your fixes.

get_jgi_genomes's People

Contributors

guyleonard, siddacious

get_jgi_genomes's Issues

Download mycocosm protein files - pep directory empty

Hi! I was hoping to download all of the protein files from the JGI mycocosm database. I tried to do this using the example command you provided in the README file (./bin/get_jgi_genomes -c signon.cookie -f -l) but the resulting /fungi/pep directory is empty. The resulting /fungi/pep_list.txt has all of the file IDs and URLs (e.g., Aaosphaeria arxii CBS 175.79 v1.0 Aaoar1 https://genome.jgi.doe.gov/portal/Aaoar1/download/Aaoar1_GeneCatalog_proteins_20140429.aa.fasta.gz).

Do you have any idea why the pep files didn't download? I didn't notice any errors when running the 'get_jgi_genomes' script, so any insight is greatly appreciated!

Change default output?

Hi! I'm not much of a scripter so I was wondering: why exactly is amino acid sequence a default output? From my understanding, JGI stores genomes - so is there a step in the code that translates the nucleotide sequence? Is there a way to get nucleotide sequences as output? Thanks!

Example files and usage?

I want to download a couple of cyanobacteria genomes. Do you have any tutorials on how to run this with example files?

sed: 1: "phycocosm_files.xml": extra characters at the end of p command

I tried running the tool and got the following error:

(base) Joshs-MBP:get_jgi_genomes jolespin$ ./bin/get_jgi_genomes -u $USERNAME -p "$PASSWORD" -c signon.cookie -a
INFO: Attempting to login...
Running: curl --silent 'https://signon.jgi.doe.gov/signon/create' --data-urlencode '[email protected]' --data-urlencode 'password=Tri3for(3' -c signon.cookie > /dev/null
INFO: Login Successfull!
INFO: User Selected Phycocosm aka phycocosm
Downloading phycocosm XML - This may take a few minutes...
Running: curl 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get-directory?organism=phycocosm' -b signon.cookie > phycocosm_files.xml
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3023k    0 3023k    0     0  44061      0 --:--:--  0:01:10 --:--:--  833k
sed: 1: "phycocosm_files.xml": extra characters at the end of p command
Error 256 running command
