rvhonorato / cazy-parser

A way to extract specific information from CAZy

License: GNU General Public License v3.0

Python 97.51% TeX 2.49%
cazy scrapper carbohydrates data-mining enzymes text-mining

cazy-parser's Introduction

cazy-parser

A way to extract specific information from the Carbohydrate-Active enZYmes.

Make sure to visit and cite the CAZy website!

  • http://www.cazy.org/
  • Lombard V, Golaconda Ramulu H, Drula E, Coutinho PM, Henrissat B (2014) The Carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res 42:D490–D495. [PMID: 24270786].

License: GNU GPLv3

RV Honorato. CAZy-parser a way to extract information from the Carbohydrate-Active enZYmes Database. The Journal of Open Source Software, 1(8), dec 2016.

doi: 10.21105/joss.00053

Introduction

cazy-parser is a tool that extracts information from CAZy in a more usable and readable format. First, a script reads the HTML structure and creates a mirror of the database as a tab-delimited file. Second, information is extracted from that mirror according to user-supplied parameters and presented to the user as a set of accession codes.

Install / Upgrade

pip install --upgrade cazy-parser

Usage (internet connection required)

cazy-parser -h
usage: cazy-parser [-h] [-f FAMILY] [-s SUBFAMILY] [-c CHARACTERIZED] [-v] {GH,GT,PL,CA,AA}

positional arguments:
  {GH,GT,PL,CA,AA}

optional arguments:
  -h, --help            show this help message and exit
  -f FAMILY, --family FAMILY
  -s SUBFAMILY, --subfamily SUBFAMILY
  -c CHARACTERIZED, --characterized CHARACTERIZED
  -v, --version         show version

Example

Extract all FASTA sequences from subfamily 1 of Glycoside Hydrolase family 43:

$ cazy-parser GH -f 43 -s 1
 [2022-05-26 16:39:21,511 91 INFO] ------------------------------------------
 [2022-05-26 16:39:21,511 92 INFO]
 [2022-05-26 16:39:21,511 93 INFO] ┌─┐┌─┐┌─┐┬ ┬   ┌─┐┌─┐┬─┐┌─┐┌─┐┬─┐
 [2022-05-26 16:39:21,511 94 INFO] │  ├─┤┌─┘└┬┘───├─┘├─┤├┬┘└─┐├┤ ├┬┘
 [2022-05-26 16:39:21,511 95 INFO] └─┘┴ ┴└─┘ ┴    ┴  ┴ ┴┴└─└─┘└─┘┴└─ v2.0.1
 [2022-05-26 16:39:21,511 96 INFO]
 [2022-05-26 16:39:21,511 97 INFO] ------------------------------------------
 [2022-05-26 16:39:21,511 183 INFO] Fetching links for Glycoside-Hydrolases, url: http://www.cazy.org/Glycoside-Hydrolases.html
 [2022-05-26 16:39:22,454 189 INFO] Only using links of family 43 subfamily 1
 [2022-05-26 16:39:23,029 26 INFO] Dowloading 1415 fasta sequences...
 [2022-05-26 16:40:32,187 51 INFO] Dumping fasta sequences to file GH43_1_26052022.fasta

This generates the file GH43_1_DDMMYYYY.fasta, where DDMMYYYY is the date of the run, containing the FASTA sequences.
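As a quick check of the generated file, the sequences can be loaded and counted with a few lines of Python. This is a minimal sketch, not part of cazy-parser: `read_fasta` is a hypothetical helper, and it assumes the output is standard FASTA (header lines starting with `>`, sequence lines following) with unique headers.

```python
from pathlib import Path


def read_fasta(path):
    """Return a dict mapping each header (without '>') to its sequence.

    Assumes standard FASTA layout. If a header occurs twice, the later
    record replaces the earlier one.
    """
    records = {}
    header = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}
```

For example, `len(read_fasta("GH43_1_26052022.fasta"))` gives the number of records in the file from the run above, which can be compared with the count reported in the log.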

To-do and how to contribute

Please refer to CONTRIBUTING 🤓

cazy-parser's People

Contributors

arfon, dependabot[bot], rvhonorato

cazy-parser's Issues

not all members extracted

Hi Rodrigo,
I tried to extract a few families and I noticed that out of e.g. 3000 entries in CAZy, I only get 250 when I run cazy-parser.
Is that the expected behavior?

Thanks,
George

Retrieve only the experimentally characterized cazy?

Hi,
I want to retrieve only the experimentally characterized CAZy entries. Is the -c CHARACTERIZED argument meant for this? I'm not sure how to set it, since adding -c True does not change the number of downloaded sequences.
Thanks!

Retrieve structures

Some enzymes have a structural annotation in CAZy. For the ones that do not, we should also be able to retrieve structures from AlphaFold.

Using UniProt's ID Mapping API, it should be possible to take all GenBank results of a given query, map them to their UniProt IDs, and recover the corresponding AlphaFold structures.
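The first half of that idea could be sketched as follows. This only builds the requests and does not send them. The endpoint URLs, the database names `EMBL-GenBank-DDBJ_CDS` and `UniProtKB`, and both helper functions are assumptions made for illustration; they should be checked against the current UniProt and AlphaFold documentation before use.

```python
import urllib.parse
import urllib.request

# Assumed service endpoints; verify against the official documentation.
UNIPROT_IDMAPPING_RUN = "https://rest.uniprot.org/idmapping/run"
ALPHAFOLD_PREDICTION = "https://alphafold.ebi.ac.uk/api/prediction"


def build_mapping_request(genbank_ids,
                          source_db="EMBL-GenBank-DDBJ_CDS",
                          target_db="UniProtKB"):
    """Build (but do not send) a POST request that submits a mapping job.

    The database names are assumptions about UniProt's naming scheme.
    """
    payload = urllib.parse.urlencode({
        "from": source_db,
        "to": target_db,
        "ids": ",".join(genbank_ids),
    }).encode()
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(UNIPROT_IDMAPPING_RUN, data=payload)


def alphafold_prediction_url(uniprot_id):
    """Return the assumed AlphaFold lookup URL for one UniProt accession."""
    return f"{ALPHAFOLD_PREDICTION}/{uniprot_id}"
```

Submitting the request starts an asynchronous job, so a full implementation would also need to poll for completion and then fetch the mapped IDs before querying AlphaFold.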

`create_cazy_db` fails

Unable to create the database on Python 2.7.13. The output (excluding the BeautifulSoup warning) is as follows:

>> Gathering species codes for species with full genomes
>> Glycoside-Hydrolases
>> 145 families found on http://www.cazy.org/Glycoside-Hydrolases.html
> GH1

then error

first_page_idx = int(page_index_list[0]['href'].split('PRINC=')[-1].split('#')[0]) # be careful with this
ValueError: invalid literal for int() with base 10: 'GH1_archaea.html?debut_TAXO=100'

Has the pagination code changed for the expression to fail?
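One way to make this step less dependent on the exact parameter name is to parse the query string instead of splitting on `PRINC=`. The sketch below is an illustration only: `page_offset` is a hypothetical helper, and it assumes the page offset is the first purely numeric query value in the link.

```python
from urllib.parse import parse_qs, urlsplit


def page_offset(href):
    """Return the first integer query value in href, or None if there is none.

    Hypothetical replacement for the split('PRINC=') expression; it does
    not rely on the name of the pagination parameter.
    """
    query = urlsplit(href).query
    for values in parse_qs(query).values():
        for value in values:
            if value.isdigit():
                return int(value)
    return None


# The link from the error message above:
print(page_offset("GH1_archaea.html?debut_TAXO=100"))  # 100
```

Returning `None` for links without a numeric parameter lets the caller decide whether a family has a single page or whether the page layout has changed again.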

Tests dependent on hardcoded family size

The following test checks for a hardcoded value, so it will eventually fail as the database grows:

def test_retrieve_genbank_ids():
    # ...
    assert len(observed_id_list) == 1163
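A less brittle variant could assert a lower bound instead of an exact count. The sketch below is an assumption about how the check might be rewritten: `check_id_list` and `MIN_EXPECTED_IDS` are hypothetical names, and the bound reuses the count from the original test.

```python
# Count from the original test, now used as a floor rather than an exact
# value (hypothetical constant name).
MIN_EXPECTED_IDS = 1163


def check_id_list(observed_id_list, minimum=MIN_EXPECTED_IDS):
    """Assertions that keep passing when the family gains members."""
    assert len(observed_id_list) >= minimum
    # Every entry should be a non-empty string.
    assert all(isinstance(i, str) and i for i in observed_id_list)
```

A lower bound still fails if CAZy removes entries from the family, so testing against a saved copy of the page would be the stricter alternative.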

Skip fasta fetching if entrez endpoint is not accessible

In the last step of the execution, the FASTA IDs are passed to NCBI's Entrez efetch endpoint to retrieve the full sequences. However, that endpoint might not be accessible, which causes the execution to fail.

A possible fallback in case of failure is to dump the IDs and instruct the user to manually fetch them.
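The fallback described above might look like the sketch below. It is not the project's implementation: `fetch_or_dump` is a hypothetical helper, and `fetch` stands in for whatever function performs the efetch call. It assumes connection problems surface as `OSError`, which covers `urllib.error.URLError` and the built-in connection errors.

```python
def fetch_or_dump(id_list, fetch, dump_path):
    """Try to fetch sequences; on a network error, write the IDs to a file.

    fetch is a hypothetical callable that takes the ID list and returns
    the sequences. Returns its result, or None if the fallback was used.
    """
    try:
        return fetch(id_list)
    except OSError as err:
        # Keep the work done so far by saving the IDs, one per line.
        with open(dump_path, "w") as handle:
            handle.write("\n".join(id_list) + "\n")
        print(f"Could not reach the endpoint ({err}); "
              f"IDs written to {dump_path} for manual download.")
        return None
```

The caller can then skip writing the FASTA file when the result is `None`, instead of aborting the whole run.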

port to Python 3.x?

Great library. Adding a couple of brackets should make it compatible with Python 3.

fail on create_cazy_db

Hello,
When using create_cazy_db, it fails with the following error:

Gathering species with full genomes
Elapsed Time: 0:00:00 N/A% [ ] ETA: --:--:--
Traceback (most recent call last):
  File "/usr/local/bin/create_cazy_db", line 11, in
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/cazy_parser/create_cazy_db.py", line 292, in main
    species_dic = fetch_species()
  File "/usr/local/lib/python2.7/dist-packages/cazy_parser/create_cazy_db.py", line 69, in fetch_species
    f = urllib.request.urlopen(link)
AttributeError: 'module' object has no attribute 'request'

Am I missing something?
Thanks for developing this!
Best,
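The traceback above is what Python 2.7 produces when code calls `urllib.request`, which only exists in Python 3. If both interpreters had to be supported, a common pattern is a guarded import, sketched here. `fetch_page` is a hypothetical wrapper used for illustration, not a function from cazy-parser.

```python
try:
    # Python 3 location of urlopen
    from urllib.request import urlopen
except ImportError:
    # Python 2 fallback
    from urllib2 import urlopen


def fetch_page(link):
    """Return the raw bytes of the page at link (hypothetical helper)."""
    return urlopen(link).read()
```

Running the tool under Python 3 avoids the problem altogether.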

Errors occurred using pip installation

Thanks for your recent update to this repo. It is very helpful.
I ran into errors while installing the package with pip as described in the README:

  1. The 'cazy-parser' command cannot be called from bash.
  2. The --upgrade flag does not upgrade the package.

Fortunately, configuring and installing from setup.py worked.
It may not be a critical error; I just wanted to leave a note.
Again, thank you for your great work!
