rvhonorato / cazy-parser

A way to extract specific information from CAZy

License: GNU General Public License v3.0


cazy-parser's Introduction

cazy-parser

A way to extract specific information from the Carbohydrate-Active enZYmes (CAZy) database.

Make sure to visit and cite the CAZy website!

http://www.cazy.org/
Lombard V, Golaconda Ramulu H, Drula E, Coutinho PM, Henrissat B (2014) The Carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res 42:D490–D495. [PMID: 24270786].

License: GNU GPLv3

RV Honorato. CAZy-parser a way to extract information from the Carbohydrate-Active enZYmes Database. The Journal of Open Source Software, 1(8), dec 2016.

doi: 10.21105/joss.00053

Introduction

cazy-parser is a tool that extracts information from CAZy in a more usable and readable format. First, a script reads the HTML structure and creates a mirror of the database as a tab-delimited file. Second, information is extracted from the database according to user-supplied parameters and presented to the user as a set of accession codes.
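As a rough illustration of the first step, the sketch below turns an HTML table into tab-delimited text using only the standard library. It is not the tool's own code: the class name `TableToRows`, the function `html_table_to_tsv`, and the sample markup are all hypothetical, and real CAZy pages are more complex than this fragment.

```python
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_tsv(html):
    """Return the table rows found in `html` as tab-delimited lines."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)


# Hypothetical fragment for illustration only, not real CAZy markup.
SAMPLE = (
    "<table>"
    "<tr><th>Protein Name</th><th>Organism</th><th>GenBank</th></tr>"
    "<tr><td>example protein</td><td>Example organism</td><td>ABC12345.1</td></tr>"
    "</table>"
)
tsv = html_table_to_tsv(SAMPLE)
```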

Install / Upgrade

pip install --upgrade cazy-parser

Usage (internet connection required)

cazy-parser -h
usage: cazy-parser [-h] [-f FAMILY] [-s SUBFAMILY] [-c CHARACTERIZED] [-v] {GH,GT,PL,CA,AA}

positional arguments:
  {GH,GT,PL,CA,AA}

optional arguments:
  -h, --help            show this help message and exit
  -f FAMILY, --family FAMILY
  -s SUBFAMILY, --subfamily SUBFAMILY
  -c CHARACTERIZED, --characterized CHARACTERIZED
  -v, --version         show version
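
For readers who want to script against the same interface, the help text above can be approximated with `argparse`. This is a sketch reconstructed from the usage message only; the destination name `enzyme_class`, the version string, and the absence of type conversion are assumptions, and the tool's actual parser may differ.

```python
import argparse


def build_parser():
    """Approximate the command-line interface shown in the help text."""
    parser = argparse.ArgumentParser(prog="cazy-parser")
    # Positional choice of enzyme class; the name `enzyme_class` is hypothetical.
    parser.add_argument("enzyme_class", choices=["GH", "GT", "PL", "CA", "AA"])
    parser.add_argument("-f", "--family")
    parser.add_argument("-s", "--subfamily")
    parser.add_argument("-c", "--characterized")
    # Placeholder version string; the real value comes from the package.
    parser.add_argument("-v", "--version", action="version",
                        version="%(prog)s (version placeholder)")
    return parser


# Same arguments as the example command `cazy-parser GH -f 43 -s 1`.
args = build_parser().parse_args(["GH", "-f", "43", "-s", "1"])
```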

Example

Extract all fasta sequences from subfamily 1 of Glycoside Hydrolase family 43

$ cazy-parser GH -f 43 -s 1
 [2022-05-26 16:39:21,511 91 INFO] ------------------------------------------
 [2022-05-26 16:39:21,511 92 INFO]
 [2022-05-26 16:39:21,511 93 INFO] ┌─┐┌─┐┌─┐┬ ┬   ┌─┐┌─┐┬─┐┌─┐┌─┐┬─┐
 [2022-05-26 16:39:21,511 94 INFO] │  ├─┤┌─┘└┬┘───├─┘├─┤├┬┘└─┐├┤ ├┬┘
 [2022-05-26 16:39:21,511 95 INFO] └─┘┴ ┴└─┘ ┴    ┴  ┴ ┴┴└─└─┘└─┘┴└─ v2.0.1
 [2022-05-26 16:39:21,511 96 INFO]
 [2022-05-26 16:39:21,511 97 INFO] ------------------------------------------
 [2022-05-26 16:39:21,511 183 INFO] Fetching links for Glycoside-Hydrolases, url: http://www.cazy.org/Glycoside-Hydrolases.html
 [2022-05-26 16:39:22,454 189 INFO] Only using links of family 43 subfamily 1
 [2022-05-26 16:39:23,029 26 INFO] Dowloading 1415 fasta sequences...
 [2022-05-26 16:40:32,187 51 INFO] Dumping fasta sequences to file GH43_1_26052022.fasta

This will generate the file GH43_1_DDMMYYYY.fasta containing the fasta sequences.
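
The date suffix in the file name follows a day-month-year pattern, as the log line `GH43_1_26052022.fasta` for a run on 2022-05-26 shows. A minimal sketch of how such a name could be built is given below; the function `fasta_filename` is hypothetical, and the form used when no subfamily is given is an assumption.

```python
from datetime import date


def fasta_filename(enzyme_class, family, subfamily=None, today=None):
    """Build a name like GH43_1_DDMMYYYY.fasta (hypothetical helper)."""
    today = today or date.today()
    name = f"{enzyme_class}{family}"
    if subfamily is not None:
        # Assumption: the subfamily part is simply omitted when not requested.
        name += f"_{subfamily}"
    return f"{name}_{today.strftime('%d%m%Y')}.fasta"


# Reproduces the name from the example run on 26 May 2022.
example = fasta_filename("GH", 43, 1, today=date(2022, 5, 26))
```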

To-do and how to contribute

Please refer to CONTRIBUTING 🤓

cazy-parser's Issues

Retrieve structures

Some enzymes have a structural annotation in CAZy. For the ones that do not, we should also be able to retrieve structures from AlphaFold.

Using UniProt's ID Mapping API, it should be possible to take all GenBank results of a given query, map them to their UniProt IDs, and recover the corresponding AlphaFold structures.
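
A possible shape for this feature is sketched below without making any network calls. It only prepares the form body for a UniProt ID Mapping job and builds a download URL for a mapped accession. The helper names are hypothetical, and both the database label `EMBL-GenBank-DDBJ_CDS` and the AlphaFold file URL pattern (including the model version) are assumptions that should be checked against the current UniProt and AlphaFold DB documentation.

```python
from urllib.parse import urlencode

# Endpoint of the UniProt ID Mapping service that accepts new jobs.
IDMAPPING_RUN_URL = "https://rest.uniprot.org/idmapping/run"


def idmapping_form(genbank_ids):
    """Encode the POST body for a GenBank -> UniProtKB mapping job."""
    fields = {
        "from": "EMBL-GenBank-DDBJ_CDS",  # assumed label for GenBank protein IDs
        "to": "UniProtKB",
        "ids": ",".join(genbank_ids),
    }
    return urlencode(fields).encode("ascii")


def alphafold_model_url(uniprot_accession, version=4):
    """Guess the AlphaFold DB file URL for one accession (assumed pattern)."""
    return (
        "https://alphafold.ebi.ac.uk/files/"
        f"AF-{uniprot_accession}-F1-model_v{version}.pdb"
    )


# Placeholder identifiers for illustration only.
form = idmapping_form(["ABC12345.1", "XYZ67890.1"])
url = alphafold_model_url("P12345")
```

The form would be posted to `IDMAPPING_RUN_URL`, and the job polled until results are available; that part is omitted here because it depends on network access.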

Audit url open for permitted schemes. Allowing use of file:/ or custom schemes is often unexpected.

Codacy detected an issue:

Message: `Audit url open for permitted schemes. Allowing use of file:/ or custom schemes is often unexpected.`

Occurred on:

Commit: b62921e
File: src/cazy_parser/modules/html.py
LineNum: 122
Code: soup = BeautifulSoup(urllib.request.urlopen(link), features="html.parser")

Currently on:

Commit: 104e4a2
File: src/cazy_parser/modules/html.py
LineNum: 122
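
One common way to address this kind of warning is to check the URL scheme before calling `urlopen`. The sketch below assumes only `http` and `https` links are ever needed; the names `ALLOWED_SCHEMES`, `checked_url`, and `open_checked` are hypothetical and not part of the project.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

# Assumption: the scraper only ever needs plain web links.
ALLOWED_SCHEMES = {"http", "https"}


def checked_url(link):
    """Return `link` unchanged, or raise if its scheme is not permitted."""
    scheme = urlparse(link).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"refusing to open URL with scheme {scheme!r}: {link}")
    return link


def open_checked(link, timeout=30):
    """Open `link` only after the scheme check has passed."""
    return urlopen(checked_url(link), timeout=timeout)
```

With this in place, the flagged line could pass `open_checked(link)` to `BeautifulSoup` instead of calling `urllib.request.urlopen(link)` directly.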

Skip fasta fetching if entrez endpoint is not accessible

In the last step of the execution, the fasta IDs are passed to NCBI's Entrez efetch endpoint to get the full sequences. However, the endpoint might not be accessible, which causes the execution to fail.

A possible fallback in case of failure is to dump the IDs and instruct the user to manually fetch them.
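
The fallback described above could look roughly like the sketch below. The function `fetch_or_dump_ids` and its parameters are hypothetical; `fetch` stands in for whatever function actually queries the endpoint, and it is assumed to raise an `OSError` (which includes `urllib.error.URLError`) when the endpoint cannot be reached.

```python
def fetch_or_dump_ids(id_list, fetch, dump_path):
    """Try to fetch sequences; on a network failure, save the IDs instead.

    `fetch` is a stand-in for the real download function (hypothetical).
    Returns the fetched data, or None if the IDs were dumped to `dump_path`.
    """
    try:
        return fetch(id_list)
    except OSError as err:  # urllib.error.URLError is a subclass of OSError
        with open(dump_path, "w") as handle:
            handle.write("\n".join(id_list) + "\n")
        print(
            f"Could not reach the sequence endpoint ({err}). "
            f"IDs were written to {dump_path}; please fetch them manually."
        )
        return None
```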

Retrieve only the experimentally characterized cazy?

Hi,
I want to retrieve only the experimentally characterized CAZy entries. Is the -c CHARACTERIZED argument meant for this? I'm not sure how to set it, since adding -c True does not change the number of downloaded sequences.
Thanks!

port to Python 3.x?

great library, and a couple of brackets should make it compatible w/ Python 3

Errors occurred using pip installation

Thanks for your recent update to this repo. It is very, very helpful.
Additionally, I encountered errors while installing the package with pip as described in the README. They are shown below:

CANNOT call 'cazy-parser' from bash.
The --upgrade flag does not upgrade the package.

Fortunately, configuring and installing from setup.py worked.
It may not be a critical error; I just wanted to leave a note.
Again, thank you for your great work!

fail on create_cazy_db

Hello,
When using create_cazy_db, it fails with the following error:

Gathering species with full genomes
Elapsed Time: 0:00:00 N/A% [ ] ETA: --:--:--
Traceback (most recent call last):
  File "/usr/local/bin/create_cazy_db", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/cazy_parser/create_cazy_db.py", line 292, in main
    species_dic = fetch_species()
  File "/usr/local/lib/python2.7/dist-packages/cazy_parser/create_cazy_db.py", line 69, in fetch_species
    f = urllib.request.urlopen(link)
AttributeError: 'module' object has no attribute 'request'

Am I missing something?
Thanks for developing this!
Best,
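
The paths in the traceback point to Python 2.7, where the `urllib` module has no `request` submodule; `urllib.request` exists only in Python 3. Running the script with Python 3 is the simplest way around this. If both interpreters had to be supported, a compatibility import along these lines is a common pattern (a sketch, not the project's code):

```python
try:
    # Python 3: urlopen lives in urllib.request.
    from urllib.request import urlopen
except ImportError:
    # Python 2: the equivalent function is provided by urllib2.
    from urllib2 import urlopen
```

The rest of the code would then call `urlopen(link)` rather than `urllib.request.urlopen(link)`.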

AttributeError while parsing

Hi Rodrigo,

I still have a problem with the script. Could you take a look? Thanks in advance.

Tests dependent on hardcoded family size

The following test checks for a hardcoded value, so it will eventually fail as the database grows:

def test_retrieve_genbank_ids():
    # ...
    assert len(observed_id_list) == 1163
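
One way to make such a check tolerate growth is to assert a lower bound instead of an exact count. The sketch below assumes the family only ever gains entries; `MIN_EXPECTED_IDS` and `check_id_list` are hypothetical names, and the bound simply reuses the value from the existing test.

```python
# Count taken from the existing test; treated here as a lower bound only.
MIN_EXPECTED_IDS = 1163


def check_id_list(observed_id_list):
    """Assert that at least the previously seen number of IDs came back."""
    assert len(observed_id_list) >= MIN_EXPECTED_IDS
    # Every entry should be a non-empty string identifier.
    assert all(isinstance(i, str) and i for i in observed_id_list)
```

An alternative design is to run the test against a saved copy of the page, so the expected count never changes.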

not all members extracted

Hi Rodrigo,
I tried to extract a few families and I noticed that out of e.g. 3000 entries in CAZy, I only get 250 when I run cazy-parser.
Is that the expected behavior?

Thanks,
George

organizing as a setup.py compatible project for pypi etc

Here from the JOSS review thread:

This project is not organized in the canonical way a distributable piece of code should be in the standard Python universe. I recommend reviewing https://packaging.python.org/ and considering distributing these as entry_points console scripts.
Organizing and distributing your code in standard ways makes it more likely that other OSS users can take advantage of your brilliance.
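
To make the suggestion concrete, the sketch below shows what an `entry_points` declaration might look like in a `setup.py`. The module path `cazy_parser.create_cazy_db` and its `main` function are taken from the traceback in the "fail on create_cazy_db" issue; everything else, including the package list, is an assumption rather than the project's actual configuration.

```python
# Keyword arguments that a setup.py could pass to setuptools.setup().
SETUP_KWARGS = {
    "name": "cazy-parser",
    "packages": ["cazy_parser"],  # assumed package layout
    "entry_points": {
        "console_scripts": [
            # Installs a `create_cazy_db` command that calls main().
            "create_cazy_db = cazy_parser.create_cazy_db:main",
        ]
    },
}


def run_setup():
    """Call setuptools with the arguments above (only from a real setup.py)."""
    from setuptools import setup

    setup(**SETUP_KWARGS)
```

With a declaration like this, `pip install` creates the command-line wrapper automatically, which is what lets users call the script by name.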

`create_cazy_db` fails

Unable to create database on Python 2.7.13. Output (excluding BeautifulSoup warning) as follows:

>> Gathering species codes for species with full genomes
>> Glycoside-Hydrolases
>> 145 families found on http://www.cazy.org/Glycoside-Hydrolases.html
> GH1

then error

first_page_idx = int(page_index_list[0]['href'].split('PRINC=')[-1].split('#')[0]) # be careful with this
ValueError: invalid literal for int() with base 10: 'GH1_archaea.html?debut_TAXO=100'

Has the pagination code changed for the expression to fail?
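
The failing expression splits the link on the literal text `PRINC=`, so a link whose query uses a different parameter name, such as `debut_TAXO` in the error above, passes through unsplit and cannot be converted to an integer. A more tolerant approach is to parse the query string and take the numeric value regardless of the parameter name. The sketch below assumes the page offset is the only integer-valued query parameter in these links; `page_offset` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlparse


def page_offset(href):
    """Return the first integer query value in `href`, or None if absent."""
    # urlparse drops any '#fragment', so no manual split on '#' is needed.
    query = parse_qs(urlparse(href).query)
    for values in query.values():
        for value in values:
            if value.isdigit():
                return int(value)
    return None


# The link from the error message parses cleanly with this approach.
offset = page_offset("GH1_archaea.html?debut_TAXO=100")
```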