widdowquinn / ncfp

Program and package that retrieves nucleotide coding sequences from NCBI that correspond to a set of input protein sequences.

Home Page: https://widdowquinn.github.io/ncfp/

License: MIT License

Python 96.92% Makefile 3.08%
bioinformatics sequence-to-sequence backtranslation ncbi uniprot protein-to-nucleotide

ncfp's Introduction

README.md - ncfp

This repository contains code for a script that identifies and writes the nucleotide sequence corresponding to each protein in an input multiple sequence file. The output can be used, for example, to backthread coding sequences onto protein alignments for phylogenetic analyses. ncfp uses the NCBI accession or UniProt gene name (as appropriate) to identify source nucleotide sequences in the NCBI databases, download them, and write them to a file.

Quickstart: ncfp at the command-line

Given an input file of protein sequences <INPUT>.fasta, an output directory <OUTPUT>, and a user email address <EMAIL> to supply to NCBI, the following command generates two files, <OUTPUT>/ncfp_aa.fasta and <OUTPUT>/ncfp_nt.fasta:

ncfp <INPUT>.fasta <OUTPUT> <EMAIL>

The file <OUTPUT>/ncfp_aa.fasta contains the amino acid sequences for all input proteins for which a corresponding nucleotide coding sequence could be identified, in FASTA format.

The file <OUTPUT>/ncfp_nt.fasta contains nucleotide coding sequences, where they could be found, for all the input proteins, in FASTA format.

Any input protein sequences for which a corresponding nucleotide sequence cannot be recovered, for any reason, are placed in the file <OUTPUT>/skipped.fas.

To find out more about what ncfp can do, try

ncfp --help

at the command line.

Documentation

For more detailed information about ncfp as a program, or using the underlying ncbi_cds_from_protein Python module, please see the stable version documentation at https://ncfp.readthedocs.io/en/stable/

License

Unless otherwise indicated, all code is licensed under the MIT license and subject to the following agreement:

(c) The James Hutton Institute 2017-2019
(c) The University of Strathclyde 2019-2023
Author: Leighton Pritchard

Contact: [email protected]

Address:
Leighton Pritchard,
Strathclyde Institute for Pharmacy and Biomedical Sciences,
Cathedral Street,
Glasgow,
G4 0RE,
Scotland,
UK

The MIT License

Copyright (c) 2017-2019 The James Hutton Institute
Copyright (c) 2019-2023 The University of Strathclyde

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

ncfp's Issues

Writing log file fails if only filename given

Summary:

When using -l <FILENAME>, a FileNotFoundError is thrown if <FILENAME> has no leading directory component.

Description:

Attempting to open the log file splits the path to identify the parent directory/ies. If no leading path to a file in the current directory is given, e.g. myfile.log instead of ./myfile.log, os.makedirs() attempts to create a directory '', which fails.

The solution may be to move to pathlib for all filepath handling.
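A minimal sketch of the pathlib approach suggested above. The helper name prepare_logfile is hypothetical and is not part of ncfp. It relies on the fact that Path("myfile.log").parent is "." rather than an empty string, so creating the parent directory no longer fails for a bare filename.

```python
from pathlib import Path


def prepare_logfile(logfile):
    """Return a Path for the log file, creating parent directories if needed.

    Hypothetical helper for illustration; not part of ncfp.
    """
    path = Path(logfile)
    # For a bare filename, path.parent is Path("."), which already exists,
    # so mkdir(..., exist_ok=True) does nothing instead of raising.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

Calling prepare_logfile("myfile.log") and prepare_logfile("./myfile.log") should then behave the same way.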

Reproducible Steps:

Any ncfp command with -l myfile.txt in the command.

Current Output:

A FileNotFoundError

Expected Output:

No FileNotFoundError

UniProt warning

Summary:

When using ncfp, a warning is thrown by BioServices and downloads fail.

Description:

With the command below:

ncfp -v -l 2022-07-20_th.log -c local_cache --keepcache -s helixalifil1.fasta ncfp_out [email protected]

using the attached input file, the following error occurs:

[...]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: 1020 sequence records read successfully from helixalifil1.fasta
[INFO] [ncbi_cds_from_protein.sequences]: Processing sequences...
1/5 Process input sequences:   0%|                                                                                                                                     | 0/1020 [00:00<?, ?it/s]WARNING [bioservices.UniProt:596]:  status is not ok with Bad Request
1/5 Process input sequences:   0%|                                                                                                                                     | 0/1020 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/anaconda3/envs/ncfp_py310/bin/ncfp", line 33, in <module>
    sys.exit(load_entry_point('ncfp', 'console_scripts', 'ncfp')())
  File "/Users/lpritc/Documents/Development/GitHub/ncfp/ncbi_cds_from_protein/scripts/ncfp.py", line 267, in run_main
    qrecords, qskipped = process_sequences(seqrecords, cachepath, args.disabletqdm)
  File "/Users/lpritc/Documents/Development/GitHub/ncfp/ncbi_cds_from_protein/sequences.py", line 128, in process_sequences
    qstring = result.split("\n")[1].strip()[:-1]
AttributeError: 'int' object has no attribute 'split'

With the --debug option set, the additional useful output is:

[INFO] [ncbi_cds_from_protein.sequences]: Processing sequences...
[DEBUG] [ncbi_cds_from_protein.sequences]: Guessing sequence type for tr|A0A258M961|A0A258M961_9BURK/52-81...
[DEBUG] [ncbi_cds_from_protein.sequences]: ...guessed UniProt
[DEBUG] [ncbi_cds_from_protein.sequences]: Uniprot record has GN field: B7Y67_11790

helixalifil1.fasta.txt
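The traceback shows that result can be an int rather than text when the UniProt request fails. The sketch below guards the same expression against that case, and against a response with fewer than two lines. The function name extract_query_string is hypothetical, and the assumption that the data row ends in a single separator character is inferred only from the [:-1] slice in the traceback.

```python
def extract_query_string(result):
    """Return the query string from a two-line text response, or None if unusable.

    Hypothetical helper: `result` stands in for whatever the UniProt lookup
    returned. The traceback shows it is not always a string.
    """
    if not isinstance(result, str):
        # For example, a status code returned in place of the response body.
        return None
    lines = result.split("\n")
    if len(lines) < 2 or not lines[1].strip():
        # Header only, or an empty body: there is no data row to parse.
        return None
    # Same expression as in the traceback: second line, last character dropped.
    return lines[1].strip()[:-1]
```

A caller could treat a None return as a reason to skip the record rather than stop the whole run.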

ncfp Version:

Commit 5e7c612

Python Version:

3.10

Operating System:

macOS

Stockholm domain format doesn't work with non-UniProt FASTA sequences

Summary:

CDS feature extraction uses the GN=.* regex, but the GN= field is missing when Stockholm domains are added to NCBI FASTA headers. As a result, the corresponding features are not found, leading to false negatives.

We should add an additional check for the sequence ID, not just the GN= field, when that is missing.
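One way the proposed fallback might look, as a sketch only. The function name cds_search_term and both patterns are assumptions made for illustration: the GN= pattern here captures a single whitespace-delimited token rather than using GN=.*, and the Stockholm suffix is assumed to have the form /start-end as in the headers shown elsewhere on this page.

```python
import re

# Assumed patterns, chosen for this illustration only.
GN_PATTERN = re.compile(r"GN=(\S+)")
STOCKHOLM_SUFFIX = re.compile(r"/\d+-\d+$")


def cds_search_term(seq_id, description):
    """Return the GN= value if present, else the ID without Stockholm coordinates."""
    match = GN_PATTERN.search(description)
    if match:
        return match.group(1)
    # No GN= field (e.g. an NCBI-style header): fall back to the sequence ID.
    return STOCKHOLM_SUFFIX.sub("", seq_id)
```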

Use protein IDs in output

Summary:

Provide an option to use the protein IDs of the query protein sequences in the FASTA file output.

Description:

Backtranslating nucleotide sequences onto aligned protein sequences (for example, using tools such as tcoffee) requires each nucleotide sequence and its associated protein sequence to be identifiable by a shared ID.

ncfp writes out the ID retrieved from the nucleotide record. Sometimes this is the same ID as the query protein sequence, sometimes this is a different ID. Therefore, using the ncfp output for backthreading nucleotide sequences onto a protein MSA requires additional parsing of the ncfp output, to overwrite the IDs listed in the FASTA output with the IDs of the query protein sequences.

An option such as --use_protein_ids could be included, so that ncfp writes out the protein ID of the query protein for each nucleotide sequence written to the resulting FASTA file.

Current Output:

The ID of the nucleotide record retrieved from NCBI Entrez.

>AN2569.2 coding sequence

Expected Output:

The protein ID of the query protein sequence provided to ncfp

>EAA64674.1 coding sequence
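The relabelling requested here amounts to replacing the first token of each FASTA header. A minimal sketch follows, using a hypothetical function relabel_header that works on plain header strings; it does not show how ncfp itself handles records.

```python
def relabel_header(header, protein_id):
    """Replace the first token of a FASTA header with the query protein ID.

    Hypothetical helper for illustration; not part of ncfp.
    """
    body = header[1:] if header.startswith(">") else header
    parts = body.split(maxsplit=1)
    # Keep whatever followed the original ID (the description), if anything.
    rest = parts[1] if len(parts) > 1 else ""
    return (">" + protein_id + " " + rest).rstrip()
```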

ncfp Version:

v0.2.0

Python Version:

v3.8.6

Operating System:

Ubuntu 20.04.2 LTS

Incorrect protein sequences being retrieved for some accessions

Summary:

Input protein sequences deriving from a known organism (e.g. human) are retrieving nucleotide sequences from a different organism (e.g. Bos taurus).

Description:

The input sequence

>CAD6020544.1/6-36 amtB [Escherichia coli] GN=CAD6020544.1
------DKADNAFMMICTALVLFMTIPGIALFYGGLI

does not give an output nucleotide sequence, as the wrong originating sequence is identified in the Elink linker step.

Reproducible Steps:

With the above sequence as input, run ncfp as normal:

Current Output:

[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Sequence CAD6020544.1/6-36 matches GenBank entry X60065.1
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Searching for CDS: CAD6020544.1
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Could not identify CDS feature for CAD6020544.1/6-36

Expected Output:

A nucleotide coding sequence corresponding to the input protein, in the output directory.

ncfp Version:

commit 694d806

Python Version:

3.9

Operating System:

macOS

When multiple possible CDS matches are found, sometimes the 'wrong one' is processed.

Summary:

Some queries are ambiguous in terms of matching CDS (e.g. a gene name is provided, but no protein_id or other precise accession), and the "wrong one" can be recovered. This currently has two potential results:

  1. the query is skipped because the conceptual translation doesn't match the query
  2. an error is thrown because the Stockholm region of the query falls outside the coding sequence

Reproducible Steps:

With

>tr|A0A127QBK9|A0A127QBK9_9BURK/414-438 [subseq from] Ammonium transporter OS=Collimonas pratensis OX=279113 GN=amt PE=3 SV=1
DVFGVHGVGGIMGALLTGVFAAPSL

saved as bad_sequence.fasta, run

ncfp --unify_seqid --debug -l bad_sequence.log -s bad_sequence.fasta bad_sequence [email protected]

Current Output:

    Key: translation, Value: ['MPINIGNTAFMLLCSSLVMLMTPGLAFFYGGLVGRKNVLAIMMQSFISLGWTTVLWFAFGYSMCFGPSWHGIIGDPTYYAFLHGITLSSMYTGNDAGIPLIVHVAYQMMFAIITPALITGAFANRVTFKAYFLFLTGWLVFVYFPFVHMVWSPDGLFAKWGVLDYAGGIVVHNTAGFAALASVLYVGRRQKVELKPHNVPLIALGSGLLWFGWYGFNAGSEFRVDAVTASAFLNTDVAASFGAITWLFIEWFYHKKPKFIGLLTGGVAGLATITPAAGYVSLGTAAIIGICAGLICFYAVALKNRLGWDDALDVWGVHGVGGMAGTILLGVFASKAWNANGADGLLLGNTSFFFAQCGAVIISGIWAFAFTYGMLWLINLFTPVKVGAATQDRMDEDLHGEDAYLHA']

[DEBUG] [ncbi_cds_from_protein.sequences]: Trimming CDS to Stockholm coordinates: 414..438
Traceback (most recent call last):
  File "/Users/lpritc/opt/anaconda3/envs/ncfp_py310/bin/ncfp", line 33, in <module>
    sys.exit(load_entry_point('ncfp', 'console_scripts', 'ncfp')())
  File "/Users/lpritc/Development/ncfp/ncbi_cds_from_protein/scripts/ncfp.py", line 379, in run_main
    nt_sequences = extract_cds_features(seqrecords, cachepath, args)
  File "/Users/lpritc/Development/ncfp/ncbi_cds_from_protein/scripts/ncfp.py", line 192, in extract_cds_features
    ntseq, aaseq = extract_feature_cds(
  File "/Users/lpritc/Development/ncfp/ncbi_cds_from_protein/sequences.py", line 320, in extract_feature_cds
    if aaseq[-1] == "*":
  File "/Users/lpritc/opt/anaconda3/envs/ncfp_py310/lib/python3.10/site-packages/Bio/Seq.py", line 430, in __getitem__
    return chr(self._data[index])
IndexError: index out of range

Expected Output:

Graceful fail (message/warning) saying that the sequence couldn't be matched automatically, or the correct sequence returned.
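A sketch of the graceful failure asked for above: check the requested region against the translated CDS before slicing, and signal a miss instead of raising. The function name trim_to_region is hypothetical, and the coordinates are assumed to be 1-based and inclusive, as in the /414-438 suffix.

```python
def trim_to_region(aaseq, start, end):
    """Return residues start..end (1-based, inclusive), or None if out of range.

    Hypothetical helper for illustration; not part of ncfp.
    """
    if start < 1 or end < start or end > len(aaseq):
        # The region does not fit inside this CDS: probably the wrong match.
        return None
    return aaseq[start - 1:end]
```

A None return can then be logged as a warning and the query skipped, which avoids indexing into an empty sequence.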

ncfp Version:

git HEAD

Python Version:

3.10

Operating System:

macOS

Biopython DeprecationWarning for ungap() Method

Summary:

A BiopythonDeprecationWarning is raised for the ungap() method, advising replacement with replace().

Description:

The code is currently using the deprecated ungap() method in Biopython, which triggers a BiopythonDeprecationWarning. The warning suggests replacing myseq.ungap(gap) with myseq.replace(gap, "") to address the deprecation.
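The replacement suggested by the warning is sketched below on a plain Python string, so that the example runs without Biopython installed. The function name strip_gaps is hypothetical. According to the warning text quoted above, the same replace(gap, "") call applies to the Biopython sequence object.

```python
def strip_gaps(sequence, gap="-"):
    """Remove every gap character, mirroring myseq.replace(gap, "")."""
    return sequence.replace(gap, "")


# Gapped sequence taken from an input example elsewhere on this page.
ungapped = strip_gaps("------DKADNAFMMICTALVLFMTIPGIALFYGGLI")
```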

Reproducible Steps:

  1. Obtain the FASTA sequence for the PduK protein reference B1VB70 from UniProt.
  2. Use Biopython to read the sequence and perform a sequence operation that involves the deprecated ungap() method, e.g., myseq.ungap("-").
  3. Observe the BiopythonDeprecationWarning generated.

Current Output:

The correct output is generated with the warning.

Expected Output:

No warning should be generated, and the code should execute without issues.

ncfp Version: https://github.com/widdowquinn/ncfp.git

Python Version: 3.9.0

Operating System: Windows 10

Tests indicate a deprecated method

Summary:

There is a deprecation warning when running pytest

Description:

../../../../opt/anaconda3/envs/ncfp_py39/lib/python3.9/site-packages/requests_cache/backends/storage/dbdict.py:9
  /Users/lpritc/opt/anaconda3/envs/ncfp_py39/lib/python3.9/site-packages/requests_cache/backends/storage/dbdict.py:9: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
    from collections import MutableMapping

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
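The warning comes from an import inside the requests_cache dependency rather than from ncfp itself, so the practical fix is likely to be a dependency upgrade. For reference, the import form the warning asks for is shown below. The class DictBackedStore is a hypothetical stand-in, included only so the example is complete and runnable.

```python
from collections.abc import MutableMapping  # not `from collections import ...`


class DictBackedStore(MutableMapping):
    """Hypothetical minimal mapping, used only to show the import in use."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)
```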

Upgrade schema/interface for cache

Summary:

The current cache access is handled through hard-coded SQL statements. This should be reimplemented as an ORM using SQLAlchemy.

Also, UniProt can return multiple EMBL entries for a single protein sequence. The current schema has (accession, aa_query, nt_query) as a row in the seqdata table, with accession as primary key. This permits only one aa/nt query string per record ID. When upgrading, we should revise the schema so that there's a 1:* relationship between accession and each query type.
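The 1:* layout described above can be sketched with the standard-library sqlite3 module. This shows only a possible table structure, not the SQLAlchemy ORM the issue proposes. The table name nt_query, its columns, and the accession values are made-up placeholders, and an aa_query table would follow the same pattern.

```python
import sqlite3

# Hypothetical schema: one row per accession, many query strings per accession.
SCHEMA = """
CREATE TABLE seqdata (
    accession TEXT PRIMARY KEY
);
CREATE TABLE nt_query (
    id INTEGER PRIMARY KEY,
    accession TEXT NOT NULL REFERENCES seqdata(accession),
    query TEXT NOT NULL,
    UNIQUE (accession, query)
);
"""


def nt_queries_for(conn, accession):
    """Return all nucleotide query strings stored for one accession."""
    rows = conn.execute(
        "SELECT query FROM nt_query WHERE accession = ? ORDER BY query",
        (accession,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Placeholder values, not real accessions.
conn.execute("INSERT INTO seqdata (accession) VALUES (?)", ("PROT_1",))
conn.executemany(
    "INSERT INTO nt_query (accession, query) VALUES (?, ?)",
    [("PROT_1", "EMBL_A"), ("PROT_1", "EMBL_B")],
)
```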

Not identifying CDS feature despite feature existing

Summary:

Although there is a match with GenBank and the information is there, ncfp does not manage to find the CDS using the GN field.

Description:

The UniProt header sequence matches a GenBank entry. ncfp tries extracting the CDS by locus tag, which fails, then tries the GN field, which does not find a match. Searching manually on the NCBI page does find a match, however.
I remember you investigating and finding that, at a certain step, ncfp deletes a motif from the beginning of the header so the query can be taken forward. I think this error occurs when there are two motifs: only the first is deleted, so the query cannot be taken forward.

Reproducible Steps:

I will attach a file to reproduce the error, containing one sequence that works and one that gives the error.

The command I used was

ncfp --unify_seqid -v -s -l file.log in.fasta out.out [email protected] -d caches/new_ncfp -c ncfp_cache

A reproducible example is included.

Current Output:

[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Sequence sp|Q9BLG4|AMT1_DICDI/128-148 matches GenBank entry AF510716.1
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Extracting CDS by locus tag with AA query ID: ('DDB_G0277503',)
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Did not find feature with locus tag ('DDB_G0277503',), trying GN field
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Searching for CDS: amtA
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Could not identify CDS feature for sp|Q9BLG4|AMT1_DICDI/128-148

Expected Output:

For comparison, the output for the sequence that works:

[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Sequence tr|Q22947|Q22947_CAEEL/118-137 matches GenBank entry BX284605.5
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Extracting CDS by locus tag with AA query ID: ('CELE_F08F3.3 F08F3.3',)
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Did not find feature with locus tag ('CELE_F08F3.3 F08F3.3',), trying GN field
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Searching for CDS: rhr-1
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Sequence tr|Q22947|Q22947_CAEEL/118-137 matches CDS feature CCD65593.1
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Extracting coding sequence...
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Translated sequence matches input sequence

ncfp Version:

0.2.1a1

Python Version:

3.9.18

Operating System:

Mac

example.txt

Add an indicator of total steps to progress bars

Summary:

Add a marker for the user so that they know how far through the overall process they are - not just how far this step has gone.

Description:

Each step is currently represented with a progress bar, but there is no measure of overall progress. Adding a label of 1/n to each would help.
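A small sketch of the proposed labelling. The function name step_labels is hypothetical. The step names in the usage note are taken from the progress bar output in other issues on this page, and each label could be passed as the description of the corresponding progress bar.

```python
def step_labels(step_names):
    """Prefix each step name with its position, e.g. '1/4 Process input sequences'."""
    total = len(step_names)
    return [f"{i}/{total} {name}" for i, name in enumerate(step_names, start=1)]
```

For example, step_labels(["Process input sequences", "Search NT IDs", "Fetch UID accessions", "Fetching GenBank headers"]) yields labels running from 1/4 to 4/4.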

UniProt protein ID is available but not used for resolution of ambiguous gene names

Summary:

Some UniProt entries supply a GenBank protein ID that could be used to resolve ambiguities, but this is not currently exploited.

Description:

The sequence

>tr|A0A127QBK9|A0A127QBK9_9BURK/414-438 [subseq from] Ammonium transporter OS=Collimonas pratensis OX=279113 GN=amt PE=3 SV=1
DVFGVHGVGGIMGALLTGVFAAPSL

(as in #41)

has UniProt entry: https://www.uniprot.org/uniprotkb/A0A127QBK9/entry, which crosslinks to protein ID AMP07459.1 (https://www.ncbi.nlm.nih.gov/protein/AMP07459.1) - but ncfp only uses the GN=amt term to search within the downloaded GenBank file.

Reproducible Steps:

Issue

ncfp --unify_seqid --use_protein_ids --debug -l bad_sequence.log -s bad_sequence.fasta bad_sequence [email protected]

with the above sequence

Current Output:

[DEBUG] [ncbi_cds_from_protein.sequences]: Trimming CDS to Stockholm coordinates: 414..438
[WARNING] [ncbi_cds_from_protein.sequences]: Requested region 414..438 is outside CDS, skipping
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: Could not extract CDS for tr|A0A127QBK9|A0A127QBK9_9BURK/414-438 (skipping)
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Matched 0/1 records

Expected Output:

An attempt to match the protein ID AMP07459.1, followed by return of the underlying nucleotide sequence.
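The lookup order requested above (protein ID first, gene name only as a fallback) might be sketched as follows. The function name select_cds is hypothetical. Features are represented as plain dicts mapping qualifier names to lists of values, a simplification chosen so the example runs without Biopython.

```python
def select_cds(features, protein_id=None, gene=None):
    """Pick a CDS feature by protein_id if possible, else by unambiguous gene name.

    Hypothetical helper for illustration; not part of ncfp.
    """
    if protein_id:
        for feature in features:
            if protein_id in feature.get("protein_id", []):
                return feature
    if gene:
        matches = [f for f in features if gene in f.get("gene", [])]
        # Only accept the gene name when it identifies exactly one feature.
        if len(matches) == 1:
            return matches[0]
    return None
```

With two features that both carry gene amt, a call with protein_id="AMP07459.1" would return the one with that protein ID, while a call with only gene="amt" would return None rather than guess.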

ncfp Version:

git HEAD

Python Version:

3.10

Operating System:

macOS

Drop terminal stop codons

Summary:

The complete nucleotide sequence retrieved from NCBI is written to the output, including any terminal stop codons.
These sequences often cannot be used for backthreading onto aligned protein sequences, because the nucleotide sequence contains terminal stop codons that have no counterpart in the protein sequence.

Description:

A --drop_stop_codons flag could be added. When it is used, all terminal stop codons in the CDS are removed, so that the retrieved CDS matches the protein sequence for backthreading. Otherwise, additional parsing of the ncfp output is required before backthreading nucleotide sequences onto aligned protein sequences.

Current Output:

The only output is the complete nucleotide sequence.

Expected Output:

When using the flag --drop_stop_codons, terminal stop codons are removed from the end of each cds.
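A sketch of what the flag could do, assuming the standard stop codons TAA, TAG and TGA. The function name drop_terminal_stops is hypothetical, and alternative genetic codes are not handled.

```python
# Assumption: standard genetic code stop codons only.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def drop_terminal_stops(cds):
    """Remove any stop codons from the 3' end of a coding sequence."""
    while len(cds) >= 3 and cds[-3:].upper() in STOP_CODONS:
        cds = cds[:-3]
    return cds
```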

ncfp Version:

v0.2.0

`ncfp` not recovering all coding sequences from NCBI

Summary:

ncfp does not recover all coding sequences from NCBI, even if a coding sequence is available

Description:

The UniProt sequence below

>tr|F5NV06|F5NV06_SHIFL MliC domain-containing protein OS=Shigella flexneri K-227 OX=766147 GN=SFK227_1958 PE=4 SV=1
MKKLLIIILPVLLSGCSAFNQLVERMQTDTLEYQCDEKPLTVKLNNPCQEVSFVYDNQLL
HLKQGLSASGARYSDGIYVFWSKGEEATVYKRDRIVLNNCQLQNPQR

corresponds to the NCBI record

https://www.ncbi.nlm.nih.gov/protein/333018885

whose coding sequence is in the nucleotide accession

https://www.ncbi.nlm.nih.gov/nuccore/AFGY01000021.1

but in debug mode ncfp reports:

[DEBUG] [ncbi_cds_from_protein.sequences]: Guessing sequence type for tr|F5NV06|F5NV06_SHIFL...
[DEBUG] [ncbi_cds_from_protein.sequences]: ...guessed UniProt
[DEBUG] [ncbi_cds_from_protein.sequences]: Uniprot record has GN field: SFK227_1958
[DEBUG] [ncbi_cds_from_protein.sequences]: Recovered EMBL database record: AFGY01000021
[DEBUG] [ncbi_cds_from_protein.sequences]: Adding record tr|F5NV06|F5NV06_SHIFL to cache with query AFGY01000021
Process input sequences: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.12it/s]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: 1 sequences taken forward with query
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Identifying nucleotide accessions...
Search NT IDs:   0%|                                                                                                                    | 0/1 [00:00<?, ?it/s][DEBUG] [ncbi_cds_from_protein.entrez]: Entry has nt query, using direct ESearch
[DEBUG] [ncbi_cds_from_protein.entrez]: ESearch query: ('AFGY01000021',)
Search NT IDs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.81it/s]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Added 1 new UIDs to cache
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Collecting GenBank accessions...
Fetch UID accessions: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.24s/it]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Updated GenBank accessions for 1 UIDs
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Fetching GenBank headers...
[DEBUG] [ncbi_cds_from_protein.entrez]: Found 1 UIDs with no GenBank headers
[DEBUG] [ncbi_cds_from_protein.entrez]: Checking EPost histories, batch size is 1
[DEBUG] [ncbi_cds_from_protein.entrez]: Found 1 EPost histories, fetching headers
[...]
DEBUG:ncbi_cds_from_protein.entrez:Parsed 1 records
Fetching GenBank headers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.22s/it]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Fetched GenBank headers for 0 UIDs
INFO:ncbi_cds_from_protein.scripts.ncfp:Fetched GenBank headers for 0 UIDs
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No GenBank header downloads were required! (in cache?)
WARNING:ncbi_cds_from_protein.scripts.ncfp:No GenBank header downloads were required! (in cache?)
[...]
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input tr|F5NV06|F5NV06_SHIFL
WARNING:ncbi_cds_from_protein.scripts.ncfp:No record found for sequence input tr|F5NV06|F5NV06_SHIFL
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Matched 0/1 records
INFO:ncbi_cds_from_protein.scripts.ncfp:Matched 0/1 records

and the ncfp*.fasta output files are empty.

Reproducible Steps:

  1. Create an input file containing only the sequence above.
  2. Call ncfp on that input file, e.g. with ncfp --debug -l test.log -b 1 --keepcache test.fasta test_ncfp [email protected]

ncfp Version:

Commit 0f70697

Python Version:

Python 3.8

Operating System:

macOS

Error occurring when ncfp encounters sequences which have been removed from NCBI database

Summary: NCBI sequence identifiers which have been removed from the NCBI database cause a "no link/record returned for: xyz" error.

Description:

When using ncfp to create a file of nucleotide sequences from a FASTA file containing amino acid sequences and their associated accession numbers, two errors are reported while ncfp attempts to obtain the nucleotide sequences. The first is IndexError: list index out of range, and the second is NCFPEFetchException: no link / record returned for: xyz

Reproducible Steps:

The command that produced these errors was:

ncfp SigR_500_aa_seqs.fasta ncfp_nucleo_seqs [email protected]

Current Output:

Process input sequences: 100%|██████████████████████████████████████████████████████████████████████| 500/500 [04:14<00:00, 1.96it/s]
Search NT IDs: 1%|▍ | 3/500 [00:05<16:09, 1.95s/it]
Traceback (most recent call last):
  File "/home/lee/miniconda3/lib/python3.9/site-packages/ncbi_cds_from_protein/entrez.py", line 216, in search_nt_ids
    idlist = [lid["Id"] for lid in result[0]["LinkSetDb"][0]["Link"]]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lee/miniconda3/bin/ncfp", line 10, in <module>
    sys.exit(run_main())
  File "/home/lee/miniconda3/lib/python3.9/site-packages/ncbi_cds_from_protein/scripts/ncfp.py", line 246, in run_main
    addedrows, countfail = search_nt_ids(qrecords, cachepath, args.retries, disabletqdm=args.disabletqdm)
  File "/home/lee/miniconda3/lib/python3.9/site-packages/ncbi_cds_from_protein/entrez.py", line 218, in search_nt_ids
    raise NCFPEFetchException("No link/record returned for %s" % record.id)
ncbi_cds_from_protein.entrez.NCFPEFetchException: No link/record returned for WP_078606386.1

Expected Output:

Fasta file containing the nucleotide sequences returned from the amino acid sequences given.
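The IndexError in the traceback arises when the link result contains no LinkSetDb entries. A minimal sketch of a tolerant lookup follows, using the same nested structure as the failing line. The function name linked_ids is hypothetical.

```python
def linked_ids(result):
    """Return linked IDs from an ELink-style result, or [] when there are none.

    Hypothetical helper for illustration; not part of ncfp.
    """
    try:
        links = result[0]["LinkSetDb"][0]["Link"]
    except (IndexError, KeyError):
        # No link set returned, e.g. the record was removed from the database.
        return []
    return [link["Id"] for link in links]
```

An empty list lets the caller record the accession as skipped and continue with the remaining sequences, rather than stopping the run.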

Operating System: Linux (Linux for Windows, flavor Xubuntu)

CircleCI test error: Entrez calls failing

Summary:

CI tests are failing due to connection issues against NCBI/Entrez

Description:

CircleCI's routine testing threw an error with an Entrez call. This appears to be relevant to #17

============================= test session starts ==============================
platform linux -- Python 3.8.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /home/circleci/repo/venv/bin/python3
cachedir: .pytest_cache
rootdir: /home/circleci/repo
plugins: cov-2.12.1
collecting ... collected 7 items

tests/test_cli_parsing.py::test_bad_infile PASSED                        [ 14%]
tests/test_cli_parsing.py::test_create_and_keep_cache FAILED             [ 28%]
tests/test_cli_parsing.py::test_download_and_log PASSED                  [ 42%]
tests/test_ncfp.py::test_basic_ncbi PASSED                               [ 57%]
tests/test_ncfp.py::test_basic_uniprot PASSED                            [ 71%]
tests/test_ncfp.py::test_basic_stockholm SKIPPED (Database caching n...) [ 85%]
tests/test_ncfp.py::test_small_stockholm PASSED                          [100%]
------------------------------ Captured log call -------------------------------
WARNING  ncbi_cds_from_protein.entrez:entrez.py:329 ELing query (XP_005274161.1) failed (retry 1)
Traceback (most recent call last):
  File "/home/circleci/repo/ncbi_cds_from_protein/entrez.py", line 324, in elink_fetch_with_retries
    Entrez.elink(dbfrom=dbname, linkname=linkdbname, id=query_id)
  File "/home/circleci/repo/venv/lib/python3.8/site-packages/Bio/Entrez/__init__.py", line 281, in elink
    return _open(cgi, variables)
  File "/home/circleci/repo/venv/lib/python3.8/site-packages/Bio/Entrez/__init__.py", line 606, in _open
    handle = urlopen(cgi)
  File "/usr/local/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/local/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/local/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/local/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 429: Too Many Requests

----------- coverage: platform linux, python 3.8.5-final-0 -----------
Coverage XML written to file .coverage.xml

=========================== short test summary info ============================
FAILED tests/test_cli_parsing.py::test_create_and_keep_cache - ncbi_cds_from_...
============== 1 failed, 5 passed, 1 skipped in 78.42s (0:01:18) ===============

Exited with code exit status 1

CircleCI received exit code 1

Expected Output:

Tests should pass.
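HTTP 429 responses are usually handled by waiting and retrying. The sketch below shows a generic exponential backoff wrapper. It is not the retry logic in ncfp's elink_fetch_with_retries, and the name call_with_backoff and its parameters are hypothetical.

```python
import time
import urllib.error


def call_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on HTTP 429 with exponentially increasing delays.

    Hypothetical helper for illustration; other errors are re-raised at once.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == retries:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The sleep argument is injectable so that tests can run without real delays.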

ncfp Version:

0.2.0

Python Version:

3.6, 3.7, 3.8

Operating System:

Linux (at CircleCI)

List index out of range when processing sequences

Summary:

An error is thrown when processing sequences:

[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Initialising cache...
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Setting up SQLite3 database cache at caches/new_ncfpnew/ncfpcache_ncfp_cache.sqlite3...
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Parsing sequence input...
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Reading sequences from ncfp_error_2.fasta
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: 8 sequence records read successfully from ncfp_error_2.fasta
[INFO] [ncbi_cds_from_protein.sequences]: Processing sequences...
1/5 Process input sequences:  25%|█████████████████████████████████████████████████▊                                                                                                                                                     | 2/8 [00:01<00:03,  1.83it/s]
Traceback (most recent call last):
  File "/opt/anaconda3/envs/ncfp_py311/bin/ncfp", line 33, in <module>
    sys.exit(load_entry_point('ncfp', 'console_scripts', 'ncfp')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lpritc/Documents/Development/GitHub/ncfp/ncbi_cds_from_protein/scripts/ncfp.py", line 333, in run_main
    qrecords, qskipped = process_sequences(
                         ^^^^^^^^^^^^^^^^^^
  File "/Users/lpritc/Documents/Development/GitHub/ncfp/ncbi_cds_from_protein/sequences.py", line 145, in process_sequences
    qstring = result.split("\n")[1].strip()[:-1]
              ~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Description:

Sequences for which ncfp should retrieve a corresponding nucleotide sequence are throwing an error.

Reproducible Steps:

With the file ncfp_error_2.fasta:

>tr|A0A369NYJ2|A0A369NYJ2_9ACTN/284-302 Ammonium transporter OS=Adlercreutzia equolifaciens subsp. celatus OX=394340 GN=C1850_06480 PE=3 SV=1 A0A369NYJ2_Helix_8
LVGAATGLVAGLVAITPAA
>tr|A0A3N0AQL8|A0A3N0AQL8_9ACTN/284-302 Ammonium transporter OS=Adlercreutzia equolifaciens subsp. celatus DSM 18785 OX=1121021 GN=DMP10_09225 PE=3 SV=1 A0A3N0AQL8_Helix_8
LVGAATGLVAGLVVITPAA
>tr|A0A6I8TEK7|A0A6I8TEK7_AEDAE/292-312 Ammonium transporter OS=Aedes aegypti OX=7159 GN=5569115 PE=3 SV=1 A0A6I8TEK7_Helix_8
IVDLINGILASLVSVTAGCFL
>tr|A0A6I8TEI4|A0A6I8TEI4_AEDAE/289-309 Ammonium transporter OS=Aedes aegypti OX=7159 GN=5569115 PE=3 SV=1 A0A6I8TEI4_Helix_8
IVDLINGILASLVSVTAGCFL
>tr|Q1L727|Q1L727_AEDAE/285-303 Ammonium transporter OS=Aedes aegypti OX=7159 PE=2 SV=1 Q1L727_Helix_8
IMNGVLASLVSVTGGCYLF
>tr|A0A1S4FGA3|A0A1S4FGA3_AEDAE/292-310 AAEL007377-PA OS=Aedes aegypti OX=7159 GN=AAEL007377 PE=4 SV=1 A0A1S4FGA3_Helix_8
IMNGVLASLVSVTGGCYLF
>tr|E2SD44|E2SD44_9ACTN/268-285 Ammonium transporter OS=Aeromicrobium marinum DSM 15272 OX=585531 GN=amt PE=3 SV=1 E2SD44_Helix_8
KATAVGAASGVVTGLVAI
>tr|E2SD33|E2SD33_9ACTN/260-277 Ammonium transporter OS=Aeromicrobium marinum DSM 15272 OX=585531 GN=amt PE=3 SV=1 E2SD33_Helix_8
LGAASGAIAGLVAVTPAA

and the command

ncfp --unify_seqid -s -v -l ncfp_error_2.log ncfp_error_2.fasta ncfp_error_2.out [email protected] -d caches/new_ncfpnew -c ncfp_cache

the error above is thrown
