wtmatlock / flanker

Gene-flank analysis tool

License: MIT License

flanker's Introduction

This repository accompanies the paper Matlock W, Lipworth S, Constantinides B, Peto TEA, Walker AS, Crook D, Hopkins S, Shaw LP, Stoesser N. Flanker: a tool for comparative genomics of gene flanking regions. Microbial Genomics. 2021. doi: https://doi.org/10.1099/mgen.0.000634

Installation and documentation

Read the Docs: https://flanker.readthedocs.io/en/latest/

Reproducibility

You can reproduce the analysis and figures from our manuscript in the Binder environment (click on the Binder badge). N.B. the environment is slow to load; once it has loaded, open the flanker.rmd file.

License

Flanker is available under the MIT License.

flanker's Issues

Choose abricate db

add argument to choose abricate db e.g. resfinder, ncbi etc.

Reverse complements

When a gene is annotated, flanker needs to check whether it's the reverse complement (in the Abricate output), and if so, flip it with Biopython
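A minimal sketch of the strand check described above. It uses only the standard library as a stand-in for Biopython's reverse-complement method, and it assumes the strand is available as the '+' or '-' value reported in the Abricate output. The function names are hypothetical and are not part of Flanker.

```python
# Stand-in for Biopython's Seq.reverse_complement(), standard library only.
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_to_gene(seq: str, strand: str) -> str:
    """Return seq unchanged for '+' genes and flipped for '-' genes.

    `strand` is assumed to be the '+'/'-' value from the Abricate output.
    """
    return reverse_complement(seq) if strand == "-" else seq
```

With this shape, every extracted flank can be passed through orient_to_gene so that downstream comparisons always see the gene in the same orientation.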

Question about Flanker Docs

Hello,

https://flanker.readthedocs.io/en/latest/

I am having difficulty understanding some of the information in the documentation. In order to fully understand and reproduce the information, could you please provide more details about the *fsa files shown below?

cat *fsa > david_plasmids.fasta

I am wondering if the david_plasmids.fasta file and the file located at https://github.com/wtmatlock/flanker/blob/main/flanker/tests/data/test.fasta are nearly identical?

I came across a small typo in the command line argument for Flanker in the documentation. In the following command:

flanker --flank upstream --window 0 --wstop 5000 --wstep 100 --gene blaKPC-2 --fasta_file david_plasmids.fasta --include_gene

The "--wstop" and "--wstep" options should be written with a single dash instead of a double dash, as follows:

flanker --flank upstream --window 0 -wstop 5000 -wstep 100 --gene blaKPC-2 --fasta_file david_plasmids.fasta --include_gene

Using double dash instead of single dash with these options results in an error, as shown below:

usage: flanker [-h] -i FASTA_FILE (-g GENE [GENE ...] | -log LIST_OF_GENES)
               [-cm] [-f FLANK] [-m MODE] [-circ] [-inc] [-db DATABASE]
               [-v [VERBOSE]] [-w WINDOW] [-wstop WINDOW_STOP]
               [-wstep WINDOW_STEP] [-cl] [-o OUTFILE] [-tr THRESHOLD]
               [-p THREADS] [-k KMER_LENGTH] [-s SKETCH_SIZE]
flanker: error: unrecognized arguments: --wstop 5000 --wstep 100
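The behaviour in the error above can be reproduced with a small argparse sketch. The option strings below are assumptions inferred from the usage message (the long names --window_stop and --window_step in particular are hypothetical), not Flanker's actual parser definition.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: option strings are guessed from the usage text.
    parser = argparse.ArgumentParser(prog="flanker-sketch")
    parser.add_argument("-w", "--window", type=int, default=1000)
    parser.add_argument("-wstop", "--window_stop", type=int, default=None)
    parser.add_argument("-wstep", "--window_step", type=int, default=None)
    return parser


parser = build_parser()

# Single-dash forms are registered option strings, so they parse cleanly.
ok = parser.parse_args(["-w", "0", "-wstop", "5000", "-wstep", "100"])

# "--wstop" is neither registered nor a prefix of "--window_stop",
# so argparse reports it as unrecognized.
_, unknown = parser.parse_known_args(["--wstop", "5000", "--wstep", "100"])
```

If a parser were defined this way, accepting the double-dash spelling would only require adding "--wstop" and "--wstep" as extra option strings.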

Could you please provide the command to produce the out* files shown below?

cat out* | sed '/assembly/d'  > all_out

Optionally include the gene

My version included the gene; yours doesn't. I'm not completely sure which is better. Maybe we should include it as an option?

Citation info

Add citation info

using draft assemblies?

Dear William,
Thanks for developing this awesome tool, I'm sure it will prove very useful for a lot of researchers.

I'm evaluating Flanker for my own research, but I only possess short-read sequencing data. Therefore, I wanted to ask you: have you tried the tool with draft assemblies? If so, how did it perform?

Thanks for your time. I'll be waiting for your answer

Cheers,

Parallelise abricate when supplied multifasta

Discussed briefly with Sam, could probably chunk input fastas and process separately in their own tempdir
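One way the chunking step could look, as a sketch only: split the multi-FASTA text into records and deal them into a fixed number of chunks, each of which could then be written to its own temporary directory and annotated separately. The function names are hypothetical, and running Abricate on each chunk is deliberately left out.

```python
from typing import List


def split_records(fasta_text: str) -> List[str]:
    """Split multi-FASTA text into one string per record."""
    records, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">") and current:
            # A new header closes the record collected so far.
            records.append("\n".join(current) + "\n")
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


def chunk(records: List[str], n_chunks: int) -> List[List[str]]:
    """Deal records round-robin into at most n_chunks non-empty groups."""
    n_chunks = max(1, min(n_chunks, len(records)))
    return [records[i::n_chunks] for i in range(n_chunks)]
```

Round-robin dealing keeps chunk sizes within one record of each other, which matters when the number of worker processes is small.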

More robust querying

Implement a closest match mode for gene queries with some string edit distance
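A sketch of what a closest-match mode might look like, using a plain dynamic-programming Levenshtein distance so that it has no dependencies. The function names and the case-insensitive comparison are assumptions for illustration, not Flanker's implementation.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def closest_gene(query: str, annotated: Iterable[str]) -> str:
    """Return the annotated gene name with the smallest edit distance."""
    return min(annotated,
               key=lambda name: edit_distance(query.lower(), name.lower()))
```

A real implementation would probably also want a maximum distance, so that a query with no sensible match is rejected rather than mapped to an unrelated gene.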

Dependency problems since Conda update

Hi,

I've previously used Flanker without issues but recently had to re-install it. Since then I have kept having the same error message despite running the same commands that worked successfully previously. Any tips much appreciated:

Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/flanker/bin/flanker", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/miniconda3/envs/flanker/lib/python3.7/site-packages/flanker/flanker.py", line 618, in main
flanker_main()
File "/home/ubuntu/miniconda3/envs/flanker/lib/python3.7/site-packages/flanker/flanker.py", line 570, in flanker_main
args.fasta_file, args.window, gene.strip())
File "/home/ubuntu/miniconda3/envs/flanker/lib/python3.7/site-packages/flanker/flanker.py", line 421, in flank_fasta_file_lin
data = pd.read_csv(unfiltered_abricate_file, sep='\t', header=0)
File "/home/ubuntu/miniconda3/envs/flanker/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/flanker/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/ubuntu/miniconda3/envs/flanker/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 482, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/home/ubuntu/miniconda3/envs/flanker/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
self._engine = self._make_engine(self.engine)
File "/home/ubuntu/miniconda3/envs/flanker/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "/home/ubuntu/miniconda3/envs/flanker/lib/python3.7/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
self._reader = parsers.TextReader(self.handles.handle, **kwds)
File "pandas/_libs/parsers.pyx", line 549, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
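The final error is the one pandas raises when the file it is asked to parse has no content, which suggests the Abricate output file was empty. Why it was empty cannot be determined from the traceback. The sketch below shows one hedged way to turn that situation into a clearer signal. It parses tab-separated text with the standard library rather than pandas, and the function name is hypothetical.

```python
import csv
import io
from typing import Dict, List, Optional


def parse_abricate_table(text: str) -> Optional[List[Dict[str, str]]]:
    """Parse tab-separated Abricate output.

    Returns None when the output is empty, so the caller can report that
    Abricate produced nothing instead of failing inside the parser.
    """
    if not text.strip():
        return None
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
```

A caller could check for None and print a message pointing at the Abricate installation or its databases before exiting.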

Abricate query matching

Force an exact match instead of contains when looking for gene, as e.g. CTX-M-5 and CTX-M-55 would be an issue
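The difference between the two matching strategies can be shown on plain strings. This is an illustrative sketch with hypothetical function names; Flanker itself filters an Abricate results table rather than a list.

```python
from typing import List

annotated = ["blaCTX-M-5", "blaCTX-M-55", "blaKPC-2"]


def contains_match(query: str, names: List[str]) -> List[str]:
    """Substring matching: also returns longer names that contain the query."""
    return [name for name in names if query in name]


def exact_match(query: str, names: List[str]) -> List[str]:
    """Exact matching: returns only names identical to the query."""
    return [name for name in names if name == query]
```

Here contains_match("blaCTX-M-5", annotated) returns both CTX-M entries, while exact_match returns only the intended one.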

Error if gene is not annotated (gene_sense)

Error from line gene_sense=str(gene_sense['STRAND'].iloc[0]) is given in both linear and circ modes if gene is not annotated
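A sketch of a guard for this case. The annotation rows are modelled as a list of dictionaries with GENE and STRAND keys, standing in for the Abricate results table, and the function name is hypothetical.

```python
from typing import Dict, List, Optional


def gene_strand(rows: List[Dict[str, str]], gene: str) -> Optional[str]:
    """Return the strand of the first hit for `gene`, or None if absent.

    Checking for an empty hit list first avoids the out-of-bounds
    index that occurs when the gene was never annotated.
    """
    hits = [row for row in rows if row["GENE"] == gene]
    if not hits:
        return None
    return str(hits[0]["STRAND"])
```

The caller can then skip the sequence or print a warning when None is returned, in both linear and circular modes.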

CI with Github Actions

https://docs.github.com/en/actions/managing-workflow-runs/adding-a-workflow-status-badge

Gene query naming

Since moving from a contains to an exact match for the gene annotation query, you have to add Abricate suffixes to get a hit, i.e. you need blaCTX-M-55_1 instead of blaCTX-M-55. I think we need to find a new solution.
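One possible direction, sketched under the assumption that the suffix is always an underscore followed by digits at the end of the annotation name: strip the suffix before comparing, so the query is matched exactly against the base name. The names below are hypothetical.

```python
import re
from typing import List

# Assumption: annotation suffixes look like "_1", "_2", ... at the end.
_SUFFIX = re.compile(r"_\d+$")


def base_name(annotation: str) -> str:
    """Drop a trailing "_<digits>" suffix if present."""
    return _SUFFIX.sub("", annotation)


def match_gene(query: str, annotations: List[str]) -> List[str]:
    """Exact match on the suffix-free name."""
    return [a for a in annotations if base_name(a) == query]
```

This keeps the protection against CTX-M-5 matching CTX-M-55 while letting users type the gene name without a suffix.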

Improve testing

Add tests to verify output as well

Help by default is broken

python flanker.py

gives:

Traceback (most recent call last):
File "flanker.py", line 169, in <module>
main()
File "flanker.py", line 160, in main
run_abricate(args.fasta_file)
File "flanker.py", line 36, in run_abricate
p = subprocess.Popen(abricate_command, stdout = subprocess.PIPE, stderr = subprocess.PIPE) # run abricate
File "/home/sam/miniconda3/lib/python3.7/subprocess.py", line 800, in __init__
restore_signals, start_new_session)
File "/home/sam/miniconda3/lib/python3.7/subprocess.py", line 1482, in _execute_child
restore_signals, start_new_session, preexec_fn)
TypeError: expected str, bytes or os.PathLike object, not NoneType
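The TypeError arises because no FASTA file was given, so None reaches the subprocess call. A common pattern, sketched here with a hypothetical parser, is to print the help text and exit early when the script is called without arguments.

```python
import argparse
from typing import List


def main(argv: List[str]) -> int:
    # Hypothetical minimal parser, for illustration only.
    parser = argparse.ArgumentParser(prog="flanker-sketch")
    parser.add_argument("-i", "--fasta_file")
    if not argv:
        # No arguments at all: show help instead of running with None.
        parser.print_help()
        return 1
    args = parser.parse_args(argv)
    print(f"would annotate {args.fasta_file}")
    return 0
```

In a script this would be called as sys.exit(main(sys.argv[1:])). Marking the input option as required in the parser is an alternative that produces a usage error instead of the help text.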

Flanker doesn't handle multi-copy genes

Need a fix, e.g. two copies of same gene on chromosome in different locations

Fasta header naming

Add a step to flanker (or separate tool) which renames fasta headers to their file name (to remove this requirement)
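A sketch of such a renaming step, operating on FASTA text. The naming scheme for files with several records (file name plus a running index) is an assumption made for illustration, as is the function name.

```python
def rename_headers(fasta_text: str, file_name: str) -> str:
    """Replace every FASTA header with the file name.

    If the file holds more than one record, "_<n>" is appended so that
    headers stay unique (an assumed convention, not Flanker's).
    """
    lines = fasta_text.splitlines()
    n_records = sum(1 for line in lines if line.startswith(">"))
    out, seen = [], 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            suffix = f"_{seen}" if n_records > 1 else ""
            out.append(f">{file_name}{suffix}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```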

Failed to install Flanker

I followed the instructions in https://flanker.readthedocs.io/en/latest/#installation, but I encountered an error while trying to install Flanker. Below are the commands I entered and the corresponding error messages:

$conda install mamba=1.1.0 -n base -c conda-forge

Proceed ([y]/n)? 

Downloading and Extracting Packages
openssl-3.1.0        | 2.5 MB    | ##################################################################### | 100% 
mamba-1.1.0          | 62 KB     | ##################################################################### | 100% 
conda-23.3.1         | 1.2 MB    | ##################################################################### | 100% 
ruamel.yaml.clib-0.2 | 131 KB    | ##################################################################### | 100% 
jsonpatch-1.32       | 14 KB     | ##################################################################### | 100% 
ruamel.yaml-0.17.21  | 254 KB    | ##################################################################### | 100% 
packaging-23.1       | 45 KB     | ##################################################################### | 100% 
jsonpointer-2.0      | 9 KB      | ##################################################################### | 100% 
libmamba-1.1.0       | 1.4 MB    | ##################################################################### | 100% 
certifi-2022.12.7    | 147 KB    | ##################################################################### | 100% 
libmambapy-1.1.0     | 306 KB    | ##################################################################### | 100% 
boltons-23.0.0       | 296 KB    | ##################################################################### | 100% 
pluggy-1.0.0         | 16 KB     | ##################################################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Retrieving notices: ...working... done



mamba create -n flanker_env -c bioconda python=3.7 abricate=1.0.1 pandas=1.2 biopython=1.78  mash=2.2.2 networkx=2.5 python-levenshtein=0.12.2



            |/
        ███╗   ███╗ █████╗ ███╗   ███╗██████╗  █████╗
        ████╗ ████║██╔══██╗████╗ ████║██╔══██╗██╔══██╗
        ██╔████╔██║███████║██╔████╔██║██████╔╝███████║
        ██║╚██╔╝██║██╔══██║██║╚██╔╝██║██╔══██╗██╔══██║
        ██║ ╚═╝ ██║██║  ██║██║ ╚═╝ ██║██████╔╝██║  ██║
        ╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═════╝ ╚═╝  ╚═╝

        mamba (1.1.0) supported by @QuantStack

        GitHub:  https://github.com/mamba-org/mamba
        Twitter: https://twitter.com/QuantStack

█████████████████████████████████████████████████████████████


Looking for: ['python=3.7', 'abricate=1.0.1', 'pandas=1.2', 'biopython=1.78', 'mash=2.2.2', 'networkx=2.5', 'python-levenshtein=0.12.2']

bioconda/linux-64                                           Using cache
bioconda/noarch                                             Using cache
conda-forge/linux-64                                        Using cache
conda-forge/noarch                                          Using cache
r/linux-64                                                  Using cache
r/noarch                                                    Using cache
pkgs/main/noarch                                              No change                                         
pkgs/main/linux-64                                            No change                                         
pkgs/r/linux-64                                               No change                                         
pkgs/r/noarch                                                 No change                                         
ursky/linux-64                                                No change                                         
ursky/noarch                                                  No change                                         
Could not solve for environment specs
Encountered problems while solving:
  - package biopython-1.78-py39h7f8727e_0 is excluded by strict repo priority
  - package python-levenshtein-0.12.2-py39h27cfd23_0 is excluded by strict repo priority

The environment can't be solved, aborting the operation

$conda create -n flanker_env -c bioconda python=3.7 abricate=1.0.1 pandas=1.2 biopython=1.78  mash=2.2.2 networkx=2.5 python-levenshtein=0.12.2

Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: |
Found conflicts! Looking for incompatible packages.
Examining conflict for pandas abricate python python-levenshtein networkx mash biopython: 

*snip* ]

Package setuptools conflicts for:
networkx=2.5 -> setuptools
pandas=1.2 -> numexpr[version='>=2.6.8'] -> setuptools
python=3.7 -> pip -> setuptools
pandas=1.2 -> setuptools[version='<60.0.0']

Package libstdcxx-ng conflicts for:
pandas=1.2 -> libstdcxx-ng[version='>=7.3.0|>=7.5.0|>=9.3.0']
pandas=1.2 -> numpy[version='>=1.19.5,<2.0a0'] -> libstdcxx-ng[version='>=10.3.0|>=12|>=9.4.0|>=11.2.0|>=4.9|>=7.2.0']

The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.17=0
  - biopython=1.78 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - mash=2.2.2 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - pandas=1.2 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - python-levenshtein=0.12.2 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - python=3.7 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']

Your installed version is: 2.17

Note that strict channel priority may have removed packages required for satisfiability.

$uname -a
Linux bias5-login 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Left/right renaming

Rename left/right to downstream/upstream

only run abricate once 2

This feature needs to be extended so that Abricate isn't re-run for each gene of interest. Allow users to submit a list of genes.

Default flank behaviour for reverse complemented sequences

Hi all,

Thanks for the tool, very useful (and timely!) for my current project. Unfortunately I've run into a problem that I was hoping you could help me with.

I have a set of outbreak plasmids that are basically identical. However, when I ran Flanker it told me I had different flanking regions. It looked like an issue based on some of the plasmid sequences being reverse complemented, so I did some tests:

cpe059 and cpe061 have NDM on the positive strand
cpe058 has it on the negative strand
All three have the same upstream and downstream sequence (so should be clustered)

flanker -i NDM_plasmids/cpe059_61_58.plasmids.fasta -cl -g blaNDM-1 -circ -f upstream -w 5000 -o flanker_test_up

assembly_1,cluster
cpe058_contig_2_np1212.fasta_blaNDM-1_5000_upstream_flank.fasta,0
cpe061_contig_2_np1212.fasta_blaNDM-1_5000_upstream_flank.fasta,0
cpe059_contig_1_np1212.fasta_blaNDM-1_5000_upstream_flank.fasta,0

flanker -i NDM_plasmids/cpe059_61_58.plasmids.fasta -cl -g blaNDM-1 -circ -f downstream -w 5000 -o flanker_test_down

assembly_1,cluster
cpe059_contig_1_np1212.fasta_blaNDM-1_5000_downstream_flank.fasta,0
cpe058_contig_2_np1212.fasta_blaNDM-1_5000_downstream_flank.fasta,0
cpe061_contig_2_np1212.fasta_blaNDM-1_5000_downstream_flank.fasta,0

flanker -i NDM_plasmids/cpe059_61_58.plasmids.fasta -cl -g blaNDM-1 -circ -w 5000 -o flanker_test_both

assembly_1,cluster
cpe061_contig_2_np1212.fasta_blaNDM-1_5000_both_flank.fasta,0
cpe059_contig_1_np1212.fasta_blaNDM-1_5000_both_flank.fasta,0
cpe058_contig_2_np1212.fasta_blaNDM-1_5000_both_flank.fasta,1

It looks like when using the default -f ('both') it's treating the reverse complemented sequence differently. I was wondering if it had something to do with these lines (is it changing args.flank from 'both' to 'upstream' for reverse complemented genes?):

flanker/flanker/flanker.py

Lines 372 to 378 in cc82340

if gene_sense == '-':
    #record.seq = record.seq.reverse_complement()
    if args.flank == 'upstream':
        x = 'downstream'
    else:
        x = 'upstream'

Thanks for your help 🙏
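The suspicion raised in this issue can be checked against the quoted lines: for a gene on the negative strand, any flank value other than 'upstream', including 'both', falls into the else branch and becomes 'upstream'. The sketch below contrasts that logic with a variant that leaves 'both' untouched. It is an illustration of the reasoning, not the fix adopted in Flanker, and the function names are hypothetical.

```python
def quoted_flank(requested: str, gene_sense: str) -> str:
    """Mirror of the quoted snippet's branching."""
    if gene_sense == "-":
        return "downstream" if requested == "upstream" else "upstream"
    return requested


def strand_aware_flank(requested: str, gene_sense: str) -> str:
    """Swap upstream/downstream on the negative strand, keep 'both'."""
    if requested == "both" or gene_sense != "-":
        return requested
    return "downstream" if requested == "upstream" else "upstream"
```

With the quoted branching, a 'both' request on a negative-strand gene is silently reduced to a single side, which would be consistent with cpe058 landing in its own cluster.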

Optionally run until no more clusters found

Per Nicole's idea - Possibly we might even have a metric for this combining number of isolates clusters and length...?

Multiple annotations of same gene

Give warning if multiple annotations are found for the same gene
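A sketch of such a warning using the standard library. The input is assumed to be the list of gene names annotated on one sequence, and the function name is hypothetical.

```python
import warnings
from collections import Counter
from typing import Dict, List


def warn_on_multicopy(annotated_genes: List[str]) -> Dict[str, int]:
    """Warn about genes annotated more than once; return their counts."""
    counts = Counter(annotated_genes)
    multi = {gene: n for gene, n in counts.items() if n > 1}
    for gene, n in multi.items():
        warnings.warn(f"{gene} is annotated {n} times on this sequence")
    return multi
```

Returning the counts as well as warning lets the caller decide whether to extract flanks for every copy or only the first.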

Moving to BLAST

Moving Flanker from Abricate to BLAST

Deal with messy multi fasta headers

Currently if the multifasta header does not match the filename flanker will not work. We have provided a script as a workaround for the moment but this should be dealt with in the next release.

FASTA files of flanking sequences are deleted by default in clustering mode

By default, all of the FASTA files containing flanking sequences are removed after the clustering step in clustering mode (-cl, --cluster). It would be best to keep these files by default, and provide the user with an option to remove them following clustering, if desired; otherwise, the user has to run Flanker twice to obtain both clusters and the corresponding flanking sequences.

ERROR: Gene not found in sequence. Can I add a custom database? Short contigs and the --window option

I am very interested in using this tool for my analysis, and I have three questions.
(1) I tried to use the command "$ flanker --flank both -w 6000 --gene emr --fasta_file all_genomas.fasta --include_gene" in my analysis, but it did not return anything, just an error:

Error: Gene erm not found in 43770.1000.con.0010
Traceback (most recent call last):
File "/home/michel/miniconda3/envs/flanker/bin/flanker", line 10, in <module>
sys.exit(main())
File "/home/michel/miniconda3/envs/flanker/lib/python3.7/site-packages/flanker/flanker.py", line 383, in main
flanker_main()
File "/home/michel/miniconda3/envs/flanker/lib/python3.7/site-packages/flanker/flanker.py", line 345, in flanker_main
flank_fasta_file_lin(args.fasta_file, args.window,gene.strip())
File "/home/michel/miniconda3/envs/flanker/lib/python3.7/site-packages/flanker/flanker.py", line 266, in flank_fasta_file_lin
gene_sense=str(gene_sense['STRAND'].iloc[0])
File "/home/michel/miniconda3/envs/flanker/lib/python3.7/site-packages/pandas/core/indexing.py", line 895, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/michel/miniconda3/envs/flanker/lib/python3.7/site-packages/pandas/core/indexing.py", line 1501, in _getitem_axis
self._validate_integer(key, axis)
File "/home/michel/miniconda3/envs/flanker/lib/python3.7/site-packages/pandas/core/indexing.py", line 1444, in _validate_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

This is probably because all my genomes (fragmented or not) are in one concatenated file, so an error is thrown whenever the gene cannot be found in one of them.
Is there a way to extract only the contigs that carry a gene from the genomes of interest?
It seems that we must first run Abricate separately to know which contig contains the gene of interest.
(2) The second question is: if Flanker uses Abricate databases, can I add more databases to Abricate, for example a custom database or a database such as ISfinder or MEGARes?
Does Flanker work this way?

It would be interesting if Flanker had an option to supply a sequence (FASTA) and then look for the genes neighbouring that sequence.

(3) Since genome assemblies are not always of high quality, contigs can be short, medium or long. If a gene of interest lies on a short contig and I specify the --window 6000 option, there may not be 6000 bp on either side of the gene. What happens in this case?
I appreciate the answers in advance.
Thanks a lot
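For question (3), one plausible way to handle short linear contigs is to clamp the flank to the contig boundaries. The sketch below illustrates that idea with 0-based, half-open coordinates. It is an assumption about a possible behaviour, not a description of what Flanker does, and the function name is hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]


def linear_flanks(start: int, end: int, window: int,
                  contig_len: int) -> Tuple[Interval, Interval]:
    """Return (upstream, downstream) intervals around gene [start, end).

    Each flank is truncated at the contig edge when fewer than
    `window` bases are available on that side.
    """
    upstream = (max(0, start - window), start)
    downstream = (end, min(contig_len, end + window))
    return upstream, downstream
```

For a gene at 2000 to 3000 on a 5000 bp contig with a 6000 bp window, both flanks are cut to the 2000 bp that actually exist on each side.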

Unable to update Abricate; issues after trying

Edit: I was able to solve this problem with help from our IT administrator.

I was originally able to set up and use abricate using these instructions from the main page:

conda install -c conda-forge -c bioconda -c defaults abricate
abricate --check
abricate --list

However, I noticed that the databases said 2018. I tried to update them using the command on the main page,

~/02_21_22_abricate$ abricate-get_db --db ncbi --force
Can't locate Bio/Seq.pm in @INC (you may need to install the Bio::Seq module) (@INC contains: /data/home/user/anaconda3/lib/site_perl/5.26.2/x86_64-linux-thread-multi /data/home/user/anaconda3/lib/site_perl/5.26.2 /data/home/user/anaconda3/lib/5.26.2/x86_64-linux-thread-multi /data/home/user/anaconda3/lib/5.26.2 .) at /home/user/anaconda3/bin/abricate-get_db line 3.
BEGIN failed--compilation aborted at /home/user/anaconda3/bin/abricate-get_db line 3.

I had the same issue both on our lab's server and on my laptop running ubuntu.

On my laptop using Ubuntu, I installed Bio::Seq. I was still running into errors so I tried the advice on this issue page:

tseemann/abricate#174

The problem was solved as follows:
conda update -c conda-forge -c bioconda -c defaults abricate

If it works, be sure to update all databases
abricate --setupdb

abricate --list

Unfortunately, I now get this response:

abricate --check

Can't locate List/MoreUtils.pm in @INC (you may need to install the List::MoreUtils module) (@INC contains: /data/home/user/anaconda3/lib/site_perl/5.26.2/x86_64-linux-thread-multi /data/home/user/anaconda3/lib/site_perl/5.26.2 /data/home/user/anaconda3/lib/5.26.2/x86_64-linux-thread-multi /data/home/user/anaconda3/lib/5.26.2 .) at /home/user/anaconda3/bin/abricate line 10.
BEGIN failed--compilation aborted at /home/user/anaconda3/bin/abricate line 10.

That message is from our server, but I get the same one (with different locations) on my laptop with Ubuntu. I added the List::MoreUtils module but I am still getting the same error.

Has anyone seen or solved this issue?

Conda installation broken by abricate dependency

Environment seems unsolvable due to Abricate

% conda create -n test python=3 abricate
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: -

default arguments

Add default arguments e.g. -w 1000
Display these values when -h is called

Don't re-run abricate for each window length

There is no need to re-run Abricate for every new window length; this scales horribly. Allow the user to specify the windows to run as input and only run Abricate once.
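A sketch of the intended control flow: annotate once, then loop over genes and window lengths using the cached result. The half-open range for window sizes and all function names are assumptions for illustration. The annotation step is passed in as a callable so the example does not depend on Abricate.

```python
from typing import Callable, Dict, List, Optional, Tuple


def window_sizes(window: int, wstop: Optional[int] = None,
                 wstep: Optional[int] = None) -> List[int]:
    """Window lengths from `window` up to (not including) `wstop`."""
    if wstop is None or wstep is None:
        return [window]
    return list(range(window, wstop, wstep))


def plan_extractions(annotate: Callable[[str], Dict[str, Tuple[int, int]]],
                     fasta_file: str, genes: List[str],
                     windows: List[int]) -> List[Tuple[str, int]]:
    """Annotate once, then pair every found gene with every window."""
    hits = annotate(fasta_file)  # the only annotation call
    return [(gene, w) for gene in genes if gene in hits for w in windows]


# Usage with a fake annotator that records how often it is called.
calls: List[str] = []


def fake_annotate(path: str) -> Dict[str, Tuple[int, int]]:
    calls.append(path)
    return {"blaKPC-2": (100, 200)}


plan = plan_extractions(fake_annotate, "x.fasta",
                        ["blaKPC-2", "blaNDM-1"], window_sizes(0, 300, 100))
```

The annotation cost then no longer depends on the number of windows or genes requested.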