phagcn's Introduction

PhaGCN is a GCN-based model for taxonomic classification of new phages; it learns the species masking feature via a deep learning classifier. To use PhaGCN, you only need to provide your contigs as input to the program.

ATTENTION!!!

  1. The program has been updated and moved to PhaBOX (https://github.com/KennthShang/PhaBOX), which is more user-friendly. We hope you will enjoy it. An independent PhaGCN with the latest ICTV taxonomy can also be found in PhaGCN_newICTV, but we still encourage you to use PhaBOX for your convenience. This folder will no longer be maintained.

  2. Our web server for phage-related tasks (including phage identification, taxonomy classification, lifestyle prediction, and host prediction) is available! You can visit [https://phage.ee.cityu.edu.hk/] to use the GUI. We also provide more detailed intermediate files and visualizations for further analysis.

Required Dependencies

An easier way to install

We recommend installing all the packages with Anaconda.

After cloning this repository, you can use Anaconda to create the environment from environment.yaml. This will install all the packages you need in GPU mode (make sure you have installed CUDA on your system).

  conda env create -f environment.yaml -n phagcn
  conda activate phagcn

If you want to use the GPU to accelerate the program, you need:

  • cuda
  • Pytorch-gpu

Note: please install the PyTorch build that matches the CUDA version on your system.

Usage (example)

Here we present an example to show how to run PhaGCN. We provide a file named "contigs.fa" in the GitHub folder; it contains contigs simulated from E. coli phage. The only command that you need to run is python run_Speed_up.py --contigs contigs.fa --len 8000.

There are two parameters for the program: 1. --contigs is the path of your contigs file. 2. --len is the length of the contigs you want to predict. As shown in our paper, recall and precision increase as the contig length increases. We recommend choosing a length that suits your needs. The default length is 8000 bp and the minimum length is 2000 bp.

The output file is final_prediction.csv. There are three columns in this CSV file: "contig_name, median_file_name, prediction".
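
A minimal sketch for loading the predictions in Python, assuming the file has no header row and that the columns appear in the order listed above. Both points are assumptions, so check your own file first. `read_predictions` is a hypothetical helper, not part of PhaGCN, and the family names in the example are placeholders.

```python
import csv
import io
from collections import Counter

# Column order taken from the description above.
# Assumption: the file has no header row.
COLUMNS = ("contig_name", "median_file_name", "prediction")


def read_predictions(handle):
    """Return one dict per non-empty CSV row, keyed by COLUMNS.

    Fields beyond the third column, if any, are ignored.
    """
    rows = []
    for fields in csv.reader(handle):
        if not fields:
            continue
        rows.append(dict(zip(COLUMNS, (f.strip() for f in fields))))
    return rows


# In-memory example with placeholder values; for a real run use
# open("final_prediction.csv", newline="") instead.
example = io.StringIO(
    "contig_1,file_0,FamilyA\n"
    "contig_2,file_0,FamilyB\n"
    "contig_3,file_1,FamilyA\n"
)
predictions = read_predictions(example)

# Count how many contigs were assigned to each predicted label.
family_counts = Counter(row["prediction"] for row in predictions)
print(family_counts.most_common())
```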

PhaGCN is trained only on the currently provided database, but you can update the database if required.

Notice

If you want to use PhaGCN, you need to take care of three things:

  1. Make sure all your contigs are virus contigs. You can separate out bacterial contigs by using PhaMer.
  2. The script will skip contigs with non-ACGT characters, which means those contigs will remain unpredicted.
  3. If the program outputs an error (which is caused by your machine): Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library. You can run the command export MKL_SERVICE_FORCE_INTEL=1 before running run_Speed_up.py.
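
To see in advance which contigs are likely to be left unpredicted (item 2), you could pre-screen the input. The sketch below is not part of PhaGCN: `read_fasta` and `split_contigs` are hypothetical helpers, lower-case bases are treated as valid, and a contig counts as long enough when its length is at least `min_len`. How PhaGCN itself handles these two cases is an assumption here.

```python
import io

VALID_BASES = set("ACGT")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_contigs(handle, min_len=8000):
    """Split contig names into (usable, skipped).

    A contig is 'usable' when it contains only A/C/G/T (case-insensitive,
    an assumption) and is at least min_len bases long.
    """
    usable, skipped = [], []
    for header, seq in read_fasta(handle):
        if len(seq) >= min_len and set(seq.upper()) <= VALID_BASES:
            usable.append(header)
        else:
            skipped.append(header)
    return usable, skipped


# Tiny in-memory example; min_len is lowered so the toy sequences qualify.
example = io.StringIO(">a\nACGTACGT\n>b\nACGNNACG\n>c\nAC\n")
usable, skipped = split_contigs(example, min_len=4)
print("usable:", usable)
print("skipped:", skipped)
```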

References

How to cite this tool:

Jiayu Shang, Jingzhe Jiang, Yanni Sun, Bacteriophage classification for assembled contigs using graph convolutional network, Bioinformatics, Volume 37, Issue Supplement_1, July 2021, Pages i25–i33, https://doi.org/10.1093/bioinformatics/btab293

Supplementary information

The supplementary file of the paper can be found in the supplementary folder.

Contact

If you have any questions, please email us: [email protected]

phagcn's Issues

Upgrade the database

Hi,
I would like to know how to upgrade the database, i.e. how to assign contigs to orders other than Caudovirales?

Thanks for the tool you developed.

Can it be used on human gut metagenome?

Hello, I would like to know whether this tool can be used on human gut metagenomes, since you tested it on an oyster metagenome. I would also like to know whether the database is appropriate for the human gut.

Thank you.

Confidence scores for predictions

Hi there,

After trying out HostG (thanks for your help and responses with that!) I spotted this and thought it'd be great to try out on the same data.

Similar to my question with HostG, I was wondering if the taxonomy predictions have any sort of confidence score associated with them, and if it was possible to output this with final_prediction.csv at all?

Currently, I can't tell if the predictions are confident ones, or simply the "closest" match (but may have very low confidence for the match?). I was also wondering if different taxonomic ranks had different confidence scores associated with them (again, similar to our conversation in the HostG thread - thanks again for implementing that!).

Also, just a quick side note that I think MCL should also be added to the dependency list in README.md? I had a failed attempt the first time and eventually noticed it failed from not having something from MCL loaded at the time...

Kind regards,
Mike.

Classify Phage - Genome Fractions

Hi,

I am classifying phages according to their taxonomy. I am having the issue that my FASTA files have fractions of the genome instead of the whole genome:

My fasta files have the format
Genome1.fasta:

>k141_291006
TCG...
>k141_386008
TCG....

The PhaGCN classifies this phage genome as:

k141_291006,19687,Casjensviridae,0.17071722 
k141_386008,108404,Herelleviridae,1.0

So it gives two different classifications (yes, one has probability 1.0 here; in other cases none has a clearly higher probability). The examples for this tool use whole genomes to classify the genome.

  1. Can I just concatenate the sequences and classify them as an entire genome?
  2. Should I align them against each other using multiple sequence alignment? I thought of aligning against a reference, but since I do not know which family they belong to, finding an accurate reference genome is hard and not a robust method.
  3. Should I classify according to the most probable classification and, if they are all very similar (like 0.5 and 0.4), take a consensus?

Best Regards and Thank You
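
One possible reading of option 3, sketched in Python: sum the scores per predicted family across all contigs of a genome and accept the top family only if it holds a large enough share of the total. This is an illustration of the question, not a method endorsed by the PhaGCN authors. `consensus_family` and the `min_share` threshold are hypothetical, and it assumes the last column of each output row is a score in [0, 1].

```python
from collections import defaultdict


def consensus_family(predictions, min_share=0.5):
    """Score-weighted vote over (family, score) pairs.

    Returns (family, share) when the top family holds at least min_share
    of the summed scores, otherwise (None, share).
    """
    totals = defaultdict(float)
    for family, score in predictions:
        totals[family] += score
    total = sum(totals.values())
    if total == 0:
        return None, 0.0
    best = max(totals, key=totals.get)
    share = totals[best] / total
    return (best if share >= min_share else None), share


# The two contigs from the post above.
rows = [("Casjensviridae", 0.17071722), ("Herelleviridae", 1.0)]
family, share = consensus_family(rows)
print(family, round(share, 3))
```

With several similar scores spread over different families, the function returns None for the family, which flags the genome for manual inspection rather than forcing a label.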

Suggestions on reproducibility and portability

Hi Jiayu, I'd like to run PhaGCN and some other tools from your team. But I found it's not easy to set up.

For PhaGCN, there are some known issues:

  1. Before installation

     # HOWTO: edit environment.yaml: change value of 'name' and 'prefix'.
    
  2. Installation

     # ISSUE: nothing provides __glibc >=2.17,<3.0.a0 needed by cudatoolkit-11.1.1-h6406543_10
     # HOWTO: update mamba/conda first.
    
  3. Running

     error: python: relocation error: /lib64/libpthread.so.0: symbol __h_errno, version GLIBC_PRIVATE not defined in file libc.so.6 with link time reference
     # HOWTO: run the two lines below, or append the two lines below to ~/.bashrc or miniconda3/envs/phagcn/etc/profile.d/conda.sh
     export LD_PRELOAD=/lib64/libpthread.so.0
     export LD_PRELOAD=/lib64/libc.so.6
    
  4. Making the threads of DIAMOND configurable. Hard-coded parameters make the script less portable.

     Option 1: Manually edit '8' to other values
     Option 2: add an option --threads
    
     #     parser.add_argument('--threads', type=int, default=40)
     #     make_diamond_cmd = 'diamond makedb --threads {:d} --in database/Caudovirales_protein.fasta -d database/database.dmnd'.format(args.threads)
     #     diamond_cmd = 'diamond blastp --threads {:d} --sensitive -d database/database.dmnd -q database/Caudovirales_protein.fasta -o database/database.self-diamond.tab'.format(args.threads)
    
  5. Allowing run_Speed_up.py to be runnable in other paths, not just the project home. I tried to edit it but later found there is more than one file to edit. I'd suggest making the output and temporary files configurable via the command line option. And also let the python path find other scripts and database path automatically, so it could be used to process multiple input files simultaneously.

Wei
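
The --threads option suggested in item 4 could look roughly like the sketch below. It is not the actual run_Speed_up.py code: `build_parser` and `diamond_commands` are hypothetical names, the default of 8 assumes that is the currently hard-coded value, and the DIAMOND arguments are copied from the commands quoted in the post. The commands are built as argument lists and only printed, not executed.

```python
import argparse
import shlex


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of a --threads option")
    parser.add_argument("--contigs", default="contigs.fa")
    parser.add_argument("--len", type=int, default=8000)
    # Hypothetical option; 8 is assumed to be the value currently hard-coded.
    parser.add_argument("--threads", type=int, default=8)
    return parser


def diamond_commands(threads, db_dir="database"):
    """Build the two DIAMOND commands quoted above as argument lists."""
    protein = f"{db_dir}/Caudovirales_protein.fasta"
    db = f"{db_dir}/database.dmnd"
    out = f"{db_dir}/database.self-diamond.tab"
    make_db = ["diamond", "makedb", "--threads", str(threads),
               "--in", protein, "-d", db]
    blastp = ["diamond", "blastp", "--threads", str(threads), "--sensitive",
              "-d", db, "-q", protein, "-o", out]
    return make_db, blastp


# Parse an explicit argument list so the example runs without a real CLI call.
args = build_parser().parse_args(["--threads", "4"])
for cmd in diamond_commands(args.threads):
    print(shlex.join(cmd))
```

Passing the lists to subprocess.run(cmd, check=True) instead of a formatted string with shell=True would also avoid quoting problems with unusual paths.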

Addition of an output argument

Hello,

Thank you for developing PhaGCN, the tool was easy to install and super fast to run on viral sequences!
However, I have some small suggestions regarding the usability of the tool:

  • If I am not mistaken, the tool runs and creates outputs directly in the script folder. It would be amazing to be able to specify where the outputs of the tool should be created.
  • Currently, when I need to re-run the tool several times on different datasets, I need to remove all the previous temporary outputs from the script folder before running the scripts.
  • Similarly, on an HPC system, the tool cannot be parallelized as currently implemented, because all jobs need to write to the script folder.

My current implementation downloads the repo for each job and moves the contigs to classify into the script folder. Once the tool has run, I move the outputs to the actual output location I need and delete the script repo. This is not unusable, but it probably could be improved.
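
The workaround described above can be sketched in Python. It differs in one respect: it copies an existing local clone into a temporary directory instead of downloading the repo again. `run_in_private_copy` is a hypothetical helper, and it assumes that final_prediction.csv is written to the working directory of the script, as the post suggests.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_private_copy(repo_dir, contigs, out_dir, command=None):
    """Run the tool inside a throwaway copy of the repo and keep only the result.

    Each call works in its own temporary directory, so several jobs can run
    side by side without overwriting each other's intermediate files.
    """
    repo_dir, contigs, out_dir = Path(repo_dir), Path(contigs), Path(out_dir)
    if command is None:
        # Command taken from the README; run relative to the copied repo.
        command = [sys.executable, "run_Speed_up.py", "--contigs", contigs.name]
    out_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "phagcn_copy"
        shutil.copytree(repo_dir, work)            # private copy of the scripts
        shutil.copy(contigs, work / contigs.name)  # input next to the scripts
        subprocess.run(command, cwd=work, check=True)
        result = out_dir / "final_prediction.csv"
        shutil.move(str(work / "final_prediction.csv"), str(result))
    return result


# Example call (paths are placeholders):
# run_in_private_copy("PhaGCN", "sample1/contigs.fa", "results/sample1")
```

Copying the whole repo for every job costs disk space and time if the database is large; it only sidesteps the shared script folder and is no substitute for a real output option.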

Pip failed error

When I run "conda env create -f environment.yaml -n phagcn", a "Pip failed" error occurs, as shown below. I upgraded pip but the same error occurred. Do you have any ideas on how to fix it? Thanks.

Channels:

  • pytorch
  • conda-forge
  • bioconda
  • defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: | By downloading and using the CUDA Toolkit conda packages, you accept the terms and conditions of the CUDA End User License Agreement (EULA): https://docs.nvidia.com/cuda/eula/index.html

done
Installing pip dependencies: \ Ran pip subprocess with arguments:
['/data5/baidefeng/miniconda3/envs/phagcn/bin/python', '-m', 'pip', 'install', '-U', '-r', '/data5/baidefeng/virus6/temp/PhaGCN/condaenv.zyqxs9c3.requirements.txt', '--exists-action=b']
Pip subprocess output:

Pip subprocess error:
ERROR: Exception:
Traceback (most recent call last):
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 180, in _main
status = self.run(options, args)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 205, in wrapper
return func(self, options, args)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/commands/install.py", line 269, in run
session = self.get_default_session(options)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 77, in get_default_session
self._session = self.enter_context(self._build_session(options))
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 87, in _build_session
session = PipSession(
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/network/session.py", line 275, in __init__
self.headers["User-Agent"] = user_agent()
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/network/session.py", line 132, in user_agent
linux_distribution = distro.linux_distribution() # type: ignore
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_vendor/distro.py", line 125, in linux_distribution
return _distro.linux_distribution(full_distribution_name)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_vendor/distro.py", line 681, in linux_distribution
self.version(),
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_vendor/distro.py", line 741, in version
self.lsb_release_attr('release'),
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_vendor/distro.py", line 903, in lsb_release_attr
return self._lsb_release_info.get(attribute, '')
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_vendor/distro.py", line 556, in get
ret = obj.__dict__[self._fname] = self._f(obj)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_vendor/distro.py", line 1014, in _lsb_release_info
stdout = subprocess.check_output(cmd, stderr=devnull)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/subprocess.py", line 415, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '('lsb_release', '-a')' returned non-zero exit status 127.
Traceback (most recent call last):
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/__main__.py", line 31, in <module>
sys.exit(_main())
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/cli/main.py", line 71, in main
return command.main(cmd_args)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 104, in main
return self._main(args)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 221, in _main
self.handle_pip_version_check(options)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 143, in handle_pip_version_check
session = self._build_session(
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 87, in _build_session
session = PipSession(
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/network/session.py", line 275, in __init__
self.headers["User-Agent"] = user_agent()
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_internal/network/session.py", line 132, in user_agent
linux_distribution = distro.linux_distribution() # type: ignore
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_vendor/distro.py", line 125, in linux_distribution
return _distro.linux_distribution(full_distribution_name)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_vendor/distro.py", line 681, in linux_distribution
self.version(),
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_vendor/distro.py", line 741, in version
self.lsb_release_attr('release'),
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_vendor/distro.py", line 903, in lsb_release_attr
return self._lsb_release_info.get(attribute, '')
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_vendor/distro.py", line 556, in get
ret = obj.__dict__[self._fname] = self._f(obj)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/site-packages/pip/_vendor/distro.py", line 1014, in _lsb_release_info
stdout = subprocess.check_output(cmd, stderr=devnull)
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/subprocess.py", line 415, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/data5/baidefeng/miniconda3/envs/phagcn/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '('lsb_release', '-a')' returned non-zero exit status 127.

failed

CondaEnvException: Pip failed

conda install failed

[yu@headnode@PhaGCN]$ conda  create -n phagcn -f environment.yaml
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - environment.yaml

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

How should I set the conda channels in .condarc to get your packages?
Thanks!

Error Generating Knowledge graph

Hi Kenneth,

I tried running the command python run_Speed_up.py --contigs contigs.fa and got the following error message:
---------------------------Generating Knowledge graph---------------------------
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
GCN Error for file contig_0
cat: 'pred/*': No such file or directory
Traceback (most recent call last):
  File "run_Speed_up.py", line 150, in <module>
    out = subprocess.check_call(cmd, shell=True)
  File "/home/user/anaconda3/envs/py3.6/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cat pred/* > final_prediction.csv' returned non-zero exit status 1.

Do you have any idea how it could be fixed?
Cheers,
Robby
