sejmodha / miscscripts

License: GNU General Public License v3.0

Python 100.00%

miscscripts's People

ziels flopezo sofsta

miscscripts's Issues

Function to download latest version human reference genome

Hi,

I am not sure if this repository is up to date. However, the link

ftp://ftp.ncbi.nih.gov/genomes/refseq/assembly_summary_refseq.txt

needs to be updated.

The output file human_genome.fa is empty.

non-GNU sed

Thank you for putting this script together! If you hit a KeyError: 'taxid' on a Mac, change the subprocess.call lines at lines 79 and 80 from

subprocess.call("sed -i '1d' assembly_summary_refseq.txt", shell=True)
subprocess.call("sed -i 's/^# //' assembly_summary_refseq.txt", shell=True)

to

subprocess.call("sed -i.bu '1d' assembly_summary_refseq.txt", shell=True)
subprocess.call("sed -i.bu 's/^# //' assembly_summary_refseq.txt", shell=True)

The -i.bu option edits the file in place and keeps a backup copy with the .bu suffix. The non-GNU (BSD) sed that ships with macOS requires a backup suffix after -i, so the bare -i form that GNU sed accepts fails there.
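
For anyone who would rather not depend on the local sed flavour at all, the same two edits can be done in Python. This is only a sketch of an alternative, not what the script does: the function name strip_summary_header is made up for illustration, and it assumes the summary file fits in memory.

```python
from pathlib import Path


def strip_summary_header(path):
    """Drop the first line, then remove a leading '# ' from the remaining lines.

    Mirrors the two sed commands quoted above without calling sed.
    (Hypothetical helper, not part of the original script.)
    """
    summary = Path(path)
    lines = summary.read_text(encoding="utf-8").splitlines(keepends=True)
    # Equivalent of sed '1d': discard the first line.
    lines = lines[1:]
    # Equivalent of sed 's/^# //': strip a leading "# " wherever it occurs.
    lines = [line[2:] if line.startswith("# ") else line for line in lines]
    summary.write_text("".join(lines), encoding="utf-8")
```

Calling strip_summary_header("assembly_summary_refseq.txt") in place of the two subprocess.call lines should behave the same on Linux and macOS, since no external tool is involved.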

UpdateKrakenDatabases.py how to set target download location

Hi sejmodha,

Great work putting this script together! This might be a noob question, but you mention that the script takes an optional command-line argument specifying the target location where the data should be downloaded and saved.

How do I do that? Can you perhaps give an example?

Cheers,

Sam
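
As a general illustration of the pattern being asked about, an optional positional argument is often read from sys.argv with a fallback to the current directory. The sketch below is an assumption about how such an argument could be handled, not a copy of what UpdateKrakenDatabases.py does; resolve_target_dir is a hypothetical name.

```python
import os
import sys


def resolve_target_dir(argv):
    """Return the download directory: argv[1] if given, else the current directory.

    Hypothetical helper; the real script may parse its argument differently.
    """
    target = argv[1] if len(argv) > 1 else os.getcwd()
    return os.path.abspath(target)


# Typical use inside a script (shown as comments so nothing runs on import):
#   target_dir = resolve_target_dir(sys.argv)
#   os.makedirs(target_dir, exist_ok=True)
#   os.chdir(target_dir)
```

If the script follows this pattern, the call would look like "python UpdateKrakenDatabases.py /path/to/target", with the path being whatever directory should receive the downloads.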

Error in using the database

Hi @sejmodha. I used your script to successfully set up the Kraken database.

But when I run Kraken, I get this error:
kraken: database ("/opt/apps/bioinfo/databases/kraken") does not contain necessary file database.kdb

This is the code I used:
DB=/opt/apps/bioinfo/databases/kraken
kraken --preload --db $DB sample.trimmed.fa --threads 15 --classified-out sample.classified --unclassified-out sample.unclassified > sample.kraken

Please advise.
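
A quick way to narrow this down is to check which of the files Kraken expects are actually present in the database directory. The sketch below assumes Kraken 1, whose manual lists database.kdb, database.idx, taxonomy/nodes.dmp and taxonomy/names.dmp as the minimum contents; missing_kraken_files is a hypothetical helper name. database.kdb is produced by the build step (kraken-build --build), so a directory that only holds downloaded sequences would be expected to fail this check. Whether the script itself runs that step is not stated in this thread.

```python
import os

# Minimum contents of a Kraken 1 database directory, per the Kraken manual.
REQUIRED_FILES = (
    "database.kdb",
    "database.idx",
    os.path.join("taxonomy", "nodes.dmp"),
    os.path.join("taxonomy", "names.dmp"),
)


def missing_kraken_files(db_dir):
    """Return the required files that are absent from db_dir (hypothetical helper)."""
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(db_dir, name))
    ]
```

Running missing_kraken_files("/opt/apps/bioinfo/databases/kraken") and getting a non-empty list would point to an incomplete build rather than a problem with the kraken command line.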

UpdateKrakenDatabases.py kraken input format error

Hi,
I got the following error while trying to convert sequences to Kraken input format:

Traceback (most recent call last):
  File "/home/jason/Documents/Databases/UpdateKrakenDatabases.py", line 118, in <module>
    get_fasta_in_kraken_format('human_genome.fa')
  File "/home/jason/Documents/Databases/UpdateKrakenDatabases.py", line 107, in get_fasta_in_kraken_format
    outseq=">"+seq_id+"|"+taxid+"\n"+str(seq)+"\n"
  File "/home/jason/anaconda3/lib/python3.9/site-packages/Bio/Seq.py", line 369, in __str__
    return self._data.decode("ASCII")
  File "/home/jason/anaconda3/lib/python3.9/site-packages/Bio/Seq.py", line 156, in decode
    return bytes(self).decode(encoding)
  File "/home/jason/anaconda3/lib/python3.9/site-packages/Bio/Seq.py", line 2911, in __bytes__
    raise UndefinedSequenceError("Sequence content is undefined")
Bio.Seq.UndefinedSequenceError: Sequence content is undefined

I would appreciate any help, since I'm stuck trying to build the database.
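
The traceback shows the failure happening when a sequence object is converted to a string. One possible workaround, offered as a sketch rather than a confirmed fix, is to rewrite only the FASTA header lines and copy the sequence lines through untouched, so no sequence objects are built at all. This assumes the input really is plain FASTA text; it will not help if human_genome.fa is empty or in another format. The header layout seqid|kraken:taxid|NNN follows the Kraken manual's convention for custom sequences, and the exact tag the script uses is an assumption. rewrite_fasta_headers is a hypothetical name.

```python
def rewrite_fasta_headers(in_path, out_path, taxid):
    """Copy a FASTA file, rewriting each header to '>seqid|kraken:taxid|<taxid>'.

    Returns the number of headers rewritten. Hypothetical helper, not the
    script's own get_fasta_in_kraken_format.
    """
    count = 0
    with open(in_path, encoding="ascii") as src, \
            open(out_path, "w", encoding="ascii") as dst:
        for line in src:
            if line.startswith(">"):
                # Keep only the sequence ID (text up to the first whitespace).
                fields = line[1:].split()
                seq_id = fields[0] if fields else "unnamed"
                dst.write(f">{seq_id}|kraken:taxid|{taxid}\n")
                count += 1
            else:
                # Sequence lines are copied unchanged.
                dst.write(line)
    return count
```

A return value of 0 would indicate that the input contained no FASTA records, which is worth ruling out before looking further at Biopython.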

pandas error

Thanks for the script, but I am getting this error:

updatekrakendb.py
/media/disk1_12TB/tallnut/db/kraken2/refseq
/media/disk1_12TB/tallnut/db/kraken2/refseq
Downloading human genome

--2021-02-18 10:07:12-- ftp://ftp.ncbi.nih.gov/genomes/refseq/assembly_summary_refseq.txt
=> ‘assembly_summary_refseq.txt’
Resolving ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)... 165.112.9.228, 2607:f220:41e:250::11, 2607:f220:41e:250::13, ...
Connecting to ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)|165.112.9.228|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /genomes/refseq ... done.
==> SIZE assembly_summary_refseq.txt ... 66360144
==> PASV ... done. ==> RETR assembly_summary_refseq.txt ... done.
Length: 66360144 (63M) (unauthoritative)

assembly_summary_refseq.txt 100%[=================================================================================================>] 68.47M 1.13MB/s in 97s

2021-02-18 10:08:54 (724 KB/s) - ‘assembly_summary_refseq.txt’ saved [71801632]

/home/tallnutt/d/scripts/updatekrakendb.py:81: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
assembly_sum = pd.read_table('assembly_summary_refseq.txt',dtype='unicode')
Traceback (most recent call last):
  File "/home/tallnutt/d/scripts/updatekrakendb.py", line 116, in <module>
    download_refseq_genome(9606,'human_genome_url.txt')
  File "/home/tallnutt/d/scripts/updatekrakendb.py", line 81, in download_refseq_genome
    assembly_sum = pd.read_table('assembly_summary_refseq.txt',dtype='unicode')
  File "/opt/miniconda3/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/opt/miniconda3/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "/opt/miniconda3/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/opt/miniconda3/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 22 fields in line 141712, saw 36
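
The parser error means one line of the summary file has a different number of tab-separated fields than the parser expected. The cause is not established in this thread, so the following is only a diagnostic sketch: it reads the file with the standard csv module, treats quote characters as ordinary text, and reports the line numbers whose field count does not match the header instead of aborting. The name read_assembly_summary is hypothetical, and the handling of the leading "#" lines is an assumption based on the sed commands mentioned in another issue.

```python
import csv


def read_assembly_summary(path):
    """Parse a tab-separated summary file, returning (rows, bad_line_numbers).

    Hypothetical helper; rows are dicts keyed by the header's column names.
    """
    rows, bad_lines = [], []
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters as ordinary text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = None
        for fields in reader:
            if not fields:
                continue  # blank line
            if header is None:
                if len(fields) == 1 and fields[0].startswith("#"):
                    continue  # comment line without tabs, e.g. a README pointer
                # Assumed layout: the header line may start with "# ".
                fields[0] = fields[0].lstrip("# ")
                header = fields
                continue
            if len(fields) != len(header):
                # Record the offending line instead of raising.
                bad_lines.append(reader.line_num)
                continue
            rows.append(dict(zip(header, fields)))
    return rows, bad_lines
```

With the rows in hand, the human entries could be selected with something like [r for r in rows if r["taxid"] == "9606"], and the reported line numbers can be inspected by hand to see what is wrong with them.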

Plasmid database

Hi @sejmodha
I realized the database name is HumanVirusBacteria. Does that mean only these three groups are in the database? What about plasmids?
