dominikbuchner / boldigger Goto Github PK



A python package to query different databases of boldsystems.org

License: MIT License

Python 100.00%

metabarcoding boldsystems identification-engine


boldigger's Issues

error "Unable to complete operation on element with key out"

Hey Dominik
I am unable to process my data because it shows the error "Unable to complete operation on element with key out".
Please suggest a possible solution for this.

With regards
Rishikesh Krishan Laxmi

Best fitting hit incorrect

Hi,

I used classification for COI and then ran the program to find the best fitting hit with both the JAMP and the BOLDigger method. I found a case where both methods choose a hit even though another publicly available hit with better taxonomic classification and higher similarity was found.

Please check the entry for ASV308 in the attached file. BOLDigger and JAMP choose a published hit with a classification to class level and a similarity of 95.12. However, in the sheet showing the 20 best hits, there are several published hits with higher similarity and better classification. The top hit has a similarity of >97% and classification down to species level (although given the similarity value, only genus level classification should be trusted, obviously).

BOLDResults_COI_cluster_reps_curated_no_contam.xlsx

A bug?

Cheers

Nauras

Passing literal html to 'read_html' is deprecated and will be removed in a future version.

Hello @DominikBuchner,

Thank you for creating this awesome tool!

Just to note that I've received the following warning when running BOLDIGGER via the command line (boldigger-cline ie_coi). It seems to run fine at this point, but might become an issue in the future?

boldigger/boldblast_coi.py:69: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.

Thanks,
Gert-Jan
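
A minimal sketch of the change the warning above asks for: wrap the HTML string in a `StringIO` object before handing it to `pandas.read_html`. The helper names `html_to_buffer` and `read_tables` are hypothetical and not part of BOLDigger; how line 69 of `boldblast_coi.py` actually builds its HTML string is not shown in this report.

```python
from io import StringIO


def html_to_buffer(html: str) -> StringIO:
    """Wrap a literal HTML string in a file-like object."""
    return StringIO(html)


def read_tables(html: str):
    """Parse all <table> elements from an HTML string without the FutureWarning."""
    # Lazy import: pandas needs lxml or bs4 + html5lib installed to parse HTML.
    import pandas as pd

    return pd.read_html(html_to_buffer(html))
```

With this, a call such as `pd.read_html(html)` becomes `pd.read_html(StringIO(html))`; the parsed tables should be the same.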

Filter possible NUMTS from BOLDigger assignments

It is possible that during metabarcoding amplification not only the mitochondrial gene but also nuclear copies of that gene (NUMTs) were amplified. These can lead to false positive detections and identifications in the BOLDigger and JAMP pipeline.

NCBI marks such copies as UNVERIFIED in its database (GenBank) if it detects internal stop codons or INDELs (within the internal sequence of genes, where there should not be any).

It might not be hard to incorporate the detection of these sequences into BOLDigger by adding a new FLAG if internal stop codons are detected in the sequences after assigning them (given that the stop codons depend on the taxonomic group to which each OTU belongs).

Thank you very much for this wonderful pipeline, best regards!
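
The stop codon check proposed above could be sketched roughly as follows. This is an illustration only, not BOLDigger code: the function names are hypothetical, only the forward strand is scanned, and the stop codon sets are those of NCBI translation tables 2 (vertebrate mitochondrial) and 5 (invertebrate mitochondrial). Choosing the table from the assigned taxonomy is left to the caller.

```python
# Stop codons per NCBI translation table (assumption: only these two tables are needed).
STOP_CODONS = {
    2: {"TAA", "TAG", "AGA", "AGG"},  # vertebrate mitochondrial
    5: {"TAA", "TAG"},                # invertebrate mitochondrial
}


def frames_with_stops(seq: str, table: int = 5) -> dict:
    """Return {frame: [start positions of stop codons]} for the three forward frames."""
    seq = seq.upper().replace("-", "")
    stops = STOP_CODONS[table]
    hits = {}
    for frame in range(3):
        hits[frame] = [
            i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in stops
        ]
    return hits


def possible_numt(seq: str, table: int = 5) -> bool:
    """Flag a sequence when every forward frame contains at least one stop codon."""
    return all(frames_with_stops(seq, table).values())
```

Requiring a stop in all three frames avoids having to know the correct reading frame of the amplicon, at the cost of missing NUMTs whose stop happens to fall only in the true frame.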

BOLD Not Responding: Attempting Retry

Description

The messages displayed in the screenshot (attached) are repeating every second. I'm uncertain on how long I should wait for the messages to load efficiently. Do you have any insights on this situation?

I am encountering the following log messages:

2023-01-29 19:04:06.902 python3.9[50142:2472810] +[CATransaction synchronize] called within transaction
2023-01-29 19:04:26.391 python3.9[50142:2472810] +[CATransaction synchronize] called within transaction

Additionally, I've included an example entry from my "*.fasta" file below. It contains the header information, followed by the quality score, and the actual sequence data shortened to "SEQUENCESEQUENCE"; the original length of the DNA sequence is about 190 bp.

My original fasta file contains many entries, but for now, I'm testing with just like 10 lines. Can you suggest a specific fasta file to use following a step such as Obitools?

***:50:HL725AFX3:1:11101:7839:1038 count=1128; obiclean_count={'XXX': 1296}; obiclean_head=True; obiclean_cluster={'XXX': 'NB552469:50:HL725AFX3:1:11101:7839:1038'}; obiclean_internalcount=0; obiclean_status={'XXX': 'h'}; obiclean_samplecount=1; obiclean_headcount=1; obiclean_singletoncount=0; 1:N:0:ATTGTAAT+NAACCGCG
SEQUENCESEQUENCE...

Screenshots

Environment

python3.9

Thanks for your time.
Onur.

ID description

Hi Dominik,

I'm working with BOLDigger v.2.0.3 (and boldigger-cline v.1.0.0) and have the issue that the "ID" in the results file starts with the greater-than (">") symbol (inherited from the fasta file description line). The fasta file is generated with Qiime2 and the ">" symbol unfortunately causes an error when using the BOLDigger taxonomy table together with the Qiime2 read table in TaxonTableTools.
So a fix of this issue would be quite helpful to streamline the usage.

Thanks!
Sascha

BOLDResults_test-dna-sequences_part_1.xlsx
test-dna-sequences.txt
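
Until this is handled in the tool, the leading ">" can be stripped from the IDs after the fact. A small sketch, assuming the IDs have already been read into a list of strings (the helper name `clean_id` and the example IDs are hypothetical):

```python
def clean_id(raw_id: str) -> str:
    """Remove a leading '>' and surrounding whitespace from a FASTA-derived ID."""
    return raw_id.strip().lstrip(">").strip()


ids = [">ASV_1", ">ASV_2 ", "ASV_3"]
cleaned = [clean_id(i) for i in ids]
```

If the results sheet is loaded as a pandas DataFrame, `df["ID"].str.lstrip(">")` should do the same for the whole column.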

Include BOLD API extra data for corrected list of top hits?

Hello!

I've been using BOLDigger for a while and I am so appreciative to have this tool at my fingertips! I'm recently exploring the option to generate a top hit list using the BOLDigger settings with the associated flags and correcting identifications using the BOLD API. I really like that the program has built in the functions to look at conflicting taxonomy and missing species names among the closely related (>98% similarity) top hits.

However, I'm finding that the loss of additional data for these lists (like the process ID, BIN, and location of the top hit) is unfortunate, as those pieces can really help guide the exploration of taxonomic assignments for my data.

So, that said, this isn't really an "issue" or a "bug", but rather a request to add this feature. Is this possible at all?

Thanks again for creating such an awesome program!
Monica

Private entry as top hit despite 100% similarity published entry available

Hi,

I am running BOLDigger v2.2.0 on the COI sequences in the txt file attached.

I am performing all the steps:

Running the identification engine with a batch size of 5.
Searching additional data
Adding top hits with BOLDigger method.
Download additional identification data via BOLD API for correction.

I encountered a weird behavior for MOTU23. See the top 20 hits for this MOTU:

BOLDigger hit sheet:

BOLDigger hit - API corrected sheet:

Why does BOLDigger end up with a private entry as the chosen hit? Why doesn't it choose a published entry? Why doesn't it choose Botrylloides leachi(i) as the top hit?

Sequences attached for you to reproduce the problem.

Cheers

nauras
seqs_test.txt

BOLD did not respond! several weeks?

Dear Dominik and BOLDigger users,

I wonder if you also experience the same cycle of:

10:19:50: Requesting BOLD. This will take a while.
10:19:51: Downloading results.
10:19:51: Parsing html.
10:19:51: BOLD did not respond! Retrying.

these days. I thought it was due to the BOLD maintenance some days ago, but perhaps not?

I tried decreasing the batch size to 10 and then to 1, which was finally successful, but after a while it got stuck again.
Now I also noticed that the single sequence in *_done.fasta is incomplete and its last line remained in the original fasta file (attached). Maybe here we have a new format error to include in the checks?
The fasta file passed the BOLDigger check and when I copied the sequences directly into ID engine on BOLD website, I got results normally.

Thanks,
Ondrej

P.S.:
With this issue I realized how wonderful it would be if BOLDigger did not crash every time I close its process window or when the pc goes to sleep unintentionally. This way I have to enter the folder, file and account info again and again (for some reason Ctrl+C works only once there or not even once in case I click elsewhere before pasting :D )
boldigger_issue.zip

windows: crash fixed by updating beautifulsoup4

BOLDigger crashes on windows 10, claiming Pandas needs beautifulsoup4 v 4.11.1 or newer (my machine had 4.10 installed).

Fixed by forcing an upgrade of bs4 using:
pip install --upgrade beautifulsoup4

Upload fasta with more than 100 seqs?

Hi @DominikBuchner

very cool tool and very easy to use. fantastic. really appreciate it.

The documentation didn't really say, but is your tool actually meant to be used with fastas containing more than 100 sequences, with the algorithm automatically splitting them into batches of 100 seqs max, or does it only work with fastas containing a maximum of 100 seqs?

cheers
Nauras
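
For readers wondering what such batching looks like in principle, here is a generic sketch of splitting FASTA records into batches of a fixed maximum size. It is not taken from BOLDigger; `read_fasta` and `batches` are hypothetical helpers and the default of 100 simply mirrors the number mentioned in the question.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines anywhere in the file
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batches(records, batch_size=100):
    """Yield lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each batch can then be submitted separately, so the size of the input file itself does not need to be limited.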

BOLD Not Responding: Attempting Retry

Hi, I ran the rbcl analysis on BOLDigger with a batch size of 3 (since it is reported that it should be less than 5). However, I encountered the error "BOLD did not respond! Retrying." How can I solve this issue?

Thanks for your time.
Anil

I added my fasta file if you want to look at it.
M6_1-2-3_C10r005_twoline.zip

Correction of top hits via BOLD API not initiated

Hi @DominikBuchner

The correction of the top hits is exactly what I need. There are quite a few cases where the top hit has no species level assignment and higher resolution assignment is masked (or "hidden" as you called it).

The problem is that the correction is just not being initiated. I am using the GUI for the BOLD API correction (as it doesn't seem to be implemented yet in the command line version). So I did my identification, downloaded additional data and got my BOLDigger top hits. All good. I checked the fasta again, and boldigger says it's fine. But when I select my results file and the corresponding fasta at the bottom of the GUI and click run, the small window that always pops up when a boldigger process is being initiated, and which shows what's going on, appears for about half a second, and then boldigger shuts down: the GUI and the boldigger.exe window disappear.

Any idea what is going on here? I would really like to use it as It's a pain to sort this out manually.

Cheers

nauras

Mac OSX tkinter.TclError - fix inside

Hi
On some OSX systems the GUI does not start because of a tkinter error:

Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/site-packages/boldigger/__main__.py", line 119, in <module>
main()
File "/usr/local/lib/python3.7/site-packages/boldigger/__main__.py", line 47, in main
event, values = window.read(timeout = 100)
File "/usr/local/lib/python3.7/site-packages/PySimpleGUI/PySimpleGUI.py", line 6957, in Read
results = self._read(timeout=timeout, timeout_key=timeout_key)
File "/usr/local/lib/python3.7/site-packages/PySimpleGUI/PySimpleGUI.py", line 6995, in _read
self._Show()
File "/usr/local/lib/python3.7/site-packages/PySimpleGUI/PySimpleGUI.py", line 6831, in _Show
StartupTK(self)
File "/usr/local/lib/python3.7/site-packages/PySimpleGUI/PySimpleGUI.py", line 11301, in StartupTK
ConvertFlexToTK(my_flex_form)
File "/usr/local/lib/python3.7/site-packages/PySimpleGUI/PySimpleGUI.py", line 11203, in ConvertFlexToTK
PackFormIntoFrame(MyFlexForm, master, MyFlexForm)
File "/usr/local/lib/python3.7/site-packages/PySimpleGUI/PySimpleGUI.py", line 10577, in PackFormIntoFrame
photo = tk.PhotoImage(data=element.Data)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/tkinter/__init__.py", line 3545, in __init__
Image.__init__(self, 'photo', name, cnf, master, **kw)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/tkinter/__init__.py", line 3501, in __init__
self.tk.call(('image', 'create', imgtype, name,) + options)
_tkinter.TclError: couldn't recognize image data

This seems to be a problem with the Tcl/Tk version 8.5 that comes with the python3 version installed with 'brew'. It can be fixed by uninstalling python3 and manually reinstalling it with the latest version from https://www.python.org/downloads/.

need to check end of fasta file

Hello, first of all, big thanks and thumbs up for this wonderful tool!

If my fasta file comes directly from JAMP's <0.01% read abundance filtering (as attached error_1.fasta), I need to remove the last 2 lines:

>below_0.01
 NA

otherwise boldigger crashes and terminal shows an error UnboundLocalError: local variable 'index' referenced before assignment
I also had to make sure an empty line is after the last sequence. Because if number of OTUs (in the last batch) is equal to batch size and there's no empty line after the last OTU sequence, I get the error as well.
While playing around with the issue, it also sometimes crashed with IndexError: list index out of range and the same traceback (see below). Seems like this happens when the empty line is the only thing in the last batch, which is no problem, because all OTUs are already processed at that point.

I am new to python and now I have no time to suggest a fix. So in case you decide that it's not worth fixing in the code, it would be useful at least to tell users through README that they might need to edit their fastas as I did. Good luck!

Traceback (most recent call last):
File "/home/ono/.local/bin/boldigger", line 11, in <module>
sys.exit(main())
File "/home/ono/.local/lib/python3.6/site-packages/boldigger/__main__.py", line 60, in main
boldblast_coi.main(session, values['fasta_path'], values['output_folder'], values['batch_size'])
File "/home/ono/.local/lib/python3.6/site-packages/boldigger/boldblast_coi.py", line 246, in main
dataframes = save_as_df(html_list, sequences_names[querys.index(query)])
File "/home/ono/.local/lib/python3.6/site-packages/boldigger/boldblast_coi.py", line 104, in save_as_df
cols = dataframes[index].columns.tolist()
UnboundLocalError: local variable 'index' referenced before assignment
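
As a stopgap, the manual edits described in this issue can be automated before the file is passed to BOLDigger. The sketch below is a hypothetical pre-processing helper, not part of the package: it drops records whose sequence is empty or "NA" (such as JAMP's ">below_0.01" entry) and makes sure the output ends with a newline. It assumes that ">" never occurs inside a header.

```python
def clean_fasta_text(text: str) -> str:
    """Drop records without a usable sequence and end the file with a newline."""
    records = []
    for block in text.split(">")[1:]:
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        header, sequence = lines[0], "".join(lines[1:])
        # JAMP's abundance filter writes a ">below_0.01" record whose sequence is "NA".
        if not sequence or sequence.upper() == "NA":
            continue
        records.append(f">{header}\n{sequence}\n")
    return "".join(records)
```

Whether the trailing newline is still required depends on the BOLDigger version; the report above only establishes it for the version used at the time.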

Wrong file format while using "JAMP pipeline"

When adding top hits with the JAMP method, it said the wrong file format. However, using the first hit method was fine.

Hymenoptera issue

Hey Dominic,

First: BOLDIGGER is truly a great tool. We do a lot of insect metabarcoding at the moment and wouldn't get such high quality results without it, so thank you so much for constructing it!

One small issue keeps coming up that I think you should know: Hymenoptera hits don't get transported correctly to the results. Here's an example in the BOLDResults file:

And here's the hits on BOLD website:

I also notice that the order of the ASVs in the 'BOLDigger hit' tab of the BOLDResults excel file is skewed whenever a Hymenopteran comes up:

versus the order in the first tab (starting with 'Run dd/mm/yyyy tt.tt'):

0cc778207a38b7a6e881d96d2ae63bbf
1a994f0b7d83ca1fd64f85e5463ab493
20f61de5ad7c87a8320d5a7c6c403aae
28052a092ec236890bf2034336b82950 (this one is different, but also a Hymenopteran)
4a8e00925b4fcd21335a0ee102ffcba2

I'm not sure what could be causing it, but just wanted to flag it. The primer pair I use (mlCOIintF, HCO2198) is not great at Hymenopterans (or so I understand from literature) but I'm wondering if it could have to do with the bioinformatics..

Thanks for your thoughts in advance, hope I explained it well to you,

Best regards,

Marcel Polling

Bold not responding

Hi and thank you for developing this tool
We have tried to run our fasta file with BOLDigger, but unfortunately we get an error message saying that BOLD is not responding when it tries to download the results.

We have run the cline pipeline with your test database and get the same result. Any suggestions on how to solve this?
This is the command we have tried:
boldigger_cline ie_coi username password C:\Users\elbr2874\Downloads\COI.fasta C:\Users\elbr2874\Desktop\boldresults
Best
Francisco

Assigning BIN to top hits

Getting Barcode Index Numbers (BINs) could be useful for certain analyses as an extra source of information, especially in datasets with many undescribed species.

I imagine the procedure for assigning BIN as something along the lines of:

Initially use the same algorithm as for assigning JAMP hits.
If the threshold is set at 98% at step 2, then attempt to assign a BIN.
For this, use the same subset of hits as used in step 5 of the JAMP algorithm (“Look for the most common hit that has no missing values”) and check for the number of unique BINs within this subset.
If this number is 1, assign that unique BIN.
If more than one BIN is found in the subset, do not assign a BIN (i.e., leave the BIN field empty). Perhaps these cases could be flagged (with a number and a list of the unique BINs found).
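
The last three steps of the proposed procedure could look roughly like this. It is a sketch of the suggestion only: `assign_bin` is a hypothetical function, the input is assumed to be the list of BIN strings for the subset of hits selected in step 5 of the JAMP algorithm, and the BIN identifiers in the usage example are made up.

```python
def assign_bin(bins):
    """Return (assigned_bin, flag) for the BINs of the hits in the chosen subset."""
    unique = sorted({b.strip() for b in bins if b and b.strip()})
    if len(unique) == 1:
        return unique[0], ""  # exactly one BIN in the subset: assign it
    if not unique:
        return "", "no BIN available"
    # several BINs: leave the field empty and report what was found
    return "", f"{len(unique)} BINs: {', '.join(unique)}"


# Hypothetical BIN values for illustration only.
single = assign_bin(["BOLD:AAA1234", "BOLD:AAA1234", ""])
conflict = assign_bin(["BOLD:AAA1234", "BOLD:AAB5678"])
```

Empty or missing BIN fields are ignored when counting, so a subset with one BIN plus some hits without a BIN still gets that BIN assigned; whether that is desirable is a design decision for the maintainer.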

xlsx output file error?

Hi @DominikBuchner

I am having issues making sense of the xlsx output file. I have a fasta with 99 seqs, ASV_1 to ASV_99.

However, the first column of the xlsx output file only contains odd numbered ASV IDs and some raw sequences in between. Where are my even numbered ASVs and what do the raw sequences mean? Further, the file only contains ASVs down to ASV_87, don't know what happened to the rest. I checked my input fasta, all 99 ASVs are definitely there and have the regular format >ASV_x followed by the raw sequence on the next line.

My output file is attached.

BOLDResults_queryFile.fa.1.0.560931114014238.xlsx

Thanks for having a look at it. I really would like to use your tool, but given the current output this seems a bit difficult ;-)

cheers
Nauras

Discrepancy between BOLDigger output and BOLD's identification engine

Hi,

I came across something weird by chance while going through my BOLDigger output file.

I have ~8,000 COI metabarcoding sequences which I classified with BOLDigger. I was using boldigger-cline v2.1.2 at that time, which was in July 2023. I opened this issue here because I don't think this is an issue specific to the commandline tool.

Below are the top 20 hits for ASV17731:

When I manually check this sequence on BOLD's website against the All Barcode Records on BOLD database, I get the following nearest matches:

What bothers me is that these 20 matches in BOLD and BOLDigger are almost identical. But BOLDigger says this ASV has 87.38% similarity with the insect family Chironomidae, while on BOLD, this similarity value is attributed to a taxon of Ochrophyta.

I checked this now, in August 2023, so a month after the initial classification. But I honestly don't think that the time gap has anything to do with it. Or am I wrong?

How can this sequence - according to BOLDigger - have the exact same similarity value for an insect family as well as an alga, while the former is not even listed in the output when I manually consult the BOLD identification engine?

This is the sequence in case you would like to reproduce the problem:

ASV17731
ATTATCATCTATTCAAGCGCATTCAGGGCCTTCAGTAGATATGGCGATTTTTAGTTTACATTTATCAGGTGCAGGTTCTATTTTAGGAGCAATTAATTTTATTGTAACTATCTTTAACATGCGTGCCCCAGGACTTTTCTTACATAAAATGCCTCTTTTTGTATGATCTGTATTAGTAACTGCATTTTTACTTTTATTATCTTTACCAGTTTTCGCTGGAGCAATTACTATGCTTTTAACAGATCGTAACTTTAATACAAGCTTTTATGATCCTGCCGGAGGAGGAGATCCAGTATTATACCAACATCTTTTC

Cheers

nauras

Dataframe index issues in boldblast_coi.py, with workaround

Kudos for creating BOLDigger! This is the smoothest way to use BOLD I have encountered so far.
I encountered this error while trying to launch the Boldigger GUI from PowerShell:

PS C:\Users\username> boldigger
Traceback (most recent call last):
File "c:\users\username\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
return run_code(code, main_globals, None,
File "c:\users\username\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in run_code
exec(code, run_globals)
File "C:\Users\username\AppData\Local\Programs\Python\Python38-32\Scripts\boldigger.exe_main.py", line 7, in
File "c:\users\username\appdata\local\programs\python\python38-32\lib\site-packages\boldigger_main.py", line 70, in main
boldblast_coi.main(session, values['fasta_path'], values['output_folder'], values['batch_size'])
File "c:\users\username\appdata\local\programs\python\python38-32\lib\site-packages\boldigger\boldblast_coi.py", line 247, in main
dataframes = save_as_df(html_list, sequences_names[querys.index(query)])
File "c:\users\username\appdata\local\programs\python\python38-32\lib\site-packages\boldigger\boldblast_coi.py", line 105, in save_as_df
cols = dataframes[index].columns.tolist()
UnboundLocalError: local variable 'index' referenced before assignment

The same issue occurred under Windows 10 and Ubuntu 18.04 with Python 3.8.5 and a fresh install of Boldigger 1.2.1. Looking at line 105 in boldblast_coi.py, there is potential for a variable scope error:

    ## save columns before to sort them after
    cols = dataframes[index].columns.tolist()

This 'index' is the iterator variable in the preceding for...range loop. I inserted an explicit 'index = 0' before the loop, which lets the GUI launch. However, when I try to run the IDS query, I get a new error in the same statement:

PS C:\Users\username> boldigger
Traceback (most recent call last):
File "c:\users\username\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
return run_code(code, main_globals, None,
File "c:\users\username\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in run_code
exec(code, run_globals)
File "C:\Users\username\AppData\Local\Programs\Python\Python38-32\Scripts\boldigger.exe_main.py", line 7, in
File "c:\users\username\appdata\local\programs\python\python38-32\lib\site-packages\boldigger_main.py", line 70, in main
boldblast_coi.main(session, values['fasta_path'], values['output_folder'], values['batch_size'])
File "c:\users\username\appdata\local\programs\python\python38-32\lib\site-packages\boldigger\boldblast_coi.py", line 248, in main
dataframes = save_as_df(html_list, sequences_names[querys.index(query)])
File "c:\users\username\appdata\local\programs\python\python38-32\lib\site-packages\boldigger\boldblast_coi.py", line 106, in save_as_df
cols = dataframes[index].columns.tolist()
IndexError: list index out of range

I think the root of the problem this time was that my COI FASTA file included some sequences that are too short for IDS; six sequences out of 3747 were shorter than 30 nt, and any batch of 100 sequences containing these failed to process. Somehow that led to 'dataframes' being assigned an empty list (verified by 'print(len(dataframes))' ) instead of 'nomatch_df'. This issue was partially fixable by adding a check for 'dataframes' being empty:

    ## add process IDs to published sequence
    **index = 0**
    for index in range(len(dataframes)):
        {...}

    ## save columns before to sort them after
    cols = dataframes[index].columns.tolist() **if len(dataframes) > 0 else []**
    ## add sequence names to dataframes
    for index in range(len(dataframes)):
        {...}

With the additions in boldface, BOLDigger completed successfully, although batches including reads too short for IDS led to empty dataframes, effectively skipping that batch. Adding additional data and selecting best hits for the downloaded hits worked without issues.

I have not investigated the scripts for the other databases in detail, but if the same code was used for 'save_as_df', the same problem could occur for ITS and rbcL/Matk.
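
A complementary workaround is to remove the short reads before submission, so that no batch is lost because of them. The sketch below is hypothetical helper code, not part of BOLDigger; the default cutoff of 30 nt only reflects the observation in this report and is not a documented limit of the BOLD identification engine.

```python
def split_by_length(records, min_length=30):
    """Separate (name, sequence) pairs into long-enough and too-short lists."""
    keep, too_short = [], []
    for name, seq in records:
        (keep if len(seq) >= min_length else too_short).append((name, seq))
    return keep, too_short


records = [("OTU_1", "A" * 29), ("OTU_2", "A" * 30), ("OTU_3", "A" * 313)]
keep, too_short = split_by_length(records)
```

Keeping the rejected reads in a separate list makes it easy to report them to the user instead of dropping them silently.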

Damaged BOLD records stop the identification engine

For unknown reasons some of BOLD's records are broken, leading to a Top 20 hit table that never loads. BOLDigger goes through the maximum number of retries for those specific entries until it stops working and crashes.

Fix needed:

maybe assign "broken record" to those hits and retrieve the top hit later on via API?
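
One way the suggested fix could be structured: bound the number of retries and return a placeholder instead of crashing, so the affected queries can be resolved later. This is a sketch under assumptions, not the implemented fix; `fetch` stands for whatever callable downloads and parses the hit table, and `fetch_with_fallback` is a hypothetical name.

```python
def fetch_with_fallback(fetch, query, max_retries=3, placeholder="broken record"):
    """Call fetch(query) up to max_retries times; return a placeholder on failure.

    Returns a tuple (result, attempts_used).
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(query), attempt
        except Exception:  # in real code, catch the specific network/parse errors
            continue
    return placeholder, max_retries
```

Queries that come back as "broken record" could then be collected and looked up through the API in a second pass.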

boldsystems switched to https protocol, this broke the code

boldsystems.org changed their protocol to https which broke the code. Fix is coming.

BOLDigger hit type 2 overflagging

Many records in the BOLD System have in their specific epithet information that does not correspond exactly to the species name.
e.g.: sp. a AK-2021, sp. (Johor), communis A1A2, cf. alpium.

Because of this many hits are labelled in the Boldigger hit pipeline as type 2.

Therefore I think it could be interesting to add a species name cleaning step at the beginning of the Boldigger hit process as follows:

Delete the species name completely if it contains:
"sp." lack species name (e.g. sp. CFJS-2021b)
"cf." doubtful species name (e.g. cf. micrura)
"aff." doubtful species name (e.g. aff. hornsundi)
"grp." group, doubtful species name (e.g. pedellus grp.)
" / " doubtful species name

Erase after: (To leave only the species name)
" ssp." subspecies name
" var." variant name (e.g. australogibba var. subcapitata)
" " additional information for a species added after the species name (e.g. bilobata CEA)

After this I would delete the entries containing numbers (e.g. sp0949C, Malaise3164).

One element on which I doubt whether or not it should be deleted is hybrids. It might be interesting to remove them by default but leave a command as an option not to do so.
" x " hybrids (e.g. pennsylvanicus x firmus)
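
The cleaning rules above could be sketched as a single function. This is an illustration of the proposal, not BOLDigger code; `clean_epithet` and `remove_hybrids` are hypothetical names. The markers are matched as whole tokens rather than substrings, because a plain substring test for "sp." would also match "ssp.".

```python
import re

DOUBT_MARKERS = ("sp.", "cf.", "aff.", "grp.")


def clean_epithet(epithet: str, remove_hybrids: bool = True) -> str:
    """Return a cleaned specific epithet, or '' when the name is not usable."""
    name = epithet.strip()
    tokens = name.split()
    # Rule 1: doubtful or missing species names are deleted completely.
    if any(tok in DOUBT_MARKERS for tok in tokens) or " / " in name:
        return ""
    # Hybrids are checked before truncation, otherwise only one parent would remain.
    if " x " in name:
        return "" if remove_hybrids else name
    # Rule 2: keep only the first word (drops " ssp.", " var." and trailing codes).
    name = tokens[0] if tokens else ""
    # Rule 3: drop codes containing digits, such as "sp0949C" or "Malaise3164".
    if re.search(r"\d", name):
        return ""
    return name
```

Applied to the examples from this issue, "australogibba var. subcapitata" becomes "australogibba", "bilobata CEA" becomes "bilobata", and "sp. CFJS-2021b", "cf. micrura" and "Malaise3164" are emptied.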

Finding best fitting hit fails for JAMP and BOLDigger method

Hi,

I am having issues getting the top hit for the JAMP and BOLDigger methods. I am using the most recent version.

Error message (BOLDigger method, JAMP method fails with the same error):

$ boldigger-cline digger_hit BOLDResults_COI_cluster_reps_curated_no_contam.xlsx
12:08:01: Opening resultfile.
12:08:23: Filtering data for JAMP hits.
Traceback (most recent call last):
File "c:\users\nauras\programs\python\python39\lib\runpy.py", line 197, in _run_module_as_main
return run_code(code, main_globals, None,
File "c:\users\nauras\programs\python\python39\lib\runpy.py", line 87, in run_code
exec(code, run_globals)
File "C:\Users\Nauras\Programs\Python\Python39\Scripts\boldigger-cline.exe_main.py", line 7, in
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger_cline_main.py", line 70, in main
digger_sort.main(args.xlsx_path)
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger_cline\digger_sort.py", line 15, in main
jamp_hits = [jamp_hit(otu) for otu in otu_dfs]
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger_cline\digger_sort.py", line 15, in
jamp_hits = [jamp_hit(otu) for otu in otu_dfs]
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger\jamp_hit.py", line 55, in jamp_hit
threshold, level = get_threshold(df)
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger\jamp_hit.py", line 13, in get_threshold
elif threshold >= 98:
TypeError: '>=' not supported between instances of 'str' and 'int'

There seems to be an error with how similarity values are being assessed? The first hit method works fine. My results file is attached.
BOLDResults_COI_cluster_reps_curated_no_contam.xlsx. The same issue appears with the commandline version.

Cheers

Nauras
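
The TypeError above indicates that at least one similarity value reached the comparison as a string. Why that happens in this results file is not established here, but a defensive conversion before comparing would avoid the crash. A minimal sketch with a hypothetical helper `to_similarity`; treating a comma as a decimal separator is an assumption about how the file may have been edited.

```python
def to_similarity(value):
    """Convert a similarity cell to float; return None when it is not numeric."""
    try:
        # Assumption: a comma may have been used as the decimal separator.
        return float(str(value).strip().replace(",", "."))
    except ValueError:
        return None


values = ["98.5", 97, "95,12", "No Match"]
converted = [to_similarity(v) for v in values]
```

For a whole column in pandas, `pd.to_numeric(column, errors="coerce")` offers a similar conversion, turning non-numeric cells into NaN.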

Subspecies information dropped from best fitting hit

Hi,

any reason why subspecies assignments are dropped when BOLDigger v2.2.0 chooses the best hit?

Cheers

Nauras

trouble running jamp_hit and digger_hit

Hello @DominikBuchner ,

I am struggling with getting jamp and boldigger methods to run, both in command line and interface. This is the first dataset I am running since I have updated to the new version, and it keeps failing when creating a new sheet:

13:27:20: Opening resultfile.
13:27:47: Filtering data for JAMP hits.
13:30:35: Extracting additional data.
13:31:41: Flagging the hits.
13:34:59: Saving result to new tab.
Traceback (most recent call last):
File "/home/filipamsmartins/.local/bin/boldigger-cline", line 8, in
sys.exit(main())
File "/home/filipamsmartins/.local/lib/python3.8/site-packages/boldigger_cline/main.py", line 70, in main
digger_sort.main(args.xlsx_path)
File "/home/filipamsmartins/.local/lib/python3.8/site-packages/boldigger_cline/digger_sort.py", line 30, in main
save_results(xlsx_path, output)
File "/home/filipamsmartins/.local/lib/python3.8/site-packages/boldigger/digger_sort.py", line 128, in save_results
writer.book = wb
AttributeError: can't set attribute

Thank you in advance for your help. Best,
Filipa

First hit on corrupt excel files

The first hit sorting method fails on some excel files.
Not yet sure what causes the error since the other options work fine.
Fix needed.

Long fasta files (>50 seqs) stop the identification engine

For fasta files with many sequences, the identification engine stops after 50 or 100 records, and enters a loop in which no data is retrieved. If the name of the fasta file is changed, then it works again for another 50-100 records. We solved the problem by submitting a different fasta file per sequence (or small batch of sequences) and then merging the information at the end, but it would be good that boldigger did that automatically. I believe BOLD is blocking the retrieval of too many records from the same request file?

API verification fails

API verification fails in version 2.0.5, just like in the command line version. Very briefly (for a split second), something like the following pops up:

14:55:20: Starting API verification.
14:55:20: Collection OTUs without species level identification and high similarity.

Then Boldigger just terminates. Below the error message I got when running the command line version. Guess it is applicable for the GUI version as well:

Traceback (most recent call last):
File "c:\users\nauras\programs\python\python39\lib\runpy.py", line 197, in _run_module_as_main
return run_code(code, main_globals, None,
File "c:\users\nauras\programs\python\python39\lib\runpy.py", line 87, in run_code
exec(code, run_globals)
File "C:\Users\Nauras\Programs\Python\Python39\Scripts\boldigger-cline.exe_main.py", line 7, in
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger_cline_main.py", line 73, in main
api_verification.main(args.xlsx_path, args.fasta_path)
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger_cline\api_verification.py", line 17, in main
raw_data, data_to_check, seq_dict = extract_data(xlsx_path, fasta_path)
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger\api_verification.py", line 31, in extract_data
raw_data = pd.read_excel(xlsx_path, sheet_name = 'BOLDigger hit')
File "c:\users\nauras\programs\python\python39\lib\site-packages\pandas\util_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "c:\users\nauras\programs\python\python39\lib\site-packages\pandas\io\excel_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "c:\users\nauras\programs\python\python39\lib\site-packages\pandas\io\excel_base.py", line 867, in init
self._reader = self.enginesengine
File "c:\users\nauras\programs\python\python39\lib\site-packages\pandas\io\excel_xlrd.py", line 22, in init
super().init(filepath_or_buffer)
File "c:\users\nauras\programs\python\python39\lib\site-packages\pandas\io\excel_base.py", line 353, in init
self.book = self.load_workbook(filepath_or_buffer)
File "c:\users\nauras\programs\python\python39\lib\site-packages\pandas\io\excel_xlrd.py", line 37, in load_workbook
return open_workbook(filepath_or_buffer)
File "c:\users\nauras\programs\python\python39\lib\site-packages\xlrd_init.py", line 170, in open_workbook
raise XLRDError(FILE_FORMAT_DESCRIPTIONS[file_format]+'; not supported')
xlrd.biffh.XLRDError: Excel xlsx file; not supported

Seems like a problem with reading Excel files?? BOLDigger results file with BOLDigger top hits attached.
trial.xlsx

Cheers

Nauras
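
The final line of the traceback comes from xlrd, which since version 2.0 only reads the old .xls format. A likely remedy is to have pandas use openpyxl for .xlsx files. The sketch below shows one way to pick the engine explicitly; `excel_engine` and `read_sheet` are hypothetical helpers and not part of BOLDigger.

```python
from pathlib import Path

ENGINES = {".xlsx": "openpyxl", ".xlsm": "openpyxl", ".xls": "xlrd"}


def excel_engine(path: str) -> str:
    """Pick a pandas Excel engine from the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return ENGINES[suffix]
    except KeyError:
        raise ValueError(f"unsupported Excel extension: {suffix!r}") from None


def read_sheet(path: str, sheet_name: str):
    """Read one sheet with an explicitly chosen engine."""
    # Lazy import: openpyxl must be installed for .xlsx files.
    import pandas as pd

    return pd.read_excel(path, sheet_name=sheet_name, engine=excel_engine(path))
```

For the call shown in the traceback this would amount to `pd.read_excel(xlsx_path, sheet_name='BOLDigger hit', engine='openpyxl')`.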

Ubuntu: rbcl fasta file run error

Hi, I'm using the BOLDigger v1.5.6 with Python 3.8 and there was an error:

Traceback (most recent call last):
File "/home/yao/.local/bin/boldigger", line 8, in
sys.exit(main())
File "/home/yao/.local/lib/python3.8/site-packages/boldigger/main.py", line 108, in main
boldblast_rbcl.main(session, values['fasta_path'], values['output_folder'], values['batch_size'])
File "/home/yao/.local/lib/python3.8/site-packages/boldigger/boldblast_rbcl.py", line 70, in main
tables = asyncio.run(as_session(links))
File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/home/yao/.local/lib/python3.8/site-packages/boldigger/boldblast_its.py", line 70, in as_session
return await asyncio.gather(*tasks)
File "/home/yao/.local/lib/python3.8/site-packages/boldigger/boldblast_its.py", line 46, in as_request
table = pd.DataFrame([[0], ['No Match'] * 7 + [np.nan] * 3 + [''] * 2] * 20,
File "/home/yao/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 721, in init
arrays, columns, index = nested_data_to_arrays(
File "/home/yao/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 519, in nested_data_to_arrays
arrays, columns = to_arrays(data, columns, dtype=dtype)
File "/home/yao/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 883, in to_arrays
content, columns = _finalize_columns_and_data(arr, columns, dtype)
File "/home/yao/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 985, in _finalize_columns_and_data
raise ValueError(err) from err
ValueError: 13 columns passed, passed data had 12 columns

My input file is .fas file and contains 100 sequences, the batch size was set as 5.
Does anyone know how to fix it? Thanks a lot!!

No Matches break the ITS and rbcl ID Engine atm

Wrong number of columns passed to the id engine code for ITS and rbcl, will fix soon
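
A column-count mismatch like this one can be avoided by deriving the placeholder rows from the column list itself rather than hard-coding their width. A small sketch, assuming a "No Match" table of 20 rows as in the traceback of the previous issue; the function name and the column names in the usage example are hypothetical.

```python
def no_match_rows(columns, defaults=None, n_hits=20, fill="No Match"):
    """Build n_hits placeholder rows whose width always equals len(columns)."""
    defaults = defaults or {}
    row = [defaults.get(col, fill) for col in columns]
    return [list(row) for _ in range(n_hits)]


# Hypothetical column names for illustration only.
columns = ["ID", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Similarity"]
rows = no_match_rows(columns, defaults={"Similarity": 0})
```

Passing `rows` and `columns` to `pd.DataFrame(rows, columns=columns)` then cannot fail on width, whatever the number of columns is for a given database.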

Is BOLD blocking BOLDigger

Hi Dom!

Me and some lab mates have been trying to use BOLDigger this week without much luck, unfortunately. I haven't had any trouble in the past, but now I can't get the program to run without crashing. It always produces the same error: "too many 500 error responses". I assume that this means the connection is timing out - or that BOLD itself is blocking my IP from accessing the ID Engine. I've tried multiple different batch sizes in case the issue is a time-out error related to a too-large batch size. But alas, I get the same error even with a batch size of 1. I have also logged into BOLD and manually submitted ID Engine requests ranging from 1-20 sequences at a time with no issue, so I don't think it's a problem with my firewall or internet connection. I've even tried it from various locations (different networks), different computers, on VPN vs off. None of it works!

All of this to say - are you experiencing the same problems? Do you have any idea what's going on? Perhaps I'm missing something simple!
