shawhahnlab / igseq

IgSeq utilities

Python 95.88% Shell 4.12%

bioinformatics immunology

igseq's Issues

Exceptions in the igblast stdin thread need to be handled

If an error occurs inside the thread that feeds input to igblast (e.g. "igseq.util.IgSeqError: format not detected from filename (foo.bar). specify a format manually.") it currently goes nowhere. Such exceptions should be passed along to the main thread.
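One possible way to surface those errors, sketched here with the standard library only: the worker catches its own exception and hands it to the main thread through a queue. The names feed_input and run_with_thread are hypothetical, and ValueError stands in for IgSeqError; this is not how igseq is actually structured.

```python
import queue
import threading


def feed_input(records, sink, errors):
    """Worker: copy records into sink, reporting any exception via errors."""
    try:
        for rec in records:
            if rec is None:
                # stand-in for something like IgSeqError in the real code
                raise ValueError("bad record")
            sink.append(rec)
    except Exception as err:
        # don't let the exception die with the thread; pass it along
        errors.put(err)


def run_with_thread(records):
    """Run the worker and re-raise its exception (if any) in the caller."""
    sink = []
    errors = queue.Queue()
    worker = threading.Thread(target=feed_input, args=(records, sink, errors))
    worker.start()
    worker.join()
    if not errors.empty():
        raise errors.get()
    return sink
```

With this pattern the caller sees the original exception type, so the usual command-line error handling can apply to it.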

igseq msa crashes with only one input sequence

igseq msa crashes when asked to make an MSA of just a single sequence, because the underlying call to MUSCLE segfaults. It should just bypass the call to MUSCLE for a single input sequence as it already does for no input sequences.
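A minimal sketch of the bypass, assuming sequences are held as a dict of ID to sequence and that the aligner is wrapped in a callable (both are assumptions for illustration; align and run_aligner are hypothetical names):

```python
def align(seqs, run_aligner):
    """Align seqs, skipping the external aligner when there's nothing to align."""
    if len(seqs) < 2:
        # zero or one sequence: the "alignment" is just the input as-is
        return dict(seqs)
    return run_aligner(seqs)
```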

Identity calculation should disregard gaps and uppercase/lowercase

In using the new identity command I realized it should ignore gaps and uppercase/lowercase differences when comparing sequences.
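As a rough illustration of the intended behavior, the comparison could normalize both sequences first. This sketch assumes identity is a simple position-by-position match fraction over the longer sequence, which may not be how the identity command actually scores things; normalize and identity here are hypothetical helpers.

```python
def normalize(seq, gaps="-."):
    """Uppercase a sequence and drop gap characters."""
    return "".join(c for c in seq.upper() if c not in gaps)


def identity(seq1, seq2):
    """Fraction of matching positions after normalizing both sequences."""
    a, b = normalize(seq1), normalize(seq2)
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))
```

For example, identity("AC-TG", "actg") gives 1.0 because the gap and the case difference are both ignored.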

Tree color-coding should support defining sequence sets by sequence content

As a special case of the tip label color-coding for .nex tree output, igseq tree should support defining color-coded sequence sets by sequence content provided in an alignment MSA FASTA.

Broken pipes should be handled quietly in command-line interface

Usage like igseq subcommand | somethingelse tends to crash with BrokenPipeError if the somethingelse stops reading early. Instead it should just quietly catch the exception and quit.

Handling this correctly might be a little trickier than I would have thought:

https://stackoverflow.com/a/69833114
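For reference, the approach described in the Python signal module documentation looks roughly like the sketch below: catch BrokenPipeError, then point the stream's file descriptor at devnull so the flush at interpreter exit doesn't raise a second time. The wrapper name run_quietly and its stream parameter are made up for this example.

```python
import os
import sys


def run_quietly(func, stream=None):
    """Call func(), exiting quietly if the reader of stream went away."""
    stream = sys.stdout if stream is None else stream
    try:
        func()
        # flush here so a broken pipe is raised inside the try block
        stream.flush()
    except BrokenPipeError:
        # redirect remaining output to devnull so the final flush at
        # shutdown doesn't trigger another BrokenPipeError
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, stream.fileno())
        os.close(devnull)
        sys.exit(1)
```

The exit code of 1 follows the documentation's example; whether a quiet exit here should count as success or failure is a separate choice.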

NEXUS output files could include taxon set definitions

The NEXUS format supports a sets block that could store the sets defined in a call to igseq tree. Could be handy to explicitly record the sets (instead of just implicitly via node colors) though I'm not sure what existing tools make use of that block.
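A sketch of what writing that block might look like, assuming the sets are available as a dict of set name to sequence IDs. The function names are hypothetical and the quoting rule is a conservative guess at NEXUS token handling rather than a full implementation of the format.

```python
import re


def nexus_token(name):
    """Quote a name unless it is a plain alphanumeric token."""
    if re.fullmatch(r"[A-Za-z0-9.]+", name):
        return name
    # single quotes inside a quoted NEXUS token are doubled
    return "'" + name.replace("'", "''") + "'"


def format_sets_block(seq_sets):
    """Format a NEXUS sets block with one taxset line per sequence set."""
    lines = ["begin sets;"]
    for set_name, seq_ids in seq_sets.items():
        taxa = " ".join(nexus_token(s) for s in seq_ids)
        lines.append(f"    taxset {nexus_token(set_name)} = {taxa};")
    lines.append("end;")
    return "\n".join(lines)
```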

Remove ORF sequences from SONARRamesh rhesus germline set

I'd originally left rhesus germline V genes with the "ORF" prefix in place as SONAR has them, but I should probably remove them. IgDiscover will very occasionally detect transcribed antibody recombinations for V genes labeled as non-functional ORFs in Ramesh et al. 2017's rhesus reference (technically correctly, as far as I can tell, based on the removed introns). These are still presumably non-functional from a protein standpoint, though, so we'd rather not have them reported in the output. The names also cause trouble for some of SONAR's parsing code. Removing them would make the germline sequences here consistently limited to functional cases.

ORF_IGHV3-AHA-X*01
ORF_IGHV3-AHG-X*01
ORF_IGHV3-AHH-X*01
ORF_IGKV1-AES-X*01
ORF_IGKV2-AEQ-X*01
ORF_IGKV2-AET-X*01
ORF_IGKV2-AET-X*03
ORF_IGKV2-AET-X*04
ORF_IGLV7-ADJ*01
ORF_IGLV7-ADJ*02
ORF_IGLV8-AEE-X*01
ORF_IGLV8-AEE-X*02
ORF_IGLV8-AEE-X*03

show command should support stdin and explicit input formats

The show command should support reading from stdin and accepting explicit input formats like the convert command does.

getreads fails with bcl2fastq > 2.17

Versions of bcl2fastq > 2.17 don't support --demultiplexing-threads so getreads crashes. There's nothing I can see in the release notes about that but it's mentioned by NIH.

Probably should just remove that argument entirely here. Even the latest bcl2fastq is 5+ years old at this point.

summarize command should report detected regions

A column giving shorthand for detected V/D/J/C regions present in each sequence would be helpful.

igblast command should stream input/output

The igblast command gathers up output in one big string instead of processing it as it comes, potentially using up way too much RAM. This should instead use subprocess' Popen class (like the trimming code already does).
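The general streaming pattern with Popen might look like this. It is a generic sketch, not the existing trimming code, and stream_lines is a hypothetical name; it only covers the output side.

```python
import subprocess


def stream_lines(cmd):
    """Yield a command's stdout line by line instead of buffering it all."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # leaving the with block waits for the process, so returncode is set
    if proc.returncode:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
```

Because this is a generator, each line can be parsed and written out as it arrives, so memory use stays flat regardless of how much output the command produces.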

Color-coding in trees should default to simple set handling

I made the tree color-coding logic to help summarize what sequences are present in different overlapping sets (e.g. before and after IgDiscover), so by default it merges colors across sets for sequences in multiple sets. But most of the time it's much less confusing to just have colors map directly to sequences with some default for the multi-set case. That should be the default, with the color-merging a separate option.

Stderr/stdout should be shown if igblast crashes

Right now if a subprocess.CalledProcessError is raised because of a nonzero exit code from igblastn, the error message is inside the stderr attribute of the exception, but it's never actually shown so the cause of the problem isn't clear. The stdout/stderr attributes of the raised exception should be written to the respective streams before the exception is re-raised.
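Something along these lines, assuming the command is run with captured output (run_and_report is a hypothetical wrapper, not an existing igseq function):

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd; on failure, show its captured output before re-raising."""
    try:
        return subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as err:
        # the captured text lives on the exception; write it to the
        # matching streams so the cause of the failure is visible
        if err.stdout:
            sys.stdout.write(err.stdout)
        if err.stderr:
            sys.stderr.write(err.stderr)
        raise
```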

Expected igblastn output should be synced with pinned installed version

I made test output files with IgBLAST 1.19.0, but IgBLAST 1.21.0 added new columns to its AIRR output (sequence_aa, d_frame) and in general its output does change between minor releases. This is making the igblast.sh example script fail when it compares expected and actual TSV outputs. I should pin igblast at a specific version here and make the saved outputs here match that version.

summarize command should follow same species-based default as igblast command

Same as #38, but for summarize as well as igblast.

Uppercase file extensions should be supported in show command

Currently a filename like file.txt is recognized by the show command but file.TXT is not. It should be case-insensitive.
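The lookup could simply lowercase the suffix before matching. The mapping below is a made-up stand-in for whatever table the show command really uses, and infer_format is a hypothetical name.

```python
from pathlib import Path

# hypothetical extension table, for illustration only
FORMATS = {".txt": "txt", ".csv": "csv", ".fa": "fa", ".fasta": "fa"}


def infer_format(path):
    """Return the format for a filename, ignoring case in the extension."""
    return FORMATS.get(Path(path).suffix.lower())
```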

.afa should be recognized as a FASTA format

.afa (used in some SONAR filenames for example) should be recognized by default as a filename extension for FASTA.

Add alleles from 10.4049/jimmunol.1800342

Another rhesus antibody V+J gene dataset was published in 2019 that we should add here. I have the sequences ready at
https://github.com/ressy/paper-helper-zhang-2019

Support extra command-line arguments for bcl2fastq in getreads command

It should be possible to specify custom bcl2fastq arguments (e.g. --with-failed-reads) to be added to the command. Should be easy enough by following the pattern already used for the igblast command.

record writer crashes on unexpected sequence_description key

When RecordWriter tries to write a record containing sequence_description, and the first record didn't have a description, it crashes because there was no sequence_description in the initial field names. This bug showed up in #51.

For example:

$ echo -e '>seq1 desc\nACTG\n>seq2\nACTG' | igseq convert - - --input-format fa --output-format csv
sequence_id,sequence,sequence_description
seq1,ACTG,desc
seq2,ACTG,
$ echo -e '>seq1\nACTG\n>seq2 desc\nACTG' | igseq convert - - --input-format fa --output-format csv
sequence_id,sequence
seq1,ACTG
Traceback (most recent call last):
  File "/home/jesse/miniconda3/envs/igseqhelper/bin/igseq", line 10, in <module>
    sys.exit(main())
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/__main__.py", line 89, in main
    args.func(args)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/__main__.py", line 225, in _main_convert
    convert.convert(
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/convert.py", line 23, in convert
    writer.write(record)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/record.py", line 258, in write
    self.writer.writerow(record)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/csv.py", line 154, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/csv.py", line 149, in _dict_to_list
    raise ValueError("dict contains fields not in fieldnames: "
ValueError: dict contains fields not in fieldnames: 'sequence_description'
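The underlying csv.DictWriter behavior can be reproduced and sidestepped outside of igseq. One option, sketched below, is to declare the optional column up front and let restval fill it in when a record lacks it. This is only one possible fix and not necessarily what RecordWriter should do; write_records is a hypothetical function.

```python
import csv


def write_records(records, handle):
    """Write records as CSV with the description column always declared."""
    fields = ["sequence_id", "sequence", "sequence_description"]
    writer = csv.DictWriter(
        handle, fieldnames=fields, restval="", lineterminator="\n")
    writer.writeheader()
    for rec in records:
        # records without a description get an empty cell via restval
        writer.writerow(rec)
```

The tradeoff is that the sequence_description column is then always written, even when no record has a description, which changes the output shown in the first example above only in the header-less-description case.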

A cut feature would be handy

An igseq cut subcommand could allow a simple interface to extract different antibody regions (e.g. FWR1-FWR3 to get V genes truncated before CDR3). This should be fairly straightforward if based on the IgBLAST AIRR TSV columns that give start and end positions relative to the query sequence.
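The core of it might be as small as the sketch below, which relies on the AIRR convention of 1-based, inclusive coordinates in columns like fwr1_start and fwr3_end. The function name cut_region is hypothetical, and this ignores missing values and any rev_comp handling.

```python
def cut_region(row, start_region="fwr1", end_region="fwr3"):
    """Slice a query sequence using AIRR start/end columns from one row."""
    # AIRR coordinates are 1-based and inclusive on both ends
    start = int(row[f"{start_region}_start"])
    end = int(row[f"{end_region}_end"])
    return row["sequence"][start - 1:end]
```

Given a row with sequence AACCGGTT, fwr1_start 3, and fwr3_end 6, this returns CCGG.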

Built-in references with missing segments should be skipped with warning in vdj-match

If one of the built-in VDJ references is picked up as a reference for vdj-match, but is missing a segment (specifically zhang2019 with just V and J), the vdj-match call fails, which doesn't make much sense. Instead it should be skipped with a warning.

Germline reference "bernat2021" should be replaced with latest KIMDB sequences

The rhesus gene references from the original publication have now been expanded on and corrected in KIMDB v1.1. This makes more sense for a built-in reference than the original set.

igblast with species given should default to all built-in references for the species

igseq igblast -S rhesus -Q ... fails, but logically it should be equivalent to igseq igblast -S rhesus -r germ/rhesus -Q ....

Tree output should support including FigTree options

FigTree saves a custom block at the end of its NEXUS output with its display settings. igseq tree should support writing those options into its output too, for .nex files.

summarize crashes with multiple references

It doesn't take into account the modified sequence IDs from the IgBLAST database.

tree color-coding crashes when exactly one set is defined

When there's exactly one set of sequences defined, make_seq_set_colors crashes:

$ igseq tree -P '.*' --input-format newick <(echo '((A:1,B:1):1)') out.nex
$ igseq tree -P A --input-format newick <(echo '((A:1,B:1):1)') out.nex
Traceback (most recent call last):
  File "/home/jesse/miniconda3/envs/igseqhelper/bin/igseq", line 10, in <module>
    sys.exit(main())
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/__main__.py", line 89, in main
    args.func(args)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/__main__.py", line 257, in _main_tree
    tree.tree(
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/tree.py", line 137, in tree
    seq_colors = color_seqs(seq_ids, seq_sets_combo, colors)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/tree.py", line 296, in color_seqs
    seq_set_colors_combo = make_seq_set_colors(seq_sets)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/tree.py", line 317, in make_seq_set_colors
    subset = [int( a * (num-1) / (len(seq_sets)-1) ) for a in range(num)]
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/tree.py", line 317, in <listcomp>
    subset = [int( a * (num-1) / (len(seq_sets)-1) ) for a in range(num)]
ZeroDivisionError: division by zero
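The division by len(seq_sets)-1 is what fails when there's a single set. As a general illustration of guarding that kind of evenly-spaced index calculation (this is a standalone hypothetical helper, not the actual make_seq_set_colors logic):

```python
def spaced_indexes(num_items, num_picks):
    """Pick num_picks evenly spaced indexes into a list of num_items."""
    if num_picks < 1:
        return []
    if num_picks == 1:
        # a single pick has no spacing to compute; avoid dividing by zero
        return [0]
    return [int(a * (num_items - 1) / (num_picks - 1)) for a in range(num_picks)]
```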

Sequence descriptions aren't handled

The logic for handling FASTA definition lines beyond the ID (after the first space) was never fully implemented, so things like igseq convert --col-seq-desc ... do nothing. (I never tested for this?) The exact definitions of "sequence ID" versus "sequence description" versus "sequence definition" are also not clear at all.

PhiX mapping counts path should be cast to Path object

The command igseq phix ... -c something.counts.csv doesn't actually work because counts_out is treated as a Path object in the code but left as a string as passed from the command line. It should probably be cast as a Path immediately in the phix function.

trim should allow custom adapters

trim should allow for custom constant (rather than varying per-sample) adapter sequences, instead of only the ones inferred from known barcode and adapter+primer sequences. This would make it usable on datasets prepped with other protocols as a more generic cutadapt wrapper.

Custom sequence description columns are ignored

When converting to tabular output, custom sequence description column names are ignored.

In record.py, this:

record["sequence_description"] = seq_desc

should be:

record[self.colmap["sequence_description"]] = seq_desc

Custom columns should be supported for convert command

Currently the column names for tabular input and/or output are stored in a dictionary for flexibility but that isn't actually linked up to the convert command's arguments. These should be command-line arguments. (Easy use case: pull out junction sequences from AIRR TSV)

Extra igblast arguments can clash with argparse

igblastn uses single dashes in front of arguments, but those can be misinterpreted by argparse as a single-letter argument followed by an option with no space. For example -num_alignments_V 5 gives argument -n/--dry-run: ignored explicit argument 'um_alignments_V'. I can't see how to handle that from the argparse side but a simple workaround would be to accept two dashes for igblastn arguments and then remove the extra dash internally.
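A sketch of that workaround using parse_known_args, with a toy parser that only defines -n/--dry-run. The function name split_extra_args is made up, and the real igseq parser has many more options than this.

```python
import argparse
import re


def split_extra_args(argv):
    """Parse known options and convert leftover --args to igblastn's -args."""
    # allow_abbrev=False keeps argparse from matching unknown long options
    # against prefixes of its own options
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("-n", "--dry-run", action="store_true")
    args, extra = parser.parse_known_args(argv)
    # strip one dash from anything that looks like a double-dashed option
    extra = [re.sub(r"^--(?=[A-Za-z])", "-", arg) for arg in extra]
    return args, extra
```

With this, --num_alignments_V 5 comes back as -num_alignments_V 5 in the extra arguments, ready to append to the igblastn command, while -n is still recognized as the dry-run flag.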

identity command ignores sequence ID column name for tabular input

The identity command uses a hardcoded reference to the sequence_id column and ignores any custom column like from --col-seq-id. (The existing tests check a custom sequence column, which works, but not a custom sequence ID column, which evidently doesn't.)

Duplicate inferred FASTA paths should only be handled once

vdj-gather can take fragments of builtin file paths as input, so you could do, for example:

igseq vdj-gather sonarramesh/IGK sonarramesh/IGH/IGHD -o igdiscover-db-start

...and get IGK genes plus a placeholder D file. But something like this results in D sequences being written twice:

igseq vdj-gather sonarramesh/IGH sonarramesh/IGH/IGHD -o igdiscover-db-start

Instead vdj.parse_vdj_paths should condense duplicate FASTA file paths internally.
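An order-preserving de-duplication of the inferred paths could look like this (unique_paths is a hypothetical helper; how vdj.parse_vdj_paths stores its results internally is not shown here, so this just works on a plain list):

```python
from pathlib import Path


def unique_paths(paths):
    """Drop repeated paths, keeping the first occurrence of each in order."""
    seen = set()
    unique = []
    for path in paths:
        # compare resolved paths so equivalent spellings count as duplicates
        key = Path(path).resolve()
        if key not in seen:
            seen.add(key)
            unique.append(Path(path))
    return unique
```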

show command should support newick tree files

There are packages available (like python-newick) that support easily pretty-printing tree topology from newick-format trees. This could be a handy feature for the show command.
