shawhahnlab / igseq
IgSeq utilities
If an error occurs during igblast inside the thread handling the input (e.g. "igseq.util.IgSeqError: format not detected from filename (foo.bar). specify a format manually.") it currently goes nowhere. These errors should be passed along to the main thread.
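One possible approach, sketched here on the assumption that the input is fed from a plain threading.Thread: a thread subclass that records any exception and re-raises it in the caller on join(). PropagatingThread and feed_input are hypothetical names, not part of igseq.

```python
import threading


class PropagatingThread(threading.Thread):
    """Thread that stores an exception from its target and re-raises it on join()."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.exc = None

    def run(self):
        try:
            super().run()
        except BaseException as err:  # keep it for the thread that joins us
            self.exc = err

    def join(self, timeout=None):
        super().join(timeout)
        if self.exc is not None:
            raise self.exc


def feed_input():
    # Stand-in for the real input-handling code (hypothetical)
    raise ValueError("format not detected from filename (foo.bar)")


thread = PropagatingThread(target=feed_input)
thread.start()
try:
    thread.join()
except ValueError as err:
    print(f"caught in main thread: {err}")
```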
igseq msa crashes when asked to make an MSA of just a single sequence, because the underlying call to MUSCLE segfaults. It should just bypass the call to MUSCLE for a single input sequence as it already does for no input sequences.
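A minimal sketch of that bypass, assuming records are held as (ID, sequence) pairs and that the MUSCLE call is wrapped in a callable. Both align and run_muscle are hypothetical names used only for illustration.

```python
def align(records, run_muscle):
    """Return aligned records, skipping the aligner when there is nothing to align.

    records: list of (seq_id, seq) pairs
    run_muscle: hypothetical callable that wraps the external MUSCLE call
    """
    if len(records) < 2:
        # Zero or one sequence: already trivially "aligned"
        return list(records)
    return run_muscle(records)
```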
In using the new identity command I realized it should ignore gaps and uppercase/lowercase differences when comparing sequences.
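As a rough illustration, the sequences could be normalized before they are compared. This sketch assumes gaps are written as "-" or "."; normalize_seq is a hypothetical helper, and how the identity calculation itself works is left untouched.

```python
def normalize_seq(seq, gap_chars="-."):
    """Strip gap characters and uppercase a sequence before comparison."""
    return seq.translate(str.maketrans("", "", gap_chars)).upper()


# Two spellings of the same sequence compare equal once normalized
print(normalize_seq("ac-TG.") == normalize_seq("ACTG"))
```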
As a special case of the tip label color-coding for .nex tree output, igseq tree should support defining color-coded sequence sets by sequence content provided in an alignment MSA FASTA.
Usage like igseq subcommand | somethingelse tends to crash with BrokenPipeError if the somethingelse stops reading early. Instead it should just quietly catch the exception and quit.
Handling this correctly might be a little trickier than I would have thought:
The NEXUS format supports a sets block that could store the sets defined in a call to igseq tree. Could be handy to explicitly record the sets (instead of just implicitly via node colors) though I'm not sure what existing tools make use of that block.
I'd originally left rhesus germline V genes with the "ORF" prefix in place as SONAR has them, but I should probably remove them. IgDiscover will very occasionally detect (technically correctly, as far as I can tell, based on the removed introns) transcribed antibody recombinations for V genes labeled non-functional ORFs in Ramesh et al. 2017's rhesus reference, but these are still presumably non-functional from a protein standpoint, so we'd rather not have them reported in the output. The names also cause trouble for some of SONAR's parsing code. Removing them would make the germline sequences here consistently limited to functional cases. The affected entries are:
ORF_IGHV3-AHA-X*01
ORF_IGHV3-AHG-X*01
ORF_IGHV3-AHH-X*01
ORF_IGKV1-AES-X*01
ORF_IGKV2-AEQ-X*01
ORF_IGKV2-AET-X*01
ORF_IGKV2-AET-X*03
ORF_IGKV2-AET-X*04
ORF_IGLV7-ADJ*01
ORF_IGLV7-ADJ*02
ORF_IGLV8-AEE-X*01
ORF_IGLV8-AEE-X*02
ORF_IGLV8-AEE-X*03
The show command should support reading from stdin and accepting explicit input formats like the convert command does.
Versions of bcl2fastq > 2.17 don't support --demultiplexing-threads so getreads crashes. There's nothing I can see in the release notes about that but it's mentioned by NIH.
Probably should just remove that argument entirely here. Even the latest bcl2fastq is 5+ years old at this point.
A column giving shorthand for detected V/D/J/C regions present in each sequence would be helpful.
The igblast command gathers up output in one big string instead of processing it as it comes, potentially using up way too much RAM. This should instead use subprocess' Popen class (like the trimming code already does).
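A sketch of the streaming pattern with subprocess.Popen, not the actual igseq code: stream_lines is a hypothetical helper, and a small Python child process stands in for igblastn so the example runs anywhere.

```python
import subprocess
import sys


def stream_lines(cmd):
    """Yield a command's stdout one line at a time instead of buffering it all."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line
    # Leaving the with block waits for the process, so returncode is set here
    if proc.returncode:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


# Stand-in command; a real call would pass the igblastn command line instead
cmd = [sys.executable, "-c", "print('row1'); print('row2')"]
for line in stream_lines(cmd):
    print(line.rstrip("\n"))
```

Only one line is held in memory at a time, so memory use no longer grows with the size of the output.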
I made the tree color-coding logic to help summarize what sequences are present in different overlapping sets (e.g. before and after IgDiscover), so by default it merges colors across sets for sequences in multiple sets. But most of the time it's much less confusing to just have colors map directly to sequences with some default for the multi-set case. That should be the default, with the color-merging a separate option.
Right now if a subprocess.CalledProcessError is raised because of a nonzero exit code from igblastn, the error message is inside the stderr attribute of the exception, but it's never actually shown so the cause of the problem isn't clear. The stdout/stderr attributes of the raised exception should be written to the respective streams before the exception is re-raised.
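A minimal sketch of that behavior, assuming the command is run through subprocess.run with captured output; run_checked is a hypothetical wrapper name rather than an existing igseq function.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run cmd; on a nonzero exit, echo its captured output, then re-raise."""
    try:
        return subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as err:
        # The captured text lives on the exception, so show it before re-raising
        if err.stdout:
            sys.stdout.write(err.stdout)
        if err.stderr:
            sys.stderr.write(err.stderr)
        raise
```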
I made test output files with IgBLAST 1.19.0, but IgBLAST 1.21.0 added new columns to its AIRR output (sequence_aa, d_frame) and in general its output does change between minor releases. This is making the igblast.sh example script fail when it compares expected and actual TSV outputs. I should pin igblast at a specific version here and make the saved outputs here match that version.
Same as #38, but for summarize as well as igblast.
Currently a filename like file.txt is recognized by the show command but file.TXT is not. It should be case-insensitive.
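One way this could look, as a sketch: lowercase the suffix before looking it up. The EXT_FORMATS table and the infer_format name are hypothetical and much smaller than whatever igseq really uses.

```python
from pathlib import Path

# Hypothetical extension-to-format table, for illustration only
EXT_FORMATS = {".txt": "txt", ".csv": "csv", ".fa": "fa", ".fasta": "fa"}


def infer_format(path):
    """Look up a format by file extension, ignoring case; None if unknown."""
    return EXT_FORMATS.get(Path(path).suffix.lower())


print(infer_format("file.txt"), infer_format("file.TXT"))
```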
.afa (used in some SONAR filenames for example) should be recognized by default as a filename extension for FASTA.
Another rhesus antibody V+J gene dataset was published in 2019 that we should add here. I have the sequences ready at
https://github.com/ressy/paper-helper-zhang-2019
It should be possible to specify custom bcl2fastq arguments (e.g. --with-failed-reads) to be added to the command. Should be easy enough by following the pattern already used for the igblast command.
When RecordWriter tries to write a record containing sequence_description, and the first record didn't have a description, it crashes because there was no sequence_description in the initial field names. This bug showed up in #51.
For example:
$ echo -e '>seq1 desc\nACTG\n>seq2\nACTG' | igseq convert - - --input-format fa --output-format csv
sequence_id,sequence,sequence_description
seq1,ACTG,desc
seq2,ACTG,
$ echo -e '>seq1\nACTG\n>seq2 desc\nACTG' | igseq convert - - --input-format fa --output-format csv
sequence_id,sequence
seq1,ACTG
Traceback (most recent call last):
  File "/home/jesse/miniconda3/envs/igseqhelper/bin/igseq", line 10, in <module>
    sys.exit(main())
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/__main__.py", line 89, in main
    args.func(args)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/__main__.py", line 225, in _main_convert
    convert.convert(
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/convert.py", line 23, in convert
    writer.write(record)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/record.py", line 258, in write
    self.writer.writerow(record)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/csv.py", line 154, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/csv.py", line 149, in _dict_to_list
    raise ValueError("dict contains fields not in fieldnames: "
ValueError: dict contains fields not in fieldnames: 'sequence_description'
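The underlying csv.DictWriter behavior can be reproduced outside igseq. The sketch below is not the RecordWriter code; write_records is a hypothetical helper. It shows that field names taken from the first record alone trigger the same ValueError, while declaring the optional column up front lets records without a description be written with an empty value.

```python
import csv
import io


def write_records(records, fieldnames):
    """Write dict records as CSV text using a fixed list of field names."""
    out = io.StringIO()
    # restval fills in columns that a given record lacks
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)
    return out.getvalue()


records = [
    {"sequence_id": "seq1", "sequence": "ACTG"},
    {"sequence_id": "seq2", "sequence": "ACTG", "sequence_description": "desc"},
]

try:
    write_records(records, list(records[0]))  # field names from first record only
except ValueError as err:
    print(f"fails as in the traceback: {err}")

print(write_records(records, ["sequence_id", "sequence", "sequence_description"]))
```

Declaring the column up front changes the output header even when no record has a description, so whether that trade-off is acceptable is a design decision for the real fix.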
An igseq cut subcommand could allow a simple interface to extract different antibody regions (e.g. FWR1-FWR3 to get V genes truncated before CDR3). This should be fairly straightforward if based on the IgBLAST AIRR TSV columns that give start and end positions relative to the query sequence.
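A rough sketch of the core slicing step, assuming one AIRR row is available as a dict (for example from csv.DictReader) with the standard fwr1_start / fwr3_end style columns, which the AIRR format defines as 1-based inclusive positions. cut_region is a hypothetical name, and the sketch assumes the row's coordinates apply directly to its sequence field, so reverse-complemented queries are not handled.

```python
def cut_region(row, first_region, last_region):
    """Slice the query sequence from the start of one region to the end of another.

    row: dict for one AIRR TSV row
    first_region, last_region: column prefixes such as "fwr1" and "fwr3"
    """
    start = int(row[f"{first_region}_start"])  # 1-based, inclusive
    end = int(row[f"{last_region}_end"])       # 1-based, inclusive
    return row["sequence"][start - 1:end]


# Toy row with made-up coordinates
row = {"sequence": "AACCGGTT", "fwr1_start": "3", "fwr3_end": "6"}
print(cut_region(row, "fwr1", "fwr3"))
```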
If one of the built-in VDJ references is picked up as a reference for vdj-match, but is missing a segment (specifically zhang2019 with just V and J), the vdj-match call fails, which doesn't make much sense. Instead it should be skipped with a warning.
The rhesus gene references from the original publication have now been expanded on and corrected in KIMDB v1.1. This makes more sense for a built-in reference than the original set.
igseq igblast -S rhesus -Q ... fails, but logically it should be equivalent to igseq igblast -S rhesus -r germ/rhesus -Q ...
FigTree saves a custom block at the end of its NEXUS output with its display settings. igseq tree should support writing those options into its output too, for .nex files.
It doesn't take into account the modified sequence IDs from the IgBLAST database.
When there's exactly one set of sequences defined, make_seq_set_colors crashes:
$ igseq tree -P '.*' --input-format newick <(echo '((A:1,B:1):1)') out.nex
$ igseq tree -P A --input-format newick <(echo '((A:1,B:1):1)') out.nex
Traceback (most recent call last):
  File "/home/jesse/miniconda3/envs/igseqhelper/bin/igseq", line 10, in <module>
    sys.exit(main())
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/__main__.py", line 89, in main
    args.func(args)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/__main__.py", line 257, in _main_tree
    tree.tree(
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/tree.py", line 137, in tree
    seq_colors = color_seqs(seq_ids, seq_sets_combo, colors)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/tree.py", line 296, in color_seqs
    seq_set_colors_combo = make_seq_set_colors(seq_sets)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/tree.py", line 317, in make_seq_set_colors
    subset = [int( a * (num-1) / (len(seq_sets)-1) ) for a in range(num)]
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/tree.py", line 317, in <listcomp>
    subset = [int( a * (num-1) / (len(seq_sets)-1) ) for a in range(num)]
ZeroDivisionError: division by zero
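The division fails because len(seq_sets)-1 is zero for a single set. One possible guard, shown here on a simplified stand-in rather than on the real make_seq_set_colors (spaced_indexes and its parameters are hypothetical), is to special-case a single pick before spacing indexes evenly.

```python
def spaced_indexes(num_items, num_picks):
    """Pick num_picks indexes spread evenly across range(num_items)."""
    if num_picks == 1:
        # Nothing to spread out, and the general formula would divide by zero
        return [0]
    return [int(a * (num_items - 1) / (num_picks - 1)) for a in range(num_picks)]


print(spaced_indexes(10, 1))
print(spaced_indexes(10, 3))
```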
The logic for handling FASTA definition lines beyond the ID (after the first space) was never fully implemented, so things like igseq convert --col-seq-desc ... do nothing. (I never tested for this?) The exact definitions of "sequence ID" versus "sequence description" versus "sequence definition" are also not clear at all.
The command igseq phix ... -c something.counts.csv doesn't actually work because counts_out is treated as a Path object in the code but left as a string as passed from the command line. It should probably be cast as a Path immediately in the phix function.
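A minimal sketch of that cast, assuming the argument may also be omitted; as_optional_path is a hypothetical helper, not the real phix function.

```python
from pathlib import Path


def as_optional_path(value):
    """Convert a command-line string to a Path, passing None through unchanged."""
    return None if value is None else Path(value)


counts_out = as_optional_path("something.counts.csv")
print(counts_out.suffix)  # Path attributes work now that it is no longer a str
```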
trim should allow for custom constant (rather than varying per-sample) adapter sequences, instead of only the ones inferred from known barcode and adapter+primer sequences. This would make it usable on datasets prepped with other protocols as a more generic cutadapt wrapper.
When converting to tabular output, custom sequence description column names are ignored.
In record.py, this:
record["sequence_description"] = seq_desc
should be:
record[self.colmap["sequence_description"]] = seq_desc
Currently the column names for tabular input and/or output are stored in a dictionary for flexibility but that isn't actually linked up to the convert command's arguments. These should be command-line arguments. (Easy use case: pull out junction sequences from AIRR TSV)
igblastn uses single dashes in front of arguments, but those can be misinterpreted by argparse as a single-letter argument followed by an option with no space. For example -num_alignments_V 5 gives argument -n/--dry-run: ignored explicit argument 'um_alignments_V'. I can't see how to handle that from the argparse side but a simple workaround would be to accept two dashes for igblastn arguments and then remove the extra dash internally.
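The dash-stripping half of that workaround could look roughly like this. It assumes the pass-through arguments have already been collected into a list; to_igblastn_args is a hypothetical name.

```python
def to_igblastn_args(extra_args):
    """Turn --name style arguments into the -name style that igblastn expects."""
    return [arg[1:] if arg.startswith("--") else arg for arg in extra_args]


print(to_igblastn_args(["--num_alignments_V", "5"]))
```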
The identity command uses a hardcoded reference to the sequence_id column and ignores any custom column like from --col-seq-id. (The existing tests check a custom sequence column, which works, but not a custom sequence ID column, which evidently doesn't.)
vdj-gather can take fragments of builtin file paths as input, so you could do, for example:
igseq vdj-gather sonarramesh/IGK sonarramesh/IGH/IGHD -o igdiscover-db-start
...and get IGK genes plus a placeholder D file. But something like this results in D sequences being written twice:
igseq vdj-gather sonarramesh/IGH sonarramesh/IGH/IGHD -o igdiscover-db-start
Instead vdj.parse_vdj_paths should condense duplicate FASTA file paths internally.
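A sketch of order-preserving de-duplication, assuming the function ends up with a flat list of FASTA paths; dedupe_paths and the example file names are hypothetical.

```python
from pathlib import Path


def dedupe_paths(paths):
    """Drop repeated paths while keeping first-seen order."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(Path(p) for p in paths))


paths = ["IGH/IGHV.fasta", "IGH/IGHD.fasta", "IGH/IGHJ.fasta", "IGH/IGHD.fasta"]
print(dedupe_paths(paths))
```

This compares paths as written, so two different spellings of the same file (for example a relative and an absolute path) would still count as distinct.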
There are packages available (like python-newick) that support easily pretty-printing tree topology from newick-format trees. This could be a handy feature for the show command.