igseq's Issues

Exceptions in the igblast stdin thread need to be handled

If an error occurs inside the thread that handles igblast's input (e.g. "igseq.util.IgSeqError: format not detected from filename (foo.bar). specify a format manually."), it is currently lost silently. Such exceptions should be passed along to the main thread.
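
One common way to do this, sketched here under the assumption that the input is fed from a plain threading.Thread, is a small Thread subclass that stores the exception and re-raises it when the main thread calls join(). The names PropagatingThread and feed_input are hypothetical and not taken from igseq's code.

```python
import threading


class PropagatingThread(threading.Thread):
    """Thread that stores an exception from its target and re-raises it on join()."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.error = None

    def run(self):
        try:
            super().run()
        except Exception as err:
            # Keep the exception so the main thread can see it later.
            self.error = err

    def join(self, timeout=None):
        super().join(timeout)
        if self.error is not None:
            raise self.error


def feed_input():
    # Hypothetical stand-in for the function that writes records to igblastn's stdin.
    raise ValueError("format not detected from filename (foo.bar)")
```

With this, starting PropagatingThread(target=feed_input) and then calling join() raises the ValueError in the main thread instead of leaving it behind in the worker.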

igseq msa crashes with only one input sequence

igseq msa crashes when asked to make an MSA of just a single sequence, because the underlying call to MUSCLE segfaults. It should just bypass the call to MUSCLE for a single input sequence as it already does for no input sequences.
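
A minimal sketch of the bypass, assuming the records are available as a list before the aligner is called. The align function and the run_muscle callable are hypothetical stand-ins, not igseq's actual interface.

```python
def align(records, run_muscle):
    """Return aligned records, calling the aligner only when there are 2+ sequences.

    records: list of (seq_id, sequence) tuples
    run_muscle: callable that takes and returns such a list (hypothetical stand-in)
    """
    if len(records) < 2:
        # Zero or one sequence: nothing to align, so pass the input through as-is.
        return list(records)
    return run_muscle(records)
```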

Remove ORF sequences from SONARRamesh rhesus germline set

I'd originally left the rhesus germline V genes with the "ORF" prefix in place, as SONAR has them, but I should probably remove them. IgDiscover will very occasionally detect transcribed antibody recombinations for V genes labeled as non-functional ORFs in Ramesh et al. 2017's rhesus reference (technically correctly, as far as I can tell, based on the removed introns). These are still presumably non-functional from a protein standpoint, though, so we'd rather not have them reported in the output. The names also cause trouble for some of SONAR's parsing code. Removing them would make the germline sequences here consistently limited to functional cases.

ORF_IGHV3-AHA-X*01
ORF_IGHV3-AHG-X*01
ORF_IGHV3-AHH-X*01
ORF_IGKV1-AES-X*01
ORF_IGKV2-AEQ-X*01
ORF_IGKV2-AET-X*01
ORF_IGKV2-AET-X*03
ORF_IGKV2-AET-X*04
ORF_IGLV7-ADJ*01
ORF_IGLV7-ADJ*02
ORF_IGLV8-AEE-X*01
ORF_IGLV8-AEE-X*02
ORF_IGLV8-AEE-X*03
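
Since every ID listed above starts with "ORF_", the filtering itself could be as simple as the following sketch. It assumes the records have already been parsed into (seq_id, sequence) pairs; drop_orf_records is a hypothetical helper name.

```python
def drop_orf_records(records):
    """Keep only (seq_id, sequence) pairs whose ID lacks the "ORF_" prefix."""
    return [(seq_id, seq) for seq_id, seq in records if not seq_id.startswith("ORF_")]
```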

getreads fails with bcl2fastq > 2.17

Versions of bcl2fastq > 2.17 don't support --demultiplexing-threads so getreads crashes. There's nothing I can see in the release notes about that but it's mentioned by NIH.

Probably should just remove that argument entirely here. Even the latest bcl2fastq is 5+ years old at this point.

igblast command should stream input/output

The igblast command gathers up output in one big string instead of processing it as it comes, potentially using up way too much RAM. This should instead use subprocess' Popen class (like the trimming code already does).
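
A rough sketch of the streaming pattern with Popen, not igseq's actual implementation: a helper thread feeds stdin while the caller consumes stdout line by line, so neither side has to hold everything in memory. The function name stream_command is hypothetical, and the sketch assumes the caller reads the generator to the end.

```python
import subprocess
import threading


def stream_command(cmd, input_lines):
    """Yield output lines from cmd as they arrive, feeding input_lines to its stdin."""
    with subprocess.Popen(
            cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True) as proc:

        def feed():
            # Write input in a separate thread so that reading stdout below
            # cannot deadlock against a full stdin pipe.
            try:
                for line in input_lines:
                    proc.stdin.write(line)
            finally:
                proc.stdin.close()

        feeder = threading.Thread(target=feed)
        feeder.start()
        for line in proc.stdout:
            yield line
        feeder.join()
    # Leaving the with block waits for the process, so returncode is set here.
    if proc.returncode:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
```

Note that an exception raised inside feed() stays in the helper thread here, which is the same gap described in the stdin thread issue above.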

Color-coding in trees should default to simple set handling

I made the tree color-coding logic to help summarize what sequences are present in different overlapping sets (e.g. before and after IgDiscover), so by default it merges colors across sets for sequences in multiple sets. But most of the time it's much less confusing to just have colors map directly to sequences with some default for the multi-set case. That should be the default, with the color-merging a separate option.

Stderr/stdout should be shown if igblast crashes

Right now if a subprocess.CalledProcessError is raised because of a nonzero exit code from igblastn, the error message is inside the stderr attribute of the exception, but it's never actually shown so the cause of the problem isn't clear. The stdout/stderr attributes of the raised exception should be written to the respective streams before the exception is re-raised.
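
As a sketch of that behavior, assuming the command is run with subprocess.run and captured output (run_checked is a hypothetical wrapper name):

```python
import subprocess
import sys


def run_checked(cmd):
    """Run cmd; if it exits nonzero, echo its captured output before re-raising."""
    try:
        return subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as err:
        # The captured text only lives on the exception, so write it out
        # to the matching streams before letting the error continue upward.
        if err.stdout:
            sys.stdout.write(err.stdout)
        if err.stderr:
            sys.stderr.write(err.stderr)
        raise
```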

Expected igblastn output should be synced with pinned installed version

I made test output files with IgBLAST 1.19.0, but IgBLAST 1.21.0 added new columns to its AIRR output (sequence_aa, d_frame) and in general its output does change between minor releases. This is making the igblast.sh example script fail when it compares expected and actual TSV outputs. I should pin igblast at a specific version here and make the saved outputs here match that version.

record writer crashes on unexpected sequence_description key

When RecordWriter tries to write a record containing sequence_description, and the first record didn't have a description, it crashes because there was no sequence_description in the initial field names. This bug showed up in #51.

For example:

$ echo -e '>seq1 desc\nACTG\n>seq2\nACTG' | igseq convert - - --input-format fa --output-format csv
sequence_id,sequence,sequence_description
seq1,ACTG,desc
seq2,ACTG,
$ echo -e '>seq1\nACTG\n>seq2 desc\nACTG' | igseq convert - - --input-format fa --output-format csv
sequence_id,sequence
seq1,ACTG
Traceback (most recent call last):
  File "/home/jesse/miniconda3/envs/igseqhelper/bin/igseq", line 10, in <module>
    sys.exit(main())
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/__main__.py", line 89, in main
    args.func(args)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/__main__.py", line 225, in _main_convert
    convert.convert(
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/convert.py", line 23, in convert
    writer.write(record)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/record.py", line 258, in write
    self.writer.writerow(record)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/csv.py", line 154, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/csv.py", line 149, in _dict_to_list
    raise ValueError("dict contains fields not in fieldnames: "
ValueError: dict contains fields not in fieldnames: 'sequence_description'
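
One possible way around this, shown only as a sketch and not necessarily how RecordWriter should be fixed, is to declare the description column up front so a later record can't introduce an unknown field. The trade-off is that the column then always appears in the output, even when no record has a description. FIELDS and write_records are hypothetical names.

```python
import csv

# Assumed fixed column order; the description column is always declared.
FIELDS = ["sequence_id", "sequence", "sequence_description"]


def write_records(records, handle):
    """Write dict records as CSV, leaving missing fields empty."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
```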

A cut feature would be handy

An igseq cut subcommand could provide a simple interface for extracting different antibody regions (e.g. FWR1-FWR3 to get V genes truncated before CDR3). This should be fairly straightforward if based on the IgBLAST AIRR TSV columns that give start and end positions relative to the query sequence.
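
A minimal sketch of the core slicing step, assuming each AIRR row is available as a dict (for example from csv.DictReader) and that the positions are 1-based and inclusive, as the AIRR Rearrangement schema describes them. The function name cut_region is hypothetical, and the sketch ignores rows where the region is missing or the positions are empty.

```python
def cut_region(row, start_col="fwr1_start", end_col="fwr3_end"):
    """Slice the query sequence between two AIRR position columns.

    Positions are assumed to be 1-based and inclusive.
    """
    start = int(row[start_col])
    end = int(row[end_col])
    # Convert the 1-based inclusive start to a 0-based slice start;
    # the inclusive end already equals the exclusive slice end.
    return row["sequence"][start - 1:end]
```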

tree color-coding crashes when exactly one set is defined

When there's exactly one set of sequences defined, make_seq_set_colors crashes:

$ igseq tree -P '.*' --input-format newick <(echo '((A:1,B:1):1)') out.nex
$ igseq tree -P A --input-format newick <(echo '((A:1,B:1):1)') out.nex
Traceback (most recent call last):
  File "/home/jesse/miniconda3/envs/igseqhelper/bin/igseq", line 10, in <module>
    sys.exit(main())
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/__main__.py", line 89, in main
    args.func(args)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/__main__.py", line 257, in _main_tree
    tree.tree(
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/tree.py", line 137, in tree
    seq_colors = color_seqs(seq_ids, seq_sets_combo, colors)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/tree.py", line 296, in color_seqs
    seq_set_colors_combo = make_seq_set_colors(seq_sets)
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/tree.py", line 317, in make_seq_set_colors
    subset = [int( a * (num-1) / (len(seq_sets)-1) ) for a in range(num)]
  File "/home/jesse/miniconda3/envs/igseqhelper/lib/python3.9/site-packages/igseq/tree.py", line 317, in <listcomp>
    subset = [int( a * (num-1) / (len(seq_sets)-1) ) for a in range(num)]
ZeroDivisionError: division by zero
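
The denominator len(seq_sets)-1 is zero when there is exactly one set. The sketch below shows one way to guard that kind of even-spacing calculation. It is a standalone illustration with hypothetical names (spread_indexes, num_items, palette_size), not the actual logic in make_seq_set_colors.

```python
def spread_indexes(num_items, palette_size):
    """Pick num_items indexes spread evenly across range(palette_size)."""
    if num_items < 1:
        return []
    if num_items == 1:
        # A single item has nothing to be spaced against, so take the
        # first index instead of dividing by (num_items - 1) == 0.
        return [0]
    return [int(a * (palette_size - 1) / (num_items - 1)) for a in range(num_items)]
```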

Sequence descriptions aren't handled

The logic for handling FASTA definition lines beyond the ID (after the first space) was never fully implemented, so things like igseq convert --col-seq-desc ... do nothing. (I never tested for this?) The exact definitions of "sequence ID" versus "sequence description" versus "sequence definition" are also not clear at all.

PhiX mapping counts path should be cast to Path object

The command igseq phix ... -c something.counts.csv doesn't actually work because counts_out is treated as a Path object in the code but left as a string as passed from the command line. It should probably be cast as a Path immediately in the phix function.
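
The cast could look something like this sketch. The helper name as_path is hypothetical, and it assumes None is used when no counts file was requested.

```python
from pathlib import Path


def as_path(counts_out):
    """Return counts_out as a Path, or None if no counts file was requested."""
    return Path(counts_out) if counts_out is not None else None
```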

trim should allow custom adapters

trim should allow for custom constant (rather than varying per-sample) adapter sequences, instead of only the ones inferred from known barcode and adapter+primer sequences. This would make it usable on datasets prepped with other protocols as a more generic cutadapt wrapper.

Custom sequence description columns are ignored

When converting to tabular output, custom sequence description column names are ignored.

In record.py, this:

record["sequence_description"] = seq_desc

should be:

record[self.colmap["sequence_description"]] = seq_desc

Custom columns should be supported for convert command

Currently the column names for tabular input and/or output are stored in a dictionary for flexibility but that isn't actually linked up to the convert command's arguments. These should be command-line arguments. (Easy use case: pull out junction sequences from AIRR TSV)

Extra igblast arguments can clash with argparse

igblastn uses single dashes in front of arguments, but those can be misinterpreted by argparse as a single-letter argument followed by an option with no space. For example -num_alignments_V 5 gives argument -n/--dry-run: ignored explicit argument 'um_alignments_V'. I can't see how to handle that from the argparse side but a simple workaround would be to accept two dashes for igblastn arguments and then remove the extra dash internally.
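
A sketch of that workaround, using a stripped-down parser with only the -n/--dry-run option mentioned above. The function name split_igblast_args is hypothetical, and igseq's real parser has more arguments (including positional ones) that this ignores.

```python
import argparse


def split_igblast_args(argv):
    """Parse known arguments and turn leftover --name arguments into -name."""
    parser = argparse.ArgumentParser(prog="igblast-sketch")
    parser.add_argument("-n", "--dry-run", action="store_true")
    # Unrecognized arguments are returned instead of causing an error.
    args, extras = parser.parse_known_args(argv)
    # Remove one leading dash from double-dashed names for igblastn.
    igblast_args = [
        arg[1:] if arg.startswith("--") and len(arg) > 2 else arg
        for arg in extras]
    return args, igblast_args
```

Called with ["--num_alignments_V", "5"], this leaves dry_run unset and hands back the igblastn arguments with a single dash.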

Duplicate inferred FASTA paths should only be handled once

vdj-gather can take fragments of builtin file paths as input, so you could do, for example:

igseq vdj-gather sonarramesh/IGK sonarramesh/IGH/IGHD -o igdiscover-db-start

...and get IGK genes plus a placeholder D file. But something like this results in D sequences being written twice:

igseq vdj-gather sonarramesh/IGH sonarramesh/IGH/IGHD -o igdiscover-db-start

Instead vdj.parse_vdj_paths should condense duplicate FASTA file paths internally.
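
If the paths end up as a flat list after parsing, an order-preserving de-duplication could be sketched like this. The helper name dedupe_paths is hypothetical, and it assumes duplicates show up as equal values (paths spelled differently for the same file would need normalizing first).

```python
def dedupe_paths(paths):
    """Drop repeated paths while keeping the first occurrence of each, in order."""
    # dict keys are unique and keep insertion order.
    return list(dict.fromkeys(paths))
```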
