mobiusklein / psims

A declarative API for writing XML documents for HUPO PSI-MS mzML and mzIdentML

Home Page: https://mobiusklein.github.io/psims/

License: Apache License 2.0

Python 99.93% Makefile 0.07%

xml-serialization mzml mzidentml obo

psims's Introduction

psims

Prototype work for a unified API for writing Proteomics Standards Initiative standardized formats for mass spectrometry:

mzML
mzIdentML
mzMLb

See the Documentation for more information.

Installation

With pip:

pip install psims

With conda:

conda install -c bioconda -c conda-forge -c defaults psims

mzML Minimal Example

from psims.mzml.writer import MzMLWriter

# Load the data to write
scans = get_scan_data()

with MzMLWriter(open("out.mzML", 'wb'), close=True) as out:
    # Add default controlled vocabularies
    out.controlled_vocabularies()
    # Open the run and spectrum list sections
    with out.run(id="my_analysis"):
        spectrum_count = len(scans) + sum([len(products) for _, products in scans])
        with out.spectrum_list(count=spectrum_count):
            for scan, products in scans:
                # Write Precursor scan
                out.write_spectrum(
                    scan.mz_array, scan.intensity_array,
                    id=scan.id, params=[
                        "MS1 Spectrum",
                        {"ms level": 1},
                        {"total ion current": sum(scan.intensity_array)}
                     ])
                # Write MSn scans
                for prod in products:
                    out.write_spectrum(
                        prod.mz_array, prod.intensity_array,
                        id=prod.id, params=[
                            "MSn Spectrum",
                            {"ms level": 2},
                            {"total ion current": sum(prod.intensity_array)}
                         ],
                         # Include precursor information
                         precursor_information={
                            "mz": prod.precursor_mz,
                            "intensity": prod.precursor_intensity,
                            "charge": prod.precursor_charge,
                            "scan_id": prod.precursor_scan_id,
                            "activation": ["beam-type collisional dissociation", {"collision energy": 25}],
                            "isolation_window": [prod.precursor_mz - 1, prod.precursor_mz, prod.precursor_mz + 1]
                         })
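
The example above leaves get_scan_data() undefined. A minimal sketch of the shape it is assumed to return, inferred only from how the example consumes it: a list of (scan, products) pairs whose items expose mz_array, intensity_array, id and the precursor_* attributes. The Scan and Product names and all values are hypothetical, and plain lists stand in for what would typically be NumPy arrays.

```python
from collections import namedtuple

# Hypothetical containers; field names mirror the attributes the example reads.
Scan = namedtuple("Scan", ["id", "mz_array", "intensity_array"])
Product = namedtuple("Product", [
    "id", "mz_array", "intensity_array",
    "precursor_mz", "precursor_intensity", "precursor_charge", "precursor_scan_id",
])


def get_scan_data():
    """Return a list of (precursor scan, [product scans]) pairs with made-up values."""
    ms1 = Scan("scan=1", [100.0, 200.0, 300.0], [10.0, 50.0, 5.0])
    ms2 = Product("scan=2", [50.0, 150.0], [3.0, 7.0],
                  precursor_mz=200.0, precursor_intensity=50.0,
                  precursor_charge=2, precursor_scan_id=ms1.id)
    return [(ms1, [ms2])]


scans = get_scan_data()
# Same count the example passes to spectrum_list(): one per MS1 scan plus all products.
spectrum_count = len(scans) + sum(len(products) for _, products in scans)
print(spectrum_count)
```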

Citing

If you use psims in an academic project, please cite:

Klein, J. A., & Zaia, J. (2018). psims - A declarative writer for mzML and mzIdentML for Python. Molecular & Cellular Proteomics, mcp.RP118.001070. https://doi.org/10.1074/mcp.RP118.001070


psims's Issues

Problem with writing precursor information

Hi,

I'm trying to use psims for writing mzML, but I'm running into problems when writing MS2 precursor information.

.....
    id_format_str = "controllerType=0 controllerNumber=1 scan={i}"
    with MzMLWriter(file) as writer:
        # Add default controlled vocabularies
        writer.controlled_vocabularies()
        # Open the run and spectrum list sections
        with writer.run(id="Simulated Run"):
            spectrum_count = len(scans) + sum([len(products) for _, products in scans])
            with writer.spectrum_list(count=spectrum_count):
                for scan, products in scans:
                    # Write Precursor scan
                    try:
                        index_of_max_i = np.argmax(scan.i)
                        max_i = scan.i[index_of_max_i]
                        mz_at_max_i = scan.mz[index_of_max_i]
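                    except ValueError:
                        # Empty scan: np.argmax raises ValueError, so fall back to zeros
                        # (max_i must be set too, it is used below)
                        max_i = 0
                        mz_at_max_i = 0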
                    # breakpoint()
                    writer.write_spectrum(
                        scan.mz,
                        scan.i,
                        id=id_format_str.format(i=scan.id),
                        params=[
                            "MS1 Spectrum",
                            {"ms level": 1},
                            {"scan start time": scan.retention_time},
                            {"total ion current": sum(scan.i)},
                            {"base peak m/z": mz_at_max_i},
                            {"base peak intensity": max_i},
                        ],
                    )
                    # Write MSn scans
                    for prod in products:
                        writer.write_spectrum(
                            prod.mz,
                            prod.i,
                            id=id_format_str.format(i=prod.id),
                            params=[
                                "MSn Spectrum",
                                {"ms level": 2},
                                {
                                    "scan start time": scan.retention_time,
                                    "unitName": "seconds",
                                },
                                {"total ion current": sum(prod.i)},
                            ],
                            # TOFIX adding precursor information makes psims crash
                            # Include precursor information
                            # precursor_information=prec_info
                            # precursor_information={
                            #     "mz": prod.precursor_mz,
                            #     "intensity": prod.precursor_i,
                            #     "charge": prod.precursor_charge,
                            #     "scan_id": prod.precursor_scan_id,
                            precursor_information={
                                "mz": 300.0,
                                "intensity": 5e6,
                                "charge": 2,
                                "scan_id": id_format_str.format(i=scan.id),
                            },
                            # },
                        )

This crashes with the following error:

cls = <class 'psims.mzml.components.Activation'>, obj = None
kwargs = {'context': {'InstrumentConfiguration': InstrumentConfiguration
{}, 'DataProcessing': DataProcessing
{}, 'Spectrum': Spectrum
{'controllerType=0 controllerNumber=1 scan=0': 'controllerType=0 controllerNumber=1 scan=0'}}}

    @classmethod
    def ensure(cls, obj, **kwargs):
        if isinstance(obj, cls):
            return obj
        else:
>           kwargs.update(obj)
E           TypeError: 'NoneType' object is not iterable

.tox/py38/lib/python3.8/site-packages/psims/document.py:704: TypeError

I'm using python 3.8 and psims 0.1.30

cls = <class 'psims.mzml.components.Activation'>, obj = None makes me think that I have to set the activation method or energy somewhere. Is this correct, and could you point me in the right direction on how to fix it?

Best Manuel
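
The traceback shows Activation.ensure receiving obj = None, which suggests the crash is triggered by the missing "activation" entry; the README example does pass one. A minimal sketch of a helper that always fills that key, assuming the key names from the README example. The helper name and the default values are hypothetical, and whether this resolves the crash on psims 0.1.30 is not confirmed here.

```python
def make_precursor_information(mz, intensity, charge, scan_id, activation=None):
    """Build a precursor_information dict that always carries an "activation" entry."""
    if activation is None:
        # Default copied from the README example; replace with the real method and energy.
        activation = ["beam-type collisional dissociation", {"collision energy": 25}]
    return {
        "mz": mz,
        "intensity": intensity,
        "charge": charge,
        "scan_id": scan_id,
        "activation": activation,
    }


prec_info = make_precursor_information(
    300.0, 5e6, 2, "controllerType=0 controllerNumber=1 scan=0")
# Then: writer.write_spectrum(..., precursor_information=prec_info)
```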

Recommended usage pattern for extracting & saving just header info from mzML

If I have an mzML file and want to use pyteomics (to load) and psims (to write out), is there a suggested usage pattern to achieve the following:

Load mzML file
Extract just the header/metadata (i.e. exclude spectra and chromatograms)
Save this header/metadata to a new mzML file

(Not sure if this is the best place to ask).
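
One library-agnostic way to get a header-only copy is to drop the spectrumList and chromatogramList elements with the standard library's ElementTree. This is not a pyteomics or psims pattern, just a sketch under assumptions: the file uses the mzML 1.1 namespace http://psi.hupo.org/ms/mzml, it fits in memory, and it is not an indexed mzML (the offsets in an index would be stale after stripping). The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

MZML_NS = "http://psi.hupo.org/ms/mzml"  # assumed mzML 1.1 namespace


def strip_spectra(xml_text):
    """Return the document with spectrumList and chromatogramList removed."""
    # Serialize with a default namespace instead of generated ns0: prefixes.
    ET.register_namespace("", MZML_NS)
    root = ET.fromstring(xml_text)
    drop = {"{%s}spectrumList" % MZML_NS, "{%s}chromatogramList" % MZML_NS}
    # Snapshot the element list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in drop:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


sample = (
    '<mzML xmlns="http://psi.hupo.org/ms/mzml" version="1.1.0">'
    '<fileDescription/>'
    '<run id="r1">'
    '<spectrumList count="1"><spectrum id="scan=1" index="0"/></spectrumList>'
    '<chromatogramList count="0"/>'
    '</run>'
    '</mzML>'
)
header_only = strip_spectra(sample)
print(header_only)
```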

NameError: name 'err' is not defined

from psims.transform.mzml import MzMLToMzMLb


for fn in tqdm(fn_mzML):
    fn_out = fn.replace('mzML', 'mzMLb')
    print(fn, '-->', fn_out)
    MzMLToMzMLb(fn, fn_out).write()

> ./input/MTBLS1569/mzML/Control_3h_R2.mzML --> ./input/MTBLS1569/mzMLb/Control_3h_R2.mzMLb

Threw:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-c3d979457c34> in <module>
      5     fn_out = fn.replace('mzML', 'mzMLb')
      6     print(fn, '-->', fn_out)
----> 7     MzMLToMzMLb(fn, fn_out).write()

~/miniconda3/envs/ms-mint/lib/python3.9/site-packages/psims/transform/mzml.py in __init__(self, input_stream, output_stream, transform, transform_description, sort_by_scan_time, **hdf5args)
    340         self.sort_by_scan_time = sort_by_scan_time
    341         self.reader = MzMLParser(input_stream, iterative=True)
--> 342         self.writer = MzMLbWriter(output_stream, **hdf5args)
    343         self.psims_cv = self.writer.get_vocabulary('PSI-MS').vocabulary

~/miniconda3/envs/ms-mint/lib/python3.9/site-packages/psims/mzmlb/__init__.py in __init__(self, *args, **kwargs)
      4     class MzMLbWriter(object):
      5         def __init__(self, *args, **kwargs):
----> 6             raise err

NameError: name 'err' is not defined

Installed with pip:

Downloading psims-0.1.36-py2.py3-none-any.whl (11.6 MB)
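
The traceback ends in a stub class whose __init__ runs raise err. One plausible cause, assuming the stub is defined inside an except ImportError as err: block (the psims source is not shown here beyond the traceback): Python 3 deletes the name bound by "as" when the except block ends, so the later raise err fails with NameError instead of reporting the missing import. The sketch below reproduces the pattern with a deliberately missing, hypothetical module and shows a variant that keeps a reference to the exception.

```python
# Broken pattern: "err" no longer exists once the except block has finished.
try:
    import module_that_is_not_installed_for_this_demo  # hypothetical missing dependency
except ImportError as err:
    class BrokenWriter(object):
        def __init__(self, *args, **kwargs):
            raise err

# Fixed pattern: store the exception under a name that outlives the except block.
try:
    import module_that_is_not_installed_for_this_demo  # hypothetical missing dependency
except ImportError as err:
    _import_error = err

    class FixedWriter(object):
        def __init__(self, *args, **kwargs):
            raise _import_error

outcomes = {}
for cls in (BrokenWriter, FixedWriter):
    try:
        cls("out.mzMLb")
    except Exception as exc:
        # Record which exception type the caller actually sees.
        outcomes[cls.__name__] = type(exc).__name__
print(outcomes)
```

If this is what happens in psims 0.1.36, the NameError hides an import failure for an optional dependency of the mzMLb writer, presumably an HDF5-related package given the hdf5args parameter in the traceback.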

Specify version of CV to use?

My unit tests fail each time the CV OBO versions change. Is there any way to control which version is used when calling writer.controlled_vocabularies()?

e.g.

$ diff a.mzml b.mzml 
5,6c5,6
<       <cv URI="https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo" fullName="PSI-MS" id="PSI-MS" version="4.1.24"/>
<       <cv URI="http://ontologies.berkeleybop.org/uo.obo" fullName="UNIT-ONTOLOGY" id="UO" version="releases/2019-03-29"/>
---
>       <cv URI="https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo" fullName="PSI-MS" id="PSI-MS" version="4.1.25"/>
>       <cv URI="http://ontologies.berkeleybop.org/uo.obo" fullName="UNIT-ONTOLOGY" id="UO" version="releases/2019-04-01"/>
56c56
<   <fileChecksum>25aefe7cfa862b2d10c1de97b8610e8bba2874d9</fileChecksum>
---
>   <fileChecksum>19cac13f7e784d39512d32eae6d77a14319f8651</fileChecksum>
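
Independent of whether psims lets the CV version be pinned, a test-side workaround is to mask the two things that differ in the diff above (the version attribute of each cv element and the fileChecksum) before comparing files. A minimal sketch using only the standard library; the function name and the "X" placeholder are hypothetical.

```python
import re

# Matches the version attribute of <cv .../> elements (not <cvParam>).
_CV_VERSION = re.compile(r'(<cv\b[^>]*\bversion=")[^"]*(")')
# Matches the hex digest inside <fileChecksum>.
_CHECKSUM = re.compile(r"(<fileChecksum>)[0-9a-fA-F]+(</fileChecksum>)")


def normalize_mzml(text):
    """Mask CV versions and the file checksum so they do not affect comparisons."""
    text = _CV_VERSION.sub(r"\1X\2", text)
    return _CHECKSUM.sub(r"\1X\2", text)


a = ('<cv URI="http://ontologies.berkeleybop.org/uo.obo" fullName="UNIT-ONTOLOGY" '
     'id="UO" version="releases/2019-03-29"/>\n'
     '<fileChecksum>25aefe7cfa862b2d10c1de97b8610e8bba2874d9</fileChecksum>')
b = ('<cv URI="http://ontologies.berkeleybop.org/uo.obo" fullName="UNIT-ONTOLOGY" '
     'id="UO" version="releases/2019-04-01"/>\n'
     '<fileChecksum>19cac13f7e784d39512d32eae6d77a14319f8651</fileChecksum>')
print(normalize_mzml(a) == normalize_mzml(b))
```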

mzIdentML Usage Patterns

The API for MzIdentMLWriter is much rougher than that of MzMLWriter, and was designed on the assumption that the user would supply generators to lazily produce all of the large collections. This makes it less intuitive to build in complex logic that prepares members of those collections on the fly.

I'm opening this issue to initiate discussion of what would be ideal. @JB-MS, if you have a chance to take a look and let me know how you plan to use the API, we might come to a better design.

Conda package

Hi @mobiusklein,

While setting up a new conda environment, I noticed psims is not in bioconda yet. If you want, I can take a look at drafting a bioconda recipe and some GitHub action workflows for automated testing, building and uploading to PyPI etc.

Let me know what you think,
Ralf

[Maintenance] Deprecation warnings for sqlalchemy 2.0

I noticed that the current implementation is throwing a deprecation warning due to sqlalchemy.

... site-packages/psims/controlled_vocabulary/unimod.py:49: MovedIn20Warning: The ``declarative_base()`` function is now available as sqlalchemy.orm.declarative_base(). (deprecated since: 2.0) (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
    Base = declarative_base(metaclass=SubclassRegisteringDeclarativeMeta)

LMK if you would like any help with a PR that starts using the new API

Compatibility with other tools

This is a great tool! I love it. One note for other users: the output is not fully compatible with every tool that takes in mzML. As a workaround, a round-trip mzML -> mzML conversion in ProteoWizard's msconvert totally fixes things.

Binary string in IndexedMzML?

The index id doesn't match the spectrum id, possibly due to py3 binary string issues?
Spectrum id:

<spectrum index="0" defaultArrayLength="0" id="sample=0 period=0 cycle=0 experiment=0">

Offset id:

<offset idRef="b'sample=0 period=0 cycle=0 experiment=0'">2729</offset>

This then causes massive slow-down when parsing the file using mzml.PreIndexedMzML (3 orders of magnitude).

Expected:
Offset id shouldn't have b'' wrapping it. E.g.:

<offset idRef="sample=0 period=0 cycle=0 experiment=0">2729</offset>
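
The b'' wrapper is what Python 3 produces when a bytes object is formatted into a str. A minimal sketch of the symptom and of decoding before formatting; the helper name is hypothetical and this is not taken from the psims source.

```python
def ensure_text(value, encoding="utf-8"):
    """Decode bytes to str; pass other values through str()."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


spectrum_id = b"sample=0 period=0 cycle=0 experiment=0"

# Formatting bytes directly embeds its repr, including the b'' wrapper.
broken = '<offset idRef="{}">2729</offset>'.format(spectrum_id)
# Decoding first yields the id that matches the spectrum element.
fixed = '<offset idRef="{}">2729</offset>'.format(ensure_text(spectrum_id))

print(broken)
print(fixed)
```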

(MzIdentMLWriter) register AttributeError

Hi Joshua,
I've been following the example from the documentation. Running
writer.SearchDatabase.register("search_db_1"), or in fact any other register(), throws:

AttributeError: type object 'SearchDatabase' has no attribute 'kwargs'. Did you mean: 'args'?

Stacktrace below:

  File "/home/lars/work/repos/psims/test.py", line 34, in <module>
    writer.SearchDatabase.register("search_db_1")
  File "/home/lars/.local/share/virtualenvs/psims-PrEsMToZ/lib/python3.10/site-packages/psims/document.py", line 426, in register
    return self.kwargs['context'].register(self._func.__name__, identifier)
  File "/home/lars/.local/share/virtualenvs/psims-PrEsMToZ/lib/python3.10/site-packages/psims/document.py", line 412, in __getattr__
    return getattr(self._func, name)
AttributeError: type object 'SearchDatabase' has no attribute 'kwargs'. Did you mean: 'args'?

(MzIdentMLWriter) SpectrumIdentification.write TypeError missing positional argument xml_file

Hi Joshua,
another error I ran into when following the example.

When running the AnalysisCollection SpectrumIdentification(...).write I get a TypeError:

with writer.analysis_collection():
    writer.SpectrumIdentification(spectra_data_ids_used=[
        "mgf_1", "mgf_2", "mgf_3"
    ], search_database_ids_used=[
        "search_db_1"
    ], spectrum_identification_list_id='spectra_identified_list_1',
        spectrum_identification_protocol_id='spectrum_identification_params_1').write()

Throws:
spectrum_identification_protocol_id='spectrum_identification_params_1').write() TypeError: ComponentBase.write() missing 1 required positional argument: 'xml_file'

Files not closing (was: Truncated files)

Files are being truncated in the middle or at the end of the final spectrum, without the index being written. Occasionally everything works, but usually it does not.

A copy and paste is not displaying properly. I will attach an example shortly.

Update package metadata on PyPI?

Right now proforma tests in Pyteomics are failing on Python 2.7, 3.6 and 3.7. Apparently the latest psims doesn't work on these versions, but it is still installed on those when running pip install psims.

Would this be fixed by updating the PyPI entry?

Edit: on second thought, the current latest release would probably have to be removed completely, or pip-3.7 and the others will choose it over whatever newer version is uploaded.

[Question] Adding 1/k0 information to selected precursor ions

Hello Joshua!

I hope everything is going well!
I had a question on the usage of the package. I am trying to write an mzML file where the spectra
have ion mobility information, and I was wondering whether this is currently supported (or whether there is an option to pass a custom entry at a specific place). Let me know if you can point me to a place in the documentation where I could find that!

Here is a sample of the arguments I am trying to pass to the writer.

sample = {
            "mz_array": mz,
            "intensity_array": intensity,
            "scan_start_time": rt,
            "id": f"scan={i+offset}",
            "params": [
                "MSn Spectrum",
                {"ms level": 2},
                {"total ion current": sum(intensity)},
            ],
            "precursor_information": {
                "activation": [
                    "beam-type collisional dissociation",
                    {"collision energy": 25},
                ],
                "mz": quad_mid,
                "intensity": sum(intensity),
                "charge": 2,
                "isolation_window": [quad_low, quad_mid, quad_high],
                "scan_id": last_precursor_scan_id,
                "selectedIonList": [
                    {
                        ########## HERE ###########
                        "inverse reduced ion mobility": mobility_value,
                    }
                ],
            },
        }

with IndexedMzMLWriter("outfile.mzml") as out:
    ...
    out.write_spectrum(**sample)

And I would like the precursor list section of the mzml to look something like this:

            <precursorList count="1">
                <precursor>
                    <isolationWindow>
                        <cvParam cvRef="MS" accession="MS:1000827" name="isolation window target m/z" value="457.723968505859" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
                        <cvParam cvRef="MS" accession="MS:1000828" name="isolation window lower offset" value="1.5" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
                        <cvParam cvRef="MS" accession="MS:1000829" name="isolation window upper offset" value="0.75" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
                    </isolationWindow>
                    <selectedIonList count="1">
                        <selectedIon>
                            <cvParam cvRef="MS" accession="MS:1000744" name="selected ion m/z" value="457.723968505859" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
                            <cvParam cvRef="MS" accession="MS:1000041" name="charge state" value="2" />
                            <cvParam cvRef="MS" accession="MS:1002815" name="inverse reduced ion mobility" value="1.078628" unitAccession="MS:1002814" unitName="volt-second per square centimeter"/>
                        </selectedIon>
                    </selectedIonList>
                    <activation>
                        <cvParam cvRef="MS" accession="MS:1000133" name="collision-induced dissociation" />
                        <cvParam cvRef="MS" accession="MS:1000045" name="collision energy" value="35.0"/>
                    </activation>
                </precursor>
            </precursorList>

Error on exit from mzML

I'm getting a sporadic exception that occurs in the exit function:

  File "/opt/conda/lib/python3.6/site-packages/psims/xml.py", line 866, in __exit__
    self.end(exc_type, exc_value, traceback)
  File "/opt/conda/lib/python3.6/site-packages/psims/xml.py", line 873, in end
    self.xmlfile.__exit__(exc_type, exc_value, traceback)
  File "/opt/conda/lib/python3.6/site-packages/psims/xml.py", line 747, in __exit__
    self.xmlfile.__exit__(*args)
  File "src/lxml/serializer.pxi", line 925, in lxml.etree.xmlfile.__exit__
  File "src/lxml/serializer.pxi", line 1263, in lxml.etree._IncrementalFileWriter._close
  File "src/lxml/serializer.pxi", line 1269, in lxml.etree._IncrementalFileWriter._handle_error
  File "src/lxml/serializer.pxi", line 199, in lxml.etree._raiseSerialisationError
lxml.etree.SerialisationError: unknown error -1758357554

Is this likely to be in the way we are using it? Or something else?

Multiple unit options are possible for parameter 'base peak intensity' but none were specified

We are using psims to create mzML from TDF files using the Bruker SDK. When assigning units to base peak intensity, I get the following warning.

/usr/local/lib/python3.8/dist-packages/psims/mzml/components.py:585: AmbiguousTermWarning: Multiple unit options are possible for parameter 'base peak intensity' but none were specified
self._check_params()

/usr/local/lib/python3.8/dist-packages/psims/mzml/components.py:614: AmbiguousTermWarning: Multiple unit options are possible for parameter 'base peak intensity' but none were specified
self.write_params(xml_file)

The code section to assign the unit is below.

    mzml_data_struct['writer'].write_spectrum(
        ms1_mz_array, 
        ms1_i_array, 
        id=mzml_data_struct['current_precursor']['spectrum_id'], 
        centroided=centroided_flag,
        scan_start_time=scan_start_time, 
        scan_window_list=[( mzml_data_struct['data_dict']['mz_acq_range_lower'] , mzml_data_struct['data_dict']['mz_acq_range_upper'] )],
        compression=mzml_data_struct['compression'],
        params=[
                {"ms level": 1}, 
                {"total ion current": total_ion_intensity},
                {"base peak intensity": base_peak_intensity, 'unit_accession': 'MS:1000131'},
                {"base peak m/z": base_peak_mz, 'unit_name': 'm/z'}
            ]
        )

Import issues on Python 3.12

Importing psims fails on Python 3.12 with:

>>> import psims
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lev/.pyenv/versions/py3.12/lib/python3.12/site-packages/psims/__init__.py", line 6, in <module>
    from .mzml import (
  File "/home/lev/.pyenv/versions/py3.12/lib/python3.12/site-packages/psims/mzml/__init__.py", line 1, in <module>
    from .writer import (
  File "/home/lev/.pyenv/versions/py3.12/lib/python3.12/site-packages/psims/mzml/writer.py", line 27, in <module>
    from psims.xml import CVParam, XMLWriterMixin, XMLDocumentWriter
  File "/home/lev/.pyenv/versions/py3.12/lib/python3.12/site-packages/psims/xml.py", line 19, in <module>
    from .validation import validate
  File "/home/lev/.pyenv/versions/py3.12/lib/python3.12/site-packages/psims/validation/__init__.py", line 1, in <module>
    from .validator import validate
  File "/home/lev/.pyenv/versions/py3.12/lib/python3.12/site-packages/psims/validation/validator.py", line 1, in <module>
    import pkg_resources
ModuleNotFoundError: No module named 'pkg_resources'

Apparently, this is fixed with pip install setuptools.

I noticed this while dealing with a similar problem with pyteomics. Curiously, import pyteomics does not fail for me locally the same way it does in a GitHub runner, nor do most of the tests except proforma (which fails because the error above makes it think psims is not installed). Hopefully, adding setuptools as a requirement should fix all issues; otherwise, a proper replacement of pkg_resources with a more modern API is needed.
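
As a rough sketch of what such a replacement could look like, assuming pkg_resources is used in psims/validation/validator.py to load data files shipped inside the package (the traceback shows only the import, not how it is used): importlib.resources.files from the standard library (Python 3.9+) reads packaged resources without setuptools. The helper name is hypothetical, and the demo reads a file from the standard library's json package only so that it runs anywhere.

```python
from importlib.resources import files  # standard library, Python 3.9+


def read_package_text(package, resource):
    """Read a text resource shipped inside an importable package."""
    return files(package).joinpath(resource).read_text(encoding="utf-8")


# Demo on a stdlib package; a real caller would name its own package and data file.
source = read_package_text("json", "__init__.py")
print(len(source) > 0)
```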