gatkin / declxml

Declarative XML processing for Python

Home Page: https://declxml.readthedocs.io/en/latest/

License: MIT License

Python 99.27% Makefile 0.73%

xml xml-parser xml-serializer python xml-processing python3 python27

declxml's Issues

Access a node's parent attribute

Hi! Is this functionality implemented?

xml.integer('..', attribute='nFrames', alias='_nFrames')]

I need for each child to have a copy of its parent "nFrames" attribute. Is this achievable with XPath?
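One possible workaround, sketched with the standard library rather than declxml itself: copy the parent's attribute onto each child before parsing, so that an ordinary attribute processor on the child can pick it up. The element names (clip, frame) and the helper copy_parent_attribute are made up for illustration; whether declxml accepts '..' as an element path is not confirmed here.

```python
import xml.etree.ElementTree as ET


def copy_parent_attribute(root, attribute, child_attribute):
    """Copy `attribute` from every element that has it onto its direct children."""
    for parent in root.iter():
        value = parent.get(attribute)
        if value is None:
            continue
        for child in parent:
            # Only attributes change here, so iterating the tree stays safe.
            child.set(child_attribute, value)
    return root


# Hypothetical document: each <frame> should see its parent's nFrames.
doc = ET.fromstring('<clip nFrames="3"><frame id="a"/><frame id="b"/></clip>')
copy_parent_attribute(doc, 'nFrames', '_nFrames')
copied = [frame.get('_nFrames') for frame in doc.findall('frame')]
```

The modified tree could then be turned back into a string with ET.tostring and handed to xml.parse_from_string.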

Optional aggregate processors - a pain point

I'm confused about the behaviour of aggregate processors (dictionary and in particular user_object) when using required=False. Consider the following example:

>>> import declxml as xml
>>>
>>> person = xml.dictionary('person', [
...     xml.string('name', required=True),
... ], required=False)
>>> root = xml.dictionary('root', [
...     person,
... ])
>>>
>>> xml.parse_from_string(root, '<root></root>')
{'person': {}}
>>> xml.serialize_to_string(root, {'person': {}})
'<root />'

I find the default value {} for missing optional dictionaries a bit strange. It's consistent with the primitive processors (that return e.g. '' instead of None or a missing dict key for strings), but it's counterintuitive to me that {'person': {}} is a valid input to the root processor, but {} is invalid for the person processor:

>>> xml.serialize_to_string(person, {})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/centos/declxml/declxml.py", line 356, in serialize_to_string
    root = root_processor.serialize(value, state)
  File "/home/centos/declxml/declxml.py", line 1009, in serialize
    self._serialize(end_element, value, state)
  File "/home/centos/declxml/declxml.py", line 1041, in _serialize
    child.serialize_on_parent(element, child_value, state)
  File "/home/centos/declxml/declxml.py", line 1261, in serialize_on_parent
    state.raise_error(MissingValue, self._missing_value_message(parent))
  File "/home/centos/declxml/declxml.py", line 1369, in raise_error
    raise exception_type(error_message)
declxml.MissingValue: Missing required value for element "name" at person/name

This becomes particularly annoying when it comes to optional user_object processors: when the corresponding element is omitted from the XML, the parser produces an instance of the custom class with no attributes set, which I would think is rarely meaningful.

In my current codebase I'm using the following workaround in my base model class, which results in missing objects parsing to None:

    def __new__(cls, *args, **kwargs):
        if all(v is None for v in kwargs.values()):
            return None
        return super().__new__(cls)

This could alternatively be achieved with an after_parse hook. Either way, though, it's not perfect: in the case where the subprocessor validates against an empty element, empty elements appear as if they are missing, and as far as I can tell there's no way to distinguish them. For example, if we set required=False on person.name, the documents <root /> and <root><person/></root> will parse to the same structure.

Changing the way optional processors work would obviously be a large breaking change, but what do you think about adding a default= parameter to dictionary, user_object (and maybe array)? Then users could simply specify default=None.
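The after_parse variant mentioned above could look roughly like the sketch below. The function name missing_to_none is hypothetical, and the (state, value) signature is an assumption about how declxml calls its hooks. The limitation described above still applies: an empty element and a missing element both collapse to None.

```python
def _is_empty(value):
    """True for the placeholder values produced when nothing was parsed."""
    return value is None or value == '' or value == [] or value == {}


def missing_to_none(_state, value):
    """Return None when no field of the parsed aggregate carries a value."""
    # Dictionaries are inspected directly, plain objects through their __dict__.
    fields = value if isinstance(value, dict) else vars(value)
    if all(_is_empty(field) for field in fields.values()):
        return None
    return value


class Person:
    """Stand-in for a user_object class; not taken from the issue."""

    def __init__(self, name=None):
        self.name = name
```

With this in place, missing_to_none(None, {}) and missing_to_none(None, Person()) both give None, while a populated dictionary or object is returned unchanged.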

None values for namedtuples and objects not handled when serializing

Serialization fails if a namedtuple or object value is None, because the processor attempts to convert the value to a dictionary by calling _asdict() or accessing __dict__ on None.

non-required named_tuple

Hi, first thanks for the nice library, looks well-designed and fits the needs of a project I am currently working on.

I have a (heavily) nested document structure, where I use a myriad of named tuples to represent the structure.

When testing it on a real-world example, the named tuple processors work OK in the case the related XML is there, but it does start acting up in case it's not -- even though I specifically set the processor to consider it non-required.

Snippets: (slightly altered)

Processor:

import declxml as xml


processor = xml.named_tuple(
    "Invoice",
    Invoice,
    [
...
        xml.named_tuple(
            "Reference",
            Reference,
            [xml.string("ID")],
            required=False,
        ),
    ]
)

Named tuple definition:

from collections import namedtuple

Reference = namedtuple("Reference", ["ID"])

The issue I am getting is:

dict_value = {}

    def _from_dict(dict_value):
>       return tuple_type(**dict_value)
E       TypeError: __new__() takes exactly 2 arguments (1 given)

My best guess is that it has something to do with the named_tuple parsing, where it doesn't really check for any non-required values:

declxml/declxml.py

Line 1515 in afefe8a

return tuple_type(**dict_value)

Can you confirm, or am I missing something?

I did check that it works when I specify it as an array (much like the examples given in the docs), but I actually want a 0-or-1 relationship rather than a 0-to-n relationship.
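To make the guess concrete, here is a minimal sketch of the kind of guard the traceback suggests is missing. It is not declxml's actual code; tuple_from_dict is a hypothetical stand-in for the _from_dict helper shown above, and returning None for a missing element is one possible design, not confirmed behaviour.

```python
from collections import namedtuple

Reference = namedtuple("Reference", ["ID"])


def tuple_from_dict(tuple_type, dict_value):
    """Build the named tuple, or return None when nothing was parsed."""
    # An empty dict is what an absent optional element currently yields,
    # and tuple_type(**{}) raises TypeError for tuples with required fields.
    if not dict_value:
        return None
    return tuple_type(**dict_value)


missing = tuple_from_dict(Reference, {})
present = tuple_from_dict(Reference, {"ID": "42"})
```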

Calling `xml.serialize_to_string()` without an indent leaves out `<?xml version="1.0" encoding="utf-8"?>`

Running

import json
import declxml as xml

processor = xml.dictionary('Container', [
    xml.string('.', attribute='test')
])

container = """
{
    "test": 1
}
"""

print('Without indent:', xml.serialize_to_string(processor, json.loads(container)))
print('-----------------------------')
print('With indent', xml.serialize_to_string(processor, json.loads(container), indent='   '))

produces

Without indent: <Container test="1" />
-----------------------------
With indent <?xml version="1.0" encoding="utf-8"?>
<Container test="1"/>
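Until the two code paths agree, a caller-side workaround is to add the declaration when it is absent. This is a plain string helper, not part of declxml; the name with_declaration is made up, and the declaration text is copied from the indented output above.

```python
XML_DECLARATION = '<?xml version="1.0" encoding="utf-8"?>'


def with_declaration(xml_text):
    """Prepend the XML declaration unless the text already starts with one."""
    if xml_text.lstrip().startswith('<?xml'):
        return xml_text
    return XML_DECLARATION + '\n' + xml_text


unindented = with_declaration('<Container test="1" />')
unchanged = with_declaration(unindented)
```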

Are there any array attributes?

I have an xml as follows:

<elements resource-type="VNF">
    <element>
        <is-active>true</is-active>
        <is-nullable>false</is-nullable>
        <is-ref>false</is-ref>
        <max-length>8</max-length>
        <min-length>8</min-length>
        <name-of-the-field>Site Code</name-of-the-field>
        <sequence-id>2</sequence-id>
        <value>ABC-1234</value>
    </element>
</elements>

How do I attach the attribute to this array?

The following code does not seem to work, since array does not take an attribute argument:

resource_info_processor = xml.dictionary('resource-info', [
    xml.array(element_processor, attribute='resource-type', alias='elements'),
    xml.boolean('is-active'),
])

Send xml data to constructor

Hi! I really like declxml, and it almost does exactly what I was looking for. I understand if this is not something you want to support.

I would like to construct a custom class from an XML file, but do the transformations of the data in the constructor. As I understand the setup right now, attributes are "magically" set from declxml, instead of being sent to the constructor where I could transform them before saving them.

Is there a way to do this with declxml?
If not, is this something you would consider adding?

Order changed when serialized to string

For this dictionary:

{'Name': 'Shear',
 'Type': 'RASResultsMap',
 'Checked': 'True',
 'Filename': '.\\Test Plan\\Shear Stress (08Aug2020 08 00 00).vrt',
 'MapParameters': [{'MapType': 'Shear',
   'LayerName': 'Shear Stress',
   'OutputMode': 'Stored Current Terrain',
   'StoredFilename': '.\\Test Plan\\Shear Stress (08Aug2020 08 00 00).vrt',
   'Terrain': 'CBR_041619',
   'ProfileIndex': '576',
   'ProfileName': '08Aug2020 08:00:00',
   'ArrivalDepth': '0'}]}

and this processor:

ras_processor = xml.dictionary('Layer', [
    xml.string('.', attribute='Name'),
    xml.string('.', attribute='Type'),
    xml.string('.', attribute='Checked'),
    xml.string('.', attribute='Filename'),
    xml.array(xml.dictionary('MapParameters', [
        xml.string('.', attribute='MapType'),
        xml.string('.', attribute='LayerName'),
        xml.string('.', attribute='OutputMode'),
        xml.string('.', attribute='StoredFilename'),
        xml.string('.', attribute='Terrain'),
        xml.string('.', attribute='ProfileIndex'),
        xml.string('.', attribute='ProfileName'),
        xml.string('.', attribute='ArrivalDepth'),
    ])),
])

After I run xml.serialize_to_string(ras_processor, test, indent=' '), the attributes come out in alphabetical order. I would prefer to keep the same attribute order as the input dictionary. Can this be done?

<?xml version="1.0" encoding="utf-8"?>
<Layer Checked="True" Filename=".\Test Plan\Shear Stress (08Aug2020 08 00 00).vrt" Name="Shear" Type="RASResultsMap">
    <MapParameters ArrivalDepth="0" LayerName="Shear Stress" MapType="Shear" OutputMode="Stored Current Terrain" ProfileIndex="576" ProfileName="08Aug2020 08:00:00" StoredFilename=".\Test Plan\Shear Stress (08Aug2020 08 00 00).vrt" Terrain="CBR_041619"/>
</Layer>

Thanks!
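A possible explanation, assuming declxml serializes through xml.etree.ElementTree: before Python 3.8, ElementTree sorted attributes alphabetically when writing, and from Python 3.8 on it keeps the order in which they were set. The snippet below checks the behaviour of the running interpreter using only the standard library; it does not involve declxml.

```python
import xml.etree.ElementTree as ET

# Set attributes in a deliberately non-alphabetical order.
layer = ET.Element('Layer')
for key, value in [('Name', 'Shear'), ('Type', 'RASResultsMap'), ('Checked', 'True')]:
    layer.set(key, value)

serialized = ET.tostring(layer, encoding='unicode')

# On Python 3.8+ the positions increase in insertion order;
# on older interpreters 'Checked' would come first.
positions = [serialized.index(name + '=') for name in ('Name', 'Type', 'Checked')]
order_preserved = positions == sorted(positions)
```

If order_preserved is False on your interpreter, upgrading Python may be enough to get the input order back.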

Array with nodes that have attributes AND values

Hello,

I have an XML of a structure:

<categories>
    <category parentId="" id="000000001" order="1" tagid="">Red</category>
    <category parentId="000000001" id="000000002" order="1" tagid="">Blue</category>
    <category parentId="000000002" id="000000254" order="1" tagid="">Green</category>
</categories>

So, you see, it's an array of dictionaries.

Could you please give me a hint how to parse such data?
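For reference, the target structure can be produced with the standard library alone, which may help when checking a declxml processor against it. This is not declxml code, and the key name 'text' for the element's value is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

document = """
<categories>
    <category parentId="" id="000000001" order="1" tagid="">Red</category>
    <category parentId="000000001" id="000000002" order="1" tagid="">Blue</category>
</categories>
""".strip()

root = ET.fromstring(document)

# One dictionary per <category>: its attributes plus its text content.
categories = [
    dict(category.attrib, text=category.text)
    for category in root.findall('category')
]
```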

Support for reading data from attributes on childless objects?

After reading through the docs here:

http://declxml.readthedocs.io/en/latest/guide.html#attributes

I realized the error I was receiving earlier was due to me trying to read attributes as if they were elements.

However, the docs as linked above only show how to read an attribute from a child element to add an attribute to the parent. What if the element-in-question has no children?

Following the example code, say the XML looked like this:

How would one parse this xml with declxml such that the resulting output had "favoriteFood" with a value of pizza as an attribute of author?

Parse failure on dictionary with attribute and value

When I parse this:

<Mode System="B">Cable</Mode>

with:

xml.dictionary('Mode', [
        xml.string('.', attribute='System')
    ])

the result when serialized and printed is:

<Mode System="B"/>

What happened to the value (Cable)?

Full code:

import declxml as xml

proc = xml.dictionary('Mode', [
    xml.string('.', attribute='System')
])
bar = xml.parse_from_string(proc, '<?xml version="1.0" encoding="UTF-8"?><Mode System="B">Cable</Mode>')
print(xml.serialize_to_string(proc, bar, indent='   '))

Create processor from typing annotations

Hi @gatkin,

This looks like a great package, providing a much saner way to interact with XML data; the documentation is complete and clear as well.

Primitive processors

Being able to use both classes and namedtuples is very convenient, but I feel there's some duplication of info going on if you're using type annotations (as they've been added in recent Python versions). To demonstrate what I mean:

from dataclasses import dataclass
import declxml as xml

@dataclass
class Extent:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

extent_processor = xml.user_object("extent", Extent, [
    xml.floating_point("xmin"),
    xml.floating_point("ymin"),
    xml.floating_point("xmax"),
    xml.floating_point("ymax"),
])

I'm stating twice that the attributes should be floats. It's pretty straightforward to define a function which does this for you:

type_mapping = {
    bool: xml.boolean,
    int: xml.integer,
    float: xml.floating_point,
    str: xml.string,
}

def make_processor(datacls):
    fields = []
    for name, vartype in datacls.__annotations__.items():
        xml_type = type_mapping[vartype]
        field = xml_type(name)
        fields.append(field)
    return xml.user_object(datacls.__name__.lower(), datacls, fields)

extent_processor = make_processor(Extent)

This is all you need for simple processors (for typing.NamedTuple as well, mutatis mutandis).

Aggregate processors

Aggregate processors are easy to include via recursion, although you probably want to encode the "aggregateness" somewhere. After some playing around, I find encoding it in the type to be most straightforward:

import abc

class Aggregate(abc.ABC):
    pass

@dataclass
class Extent(Aggregate):
    xmin: float
    ymin: float
    xmax: float
    ymax: float

@dataclass
class SpatialData(Aggregate):
    epsg: str
    extent: Extent

def make_processor(datacls):
    fields = []
    for name, vartype in datacls.__annotations__.items():
        if issubclass(vartype, Aggregate):
            field = make_processor(vartype)
        else:
            xml_type = type_mapping[vartype]
            field = xml_type(name)
        fields.append(field)
    return xml.user_object(datacls.__name__.lower(), datacls, fields)

spatialdata_processor = make_processor(SpatialData)

This provides a very concise way of defining (nested) data structures -- which I'd generally want to do anyway -- and turn them into XML processors with a single function call and adding a new base class (which can even be monkey-patched at runtime, if needed).

I'm not sure you'd really want to put this in declxml (see the trouble below), but I do think it's useful (and non-trivial) enough to maybe warrant a section in the documentation. What do you reckon?

Optional, List, etc

I haven't tried it yet, but I'm pretty sure you can use typing.Optional and typing.List to map to the declxml equivalents.
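A rough sketch of how such annotations could be inspected, using typing.get_origin and typing.get_args (Python 3.8+). It only classifies each annotation; mapping the result onto xml.array or required=False is left out, and the function name classify and the labels it returns are made up for this example.

```python
from typing import List, Optional, Union, get_args, get_origin


def classify(annotation):
    """Return a (kind, inner_type) pair describing a field annotation."""
    origin = get_origin(annotation)
    args = get_args(annotation)
    if origin is list:
        # List[T] would map to an array of T.
        return ('array', args[0])
    if origin is Union and len(args) == 2 and type(None) in args:
        # Optional[T] is Union[T, None]; pick the member that is not NoneType.
        inner = args[0] if args[1] is type(None) else args[1]
        return ('optional', inner)
    return ('required', annotation)


kinds = [classify(List[int]), classify(Optional[str]), classify(float)]
```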

Hiccups

There's some trouble due to the fact that XML separates attributes from elements. For the XMLs I'm working with, I don't really see a reason to distinguish between attributes and elements (of course, neither does JSON, or TOML, etc.). But you need to encode it somehow, or it won't end up in the right place in the XML. I can solve it in a slightly hacky way, by (ab)using typing.Union:

from typing import Union

class Attribute(abc.ABC):
    pass

@dataclass
class Example:
    a: Union[Attribute, int]
    b: int
    c: int

example = Example(1, 2, 3)

To write an XML:

<example a="1">
<b>2</b>
<c>3</c>
</example>

We can check again by inspecting the annotations:

def is_union(vartype):
    return hasattr(vartype, "__args__") and (vartype.__args__[0] is Attribute)

This shouldn't trip up any type checker, but it is clearly not quite intended use: you'll never provide an Attribute as the value.
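An alternative that avoids the Union trick, offered as a suggestion rather than something declxml supports directly: dataclasses.field accepts a metadata mapping, which can carry the attribute marker without touching the type. The 'xml' key and its 'attribute' value are conventions invented for this sketch.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Example:
    # The marker lives in the field metadata, so the type stays a plain int.
    a: int = field(metadata={'xml': 'attribute'})
    b: int
    c: int


def attribute_names(datacls):
    """Names of the fields marked as XML attributes."""
    return [f.name for f in fields(datacls) if f.metadata.get('xml') == 'attribute']


example = Example(1, 2, 3)
marked = attribute_names(Example)
```

A make_processor function could then branch on the metadata instead of inspecting the annotation's __args__.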

There are more issues with the fact that sometimes you need to include names that aren't part of the dataclass or the namedtuple, e.g. an array in the XML, where every entry is tagged "item":

<item value="-5980.333333333333" label="-5980" alpha="255" color="#0d0887"/>
<item value="-5863.464309399999" label="-5863" alpha="255" color="#1b068d"/>

I can't use something as general as "item" as my class name. This is how I want to see it in Python:

@dataclass
class Color(Attribute):
    value: str
    label: str
    alpha: str
    color: str

Of course, I can just fall back to regular use at any time, and provide the name which is only part of the processor, not of the dataclass:

color_processor = xml.user_object(
    "item",
    Color,
    [
        xml.string(".", attribute="value"),
        xml.string(".", attribute="label"),
        xml.string(".", attribute="alpha"),
        xml.string(".", attribute="color"),
    ],
)

At any rate, you can just mix and match as needed: when everything's encoded in the dataclass or namedtuple, you can generate the processors automatically; if not, you just have to write a few extra lines or provide an explicit name.

Similarly, there are cases where aliases are required. In my case, I'm lowercasing class names and replacing underscores with dashes, so it's sorta implicitly defined. Stuff like this makes me think it might be smarter to let the user figure out the details of their idiosyncratic XML format, and provide a "base recipe" to help them along a little.

Or perhaps you see a better way that is nice and general?

Cannot parse XML into frozen attrs classes

Parsing XML into an immutable object such as

@attr.s(frozen=True)
class Book(object):

    title = attr.ib()
    author = attr.ib()

fails because the class does not allow calls to the __setattr__ method
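A sketch of a constructor-based approach that would sidestep this: gather the parsed values first, then build the instance in one call. A frozen dataclass from the standard library stands in for the attrs class here, the field values are placeholders, and build_instance is a hypothetical helper, not part of declxml.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Book:
    title: str
    author: str


def build_instance(cls, parsed_values):
    """Create the object through its constructor instead of setattr."""
    return cls(**parsed_values)


book = build_instance(Book, {'title': 'Example Title', 'author': 'Example Author'})

# Attribute assignment after construction is what fails on frozen classes.
try:
    book.title = 'Other'
    mutation_allowed = True
except FrozenInstanceError:
    mutation_allowed = False
```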

Recursive Elements

Is there a way to model a recursive element?

e.g.

<foos>
    <foo>
        <bar>hello world</bar>
        <foos>
            <foo>
                <bar>blah blah</bar>
                <foos/>
            </foo>
        </foos>
    </foo>
</foos>

where a list of foos has a foo which itself may have a list of foos
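To pin down the shape being asked for, here is the recursion written against the standard library. It does not answer whether a declxml processor can refer to itself; parse_foos and the dictionary keys are illustrative only.

```python
import xml.etree.ElementTree as ET


def parse_foos(foos_element):
    """Turn a <foos> element into a list of nested dictionaries."""
    result = []
    for foo in foos_element.findall('foo'):
        nested = foo.find('foos')
        result.append({
            'bar': foo.findtext('bar'),
            # Recurse only when the nested <foos> element is present.
            'foos': parse_foos(nested) if nested is not None else [],
        })
    return result


document = (
    '<foos><foo><bar>hello world</bar>'
    '<foos><foo><bar>blah blah</bar><foos/></foo></foos>'
    '</foo></foos>'
)
parsed = parse_foos(ET.fromstring(document))
```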

Add type hints to public API

For better documentation and integration with other projects that prefer to use type hints, all public API functions should be updated to include type hints.

Exception messages are too vague

First, thanks for writing this. It's a great little tool that does its job well, and I love the declarative style.

One issue I wanted to point out though is that the exceptions that are thrown leave much to be desired in terms of usefulness. I'm stuck right now on processing an xml file because I get:

declxml.MissingValue: Missing required element: "name"

In my case, I've deduced that the problem is likely something not being marked as optional that should be, and I'm sure I'll track it down shortly, but I had to figure that out on my own. At the VERY least, it would be extremely helpful if the error message explained what the current state of the parser was at the time it found an issue. What node was the parser looking at where it couldn't find an attribute "name"? Or is this an issue with the user_object being passed in not having an attribute called "name"? The message is too vague, which led to me having to spend quite a bit of extra time stepping through declxml code to figure out what was going on.

Just a suggestion.

Parser fails on non ASCII strings on python 2

Under Python 2, using non-ASCII characters raises UnicodeEncodeError.
How would you like this to be addressed?

Use attributes as keys for dictionaries

Hi! I'm trying to parse a tricky piece of XML that has some important information in attributes, which I would like to turn into a dictionary. Instead of explaining this in writing, here's some code... I'm trying to write a processor that would make the assertion pass:

import declxml as xml

data = """
    <item>
        <poll name="suggested_numplayers" title="User Suggested Number of Players" totalvotes="25">
            <results numplayers="1">
                <result value="Best" numvotes="0"/>
                <result value="Recommended" numvotes="0"/>
                <result value="Not Recommended" numvotes="9"/>
            </results>
            <results numplayers="2">
                <result value="Best" numvotes="4"/>
                <result value="Recommended" numvotes="12"/>
                <result value="Not Recommended" numvotes="4"/>
            </results>
            <results numplayers="3">
                <result value="Best" numvotes="11"/>
                <result value="Recommended" numvotes="7"/>
                <result value="Not Recommended" numvotes="1"/>
            </results>
        </poll>
    </item>
"""

processor = ...

parsed = xml.parse_from_string(processor, data)
assert parsed == {
    "players": {
        "1": {"Best": 0, "Recommended": 0, "Not Recommended": 9},
        "2": {"Best": 4, "Recommended": 12, "Not Recommended": 4},
        "3": {"Best": 11, "Recommended": 7, "Not Recommended": 1},
    }
}, parsed

I've tried two different ways to get this working.

Using XPath element/@attribute syntax to "select" the value of the element and use the result of that as the key of the dictionary:

processor = xml.dictionary('item', [
    xml.dictionary("poll[@name='suggested_numplayers']", [
        xml.dictionary("results/@numplayers", [
            xml.dictionary("result/@value", [
                xml.integer("result", attribute="numvotes"),
            ])
        ])
    ], alias="players"),
])

This fails with "KeyError: '@'", likely because it searches for an element, not an attribute.

Making each key a primitive processor and using attribute to filter down to the element's value.

processor = xml.dictionary('item', [
    xml.dictionary("poll[@name='suggested_numplayers']", [
        xml.dictionary(xml.string("results", attribute="numplayers"), [
            xml.dictionary(xml.string("result", attribute="value"), [
                xml.integer("result", attribute="numvotes"),
            ])
        ])
    ], alias="players"),
])

This fails with "TypeError: '_PrimitiveValue' object is not subscriptable", likely because it doesn't expect a primitive processor there, but a string.

Is there any way to get the result I'm looking for? Anything similar to what I'm looking for?
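For comparison, the desired structure can be built directly with the standard library, for example as a post-processing step or as a reference to test a processor against. This is not a declxml solution; poll_to_dict is a made-up name, and the sample below is a shortened version of the document above.

```python
import xml.etree.ElementTree as ET


def poll_to_dict(item_xml):
    """Key the poll results by numplayers, then by each result's value."""
    item = ET.fromstring(item_xml)
    poll = item.find("poll[@name='suggested_numplayers']")
    return {
        'players': {
            results.get('numplayers'): {
                result.get('value'): int(result.get('numvotes'))
                for result in results.findall('result')
            }
            for results in poll.findall('results')
        }
    }


sample = """
<item>
    <poll name="suggested_numplayers" totalvotes="25">
        <results numplayers="1">
            <result value="Best" numvotes="0"/>
            <result value="Not Recommended" numvotes="9"/>
        </results>
        <results numplayers="2">
            <result value="Best" numvotes="4"/>
            <result value="Recommended" numvotes="12"/>
        </results>
    </poll>
</item>
""".strip()

parsed = poll_to_dict(sample)
```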

declxml on Arch Linux

Hi @gatkin. Great project! I started using declxml this week and hope to contribute in the future.

I wanted to let you know your project is available on Arch Linux platforms:
https://aur.archlinux.org/packages/python-declxml-git/