generate C++ reader/writer from XSD schema

License: Apache License 2.0

uxsdcxx

Disclaimer: Pre-1.0.0 software. Support for anything might break.

uxsdcxx is a tool that generates a PugiXML-based C++ reader, validator and writer from an XSD schema. It can generate code for a subset of XSD 1.0.

It currently supports:

- Simple types, with the following exceptions:
  - xs:lists are just read into a string.
  - Only enumerations are supported as xs:restrictions of simple types.
  - Restricted string types such as IDREF, NCName etc. aren't validated.
- Complex types.
- Model groups (all, sequence and choice).
- Elements.
- Attributes, except xs:anyAttribute.
  - Default values are supported.
  - When writing, non-zero default values are always written out.

It currently does not support:

- Anything that PugiXML can't read:
  - XML namespaces

Getting started

Install with pip install uxsdcxx, then run uxsdcxx.py foo.xsd. Two files, foo_uxsdcxx.h and foo_uxsdcxx.cpp, will be created.

API

All uxsdcxx functions live in a namespace uxsd.

1. Root element class

For every root element in the schema, a class is generated with load and write functions. For instance,

<xs:element name="foo">
  <xs:complexType>
  ...
  </xs:complexType>
</xs:element>

results in this C++ code:

class foo : public t_foo {
public:
    pugi::xml_parse_result load(std::istream &is);
    void write(std::ostream &os);
};

load() loads from an input stream into this root element's structs and write() writes its content to a given output stream.

Note that root elements with simple types are not supported.

2. Pools

uxsdcxx generates global pools to store multiply-occurring types.

<xs:complexType name="foo">
  <xs:sequence>
    <xs:element name="bar" type="bar" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

generates:

extern std::vector<t_bar> bar_pool;
[...]
struct foo {
    collapsed_vec<t_bar, bar_pool> bars;
};

A collapsed_vec is a size and an offset pointing into a pool. It provides contiguous memory while being able to store an unbounded number of elements. The main limitation of a collapsed_vec is that it's insertable only when its end points to the end of the pool.
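
As a rough illustration of the size-and-offset idea, the sketch below implements a minimal collapsed_vec over a global pool. It is not the generated code: the member names, the contents of t_bar, and the choice to throw std::logic_error on an insert that is not at the end of the pool are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in element type; the real t_bar comes from the schema.
struct t_bar { int value; };

// Global pool, mirroring the `extern std::vector<t_bar> bar_pool;` declaration.
std::vector<t_bar> bar_pool;

// Hypothetical sketch: a (size, offset) window into a pool.
template<class T, std::vector<T> &pool>
class collapsed_vec {
    std::size_t _size = 0;
    std::size_t _offset = 0;
public:
    std::size_t size() const { return _size; }

    T &operator[](std::size_t i) {
        assert(i < _size);
        return pool[_offset + i];
    }

    // Insertion is only possible while this window ends at the end of the pool.
    void push_back(const T &x) {
        if (_size == 0) _offset = pool.size();
        if (_offset + _size != pool.size())
            throw std::logic_error("collapsed_vec does not end at the end of its pool");
        pool.push_back(x);
        _size++;
    }
};
```

With two such vectors sharing bar_pool, the first one can no longer grow once the second has appended an element, which is the limitation described above.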

Strings constitute a special case: a char_pool is generated for them to prevent many small allocations.

<xs:complexType name="foo">
  <xs:sequence>
    <xs:element name="bar" type="string" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

generates:

extern char_pool_impl char_pool;
struct foo {
    const char * bar;
};

The pools are freed with the utility functions uxsd::clear_pools() and uxsd::clear_strings(). clear_strings() is provided separately, since it can be useful to keep the strings around after freeing the generated structures.
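
To show why a string pool avoids many small allocations, here is a minimal sketch of a chunked character arena. The class name char_pool_sketch, the chunk size, and the add()/clear() methods are hypothetical; the real char_pool_impl may be organised quite differently.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Assumed chunk size for this sketch.
constexpr std::size_t CHUNK_SIZE = 4096;

// Hypothetical string arena: strings are copied back to back into large
// chunks, so one allocation serves many short strings.
class char_pool_sketch {
    std::vector<std::unique_ptr<char[]>> chunks;
    std::size_t cap = 0;   // capacity of the current chunk
    std::size_t used = 0;  // bytes used in the current chunk
public:
    // Copy s (including its terminator) into the pool and return the copy.
    const char *add(const char *s) {
        std::size_t n = std::strlen(s) + 1;
        if (n > cap - used) {
            // Start a new chunk; oversized strings get a chunk of their own.
            cap = n > CHUNK_SIZE ? n : CHUNK_SIZE;
            chunks.emplace_back(new char[cap]);
            used = 0;
        }
        char *dst = chunks.back().get() + used;
        std::memcpy(dst, s, n);
        used += n;
        return dst;
    }

    // Release every chunk at once; all returned pointers become invalid.
    void clear() {
        chunks.clear();
        cap = used = 0;
    }
};
```

Returned pointers stay valid until clear() is called, because existing chunks are never moved or resized.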

3. Data types

You can find the generated types for your schema in the output header file foo_uxsdcxx.h. The rules for mapping XSD types to C++ types are as follows:

<xs:complexType> definitions correspond to C++ structs t_{name}. For complexTypes in global scope, name refers to the name attribute of the type. For complexTypes defined inside elements, name refers to the name attribute of the parent element.
- An <xs:attribute> generates a struct field with a C++ type corresponding to its <xs:simpleType> as defined below.
- A model group such as <xs:choice>, <xs:sequence> or <xs:all> generates struct fields with C++ types corresponding to the types of the elements inside.
  - If an element can occur more than once, a collapsed_vec<T, T_pool> is generated.
  - If an element can occur zero times, another field bool has_T is generated to indicate whether the element is found.
<xs:simpleType> can take many forms.
<xs:union> corresponds to a tagged union type, such as:

struct union_foo {
    type_tag tag;
    union {
        double as_double;
        int as_int;
    };
};

<xs:list> generates a const char *.
Atomic builtins, such as xs:string or xs:int, generate a field of the corresponding C++ type (const char *, int...).
<xs:restriction>s of simple types are not supported, except for the one case where an <xs:string> is restricted to <xs:enumeration> values. C++ enums are generated for such constructs. As an example, the following XSD:

<xs:simpleType name="filler">
  <xs:restriction base="xs:string">
    <xs:enumeration value="FOO"/>
    <xs:enumeration value="BAR"/>
    <xs:enumeration value="BAZ"/>
  </xs:restriction>
</xs:simpleType>

generates a C++ enum:

enum class enum_filler {UXSD_INVALID = 0, FOO, BAR, BAZ};
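
For the tagged union shown earlier in this section, a loader could look roughly like the sketch below. The definition of type_tag, the function name load_union_foo, and the try-int-then-double order are assumptions for illustration; the generated code may decide differently.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Assumed tag values; the README does not show the definition of type_tag.
enum class type_tag {INT, DOUBLE};

struct union_foo {
    type_tag tag;
    union {
        double as_double;
        int as_int;
    };
};

// Hypothetical loader: accept the text as an int if it parses completely
// and fits, otherwise try double, otherwise report an error.
inline union_foo load_union_foo(const char *in) {
    assert(in != nullptr);
    union_foo out;
    char *end = nullptr;

    errno = 0;
    long i = std::strtol(in, &end, 10);
    if (end != in && *end == '\0' && errno == 0 && i >= INT_MIN && i <= INT_MAX) {
        out.tag = type_tag::INT;
        out.as_int = static_cast<int>(i);
        return out;
    }

    errno = 0;
    double d = std::strtod(in, &end);
    if (end != in && *end == '\0' && errno == 0) {
        out.tag = type_tag::DOUBLE;
        out.as_double = d;
        return out;
    }

    throw std::runtime_error("Invalid value `" + std::string(in) + "` for union_foo.");
}
```

Callers then switch on tag before reading as_int or as_double.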

uxsdcxx's Issues

Sort struct declarations in dependency order

Currently, structs are emitted in the order their corresponding complexTypes are found in the .xsd file.

Order them such that the root element occurs last and every complexType which is included in another type occurs before it (phew). This is required to declare structs as direct children of other structs. It also makes the output easier to read: just go up to find a child type.
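
The requested ordering is a depth-first post-order walk starting at the root. The sketch below shows the idea in C++ for illustration only (the generator itself is written in Python); the dep_map type, the function names, and the example type names are hypothetical.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical input: for each type name, the types it contains directly.
using dep_map = std::map<std::string, std::vector<std::string>>;

// Emit every child before its parent (post-order). The `done` set also
// stops the walk from looping forever on recursive types.
static void visit(const std::string &name, const dep_map &deps,
                  std::set<std::string> &done, std::vector<std::string> &out) {
    if (!done.insert(name).second) return;  // already visited
    auto it = deps.find(name);
    if (it != deps.end()) {
        for (const auto &child : it->second) visit(child, deps, done, out);
    }
    out.push_back(name);
}

// Returns the types reachable from root, children first and root last.
std::vector<std::string> dependency_order(const dep_map &deps, const std::string &root) {
    std::set<std::string> done;
    std::vector<std::string> out;
    visit(root, deps, done, out);
    return out;
}
```

For a root containing t_a and t_b, which both contain t_c, this yields t_c, t_a, t_b and then the root. Types not reachable from the root are simply left out in this sketch.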

Include comments in generated output

I think it would be useful to include comments in your generated parser output.

Suggested comments are:

- A header which warns that this file is generated and should not be modified directly.
- Links and information (md5sum etc.) to the xsd the file was generated from.
- Before each structure, a copy of the xsd section which was used to generate the structure.
- Before each parsing function, an example of the XML that is parsed by this function.
- More?

Make lexing ARM-friendly

The current trie lexer makes use of the fast unaligned 32/64 bit access found in amd64 architectures:

inline enum_pin_type lex_pin_type(const char *in){
	unsigned int len = strlen(in);
	switch(len){
	case 4:
		switch(*((triehash_uu32*)&in[0])){
		case onechar('O', 0, 32) | onechar('P', 8, 32) | onechar('E', 16, 32) | onechar('N', 24, 32):
			return enum_pin_type::OPEN;
[...]
}

This might run slower on other architectures. Add a knob to uxsdcxx to generate "flat" tries.
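
A "flat" trie could branch on one character at a time and never read more than a byte per access, so it does not depend on fast unaligned loads. The sketch below is one possible shape, not generator output: only OPEN appears in the snippet above, so the OUTPUT and INPUT values are assumed for illustration.

```cpp
#include <cstring>
#include <stdexcept>
#include <string>

// Assumed enum contents: OPEN is from the issue, OUTPUT and INPUT are
// hypothetical values added to make the trie branch.
enum class enum_pin_type {UXSD_INVALID = 0, OPEN, OUTPUT, INPUT};

// Byte-wise trie: switch on single characters until the candidate is
// unique, then compare the remaining suffix.
inline enum_pin_type lex_pin_type(const char *in) {
    switch (in[0]) {
    case 'I':
        if (std::strcmp(in + 1, "NPUT") == 0) return enum_pin_type::INPUT;
        break;
    case 'O':
        switch (in[1]) {
        case 'P':
            if (std::strcmp(in + 2, "EN") == 0) return enum_pin_type::OPEN;
            break;
        case 'U':
            if (std::strcmp(in + 2, "TPUT") == 0) return enum_pin_type::OUTPUT;
            break;
        }
        break;
    }
    throw std::runtime_error("Found unrecognized enum value " + std::string(in) + " of enum_pin_type.");
}
```

Each nested switch is only entered after the previous character matched, so the function never reads past the terminating NUL.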

Convert comments into docstrings

You have a lot of comments like this:
https://github.com/duck2/uxsdcxx/blob/65c17a2b626e4d2429256db89b5496e2788ffa99/uxsdcxx.py#L456-L458

They should instead be written like this:

# May cause a bug: sets aren't guaranteed to be ordered.
def _gen_state_tables(t: XsdComplexType) -> str:
    """Generate state transition tables, indexed by token enums."""

Generally all functions should have these docstrings.

Make an XML writer

We can read XML, but can't write it. It would be good to generate functions which build an XML tree out of the exposed structs.

Generate sane default values

I don't think our structs are zeroing out properly. Inspect and fix this.

Furthermore, default values are used in some optional attributes. It would be good to address this.

Transfer this repo to SymbiFlow organization

@duck2 - Would you mind transferring this repository to the SymbiFlow organization so that we can do things like merge, update CI, etc.?

sequences accept the wrong inputs

  <xsd:sequence>
     <xsd:element name="isbn" type="isbn"/>
     <xsd:element name="title" type="title"/>
     <xsd:element name="genre" type="genre"/>
     <xsd:element name="author" type="author" maxOccurs="unbounded"/>
   </xsd:sequence>

generates this state machine:

enum class gtok_t_book {ISBN, TITLE, GENRE, AUTHOR};
[...]
int gstate_t_book[NUM_T_BOOK_STATES][NUM_T_BOOK_INPUTS] = {
	{0, -1, -1, -1},
	{0, -1, -1, -1},
	{-1, 1, -1, -1},
	{-1, -1, -1, 2},
	{-1, -1, 3, -1},
};

where initial state is 4 and accept state is 0.
This implies an expected input of "author genre title isbn isbn isbn ..." which is the complete reverse of the described DFA.
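
For comparison, a forward table for the same sequence might look like the sketch below, with initial state 0 and accept state 4. This is only one possible layout, written by hand for illustration; the state numbering and the accepts_t_book helper are assumptions, not what the generator emits.

```cpp
#include <vector>

enum class gtok_t_book {ISBN, TITLE, GENRE, AUTHOR};

constexpr int NUM_T_BOOK_STATES = 5;
constexpr int NUM_T_BOOK_INPUTS = 4;

// Rows are states, columns are tokens in enum order; -1 rejects.
// State 0 is the start, state 4 (at least one author seen) accepts.
const int gstate_t_book[NUM_T_BOOK_STATES][NUM_T_BOOK_INPUTS] = {
    { 1, -1, -1, -1},  // 0: expect isbn
    {-1,  2, -1, -1},  // 1: expect title
    {-1, -1,  3, -1},  // 2: expect genre
    {-1, -1, -1,  4},  // 3: expect first author
    {-1, -1, -1,  4},  // 4: further authors allowed
};

// Hypothetical driver: run the table over a token list.
bool accepts_t_book(const std::vector<gtok_t_book> &input) {
    int state = 0;
    for (gtok_t_book tok : input) {
        state = gstate_t_book[state][static_cast<int>(tok)];
        if (state == -1) return false;
    }
    return state == 4;
}
```

With this table, "isbn title genre author author ..." is accepted and the reversed order is rejected at the first token.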

Figure out if we can use POD data structures

Using Plain Old Data (POD) data structures makes the allocation / deallocation of them significantly faster.

As we are using a generator, it is significantly easier to confirm correctness around POD allocation / deallocation.

Figure out how to give the state values useful names

Magic numbers are bad.

Need to figure out how to name the states in some useful manner.

Cull unused complex types

Currently, if a schema like

  <xsd:complexType name="person">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="born" type="xsd:string"/>
      <xsd:element name="died" type="xsd:string" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>


  <xsd:complexType name="author">
    <xsd:complexContent>
      <xsd:extension base="person">
        <xsd:attribute name="recommends" type="xsd:IDREF"/> <!-- Book -->
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

is given and only author is referred from the root element, we still generate code for both person and author. This is unnecessary.

Replace master with interface-consumer

Well, it has become the de-facto master branch. Any issues with this @litghost?

Finish shaving the yak for <xs:all> elements

This has really turned into a snake story.

What we are seeking is a solution to the state explosion problem when we want to accept N independent inputs, in any order.

The solution is running N parallel state machines.

The problem with the solution is knowing how to run parallel state machines. When generating a single state machine, it is OK to generate something like this:

while(1){
    switch(state){
    case 0:
        if(inp == "sizing"){
            load(current_element);
            state = 1;
        } else {
            error("expected sizing, found %s" % inp);
        }
        break;
    case 1:
        if(inp == "end of input"){
            goto accept;
        } else {
            error("expected end of input, found %s" % inp);
        }
        break;
    }
    if(next_element) inp = next_element.name;
    else inp = "end of input";
}
accept:
    return;

When generating several state machines, we can:

- Make state machines which silently accept inputs unrelated to them.
- Add dispatching code which first checks the input and passes it only to the relevant state machine(s).

We also need to know when to accept the input. For that, we would need to check if each of the parallel state machines ended up in an accepting state.

The code for both is very inelegant. If we make silently accepting state machines, it's something like this:

while(1){
    switch(state1){
    case 0:
        if(inp == "sizing"){
            load(current_element);
            state1 = 1;
        } else if(inp == "connection_block" || inp == "default_fc") {
            /* do nothing */
        } else {
            error("expected sizing or connection_block or default_fc, found %s" % inp);
        }
        break;
    case 1:
        if(inp == "end of input"){
            accept[0] = 1;
        } else if(inp == "connection_block" || inp == "default_fc") {
            /* do nothing */
        } else {
            error("expected end of input or connection_block or default_fc, found %s" % inp);
        }
        break;
    }
    switch(state2){
    case 0:
        if(inp == "connection_block"){
            load(current_element);
            state2 = 1;
        } else if(inp == "sizing" || inp == "default_fc") {
            /* do nothing */
    [...]
    if(all_accepted(accept)) break;
    if(next_element) inp = next_element.name;
    else inp = "end of input";
}

The code for the second would look like this:

while(1){
    if(inp == "sizing" || inp == "end of input"){
        switch(state1){
        case 0:
            if(inp == "sizing"){
                load(current_element);
                state1 = 1;
            } else {
                error("expected sizing, found %s" % inp);
            }
            break;
        case 1:
            if(inp == "end of input"){
                accept[0] = 1;
            } else {
                error("expected end of input, found %s" % inp);
            }
            break;
        }
    } else if(inp == "connection_block" || inp == "end of input") {
        switch(state2){
        case 0:
            if(inp == "connection_block"){
                load(current_element);
                state2 = 1;
            } else {
    [...]
    if(all_accepted(accept)) break;
    if(next_element) inp = next_element.name;
    else inp = "end of input";
}

which are definitely not OK to generate, since the code is too long and it's not easy to see what it's doing. On top of that, error messages and checking for accepted states are not complete. I see no way to elegantly implement them in a while-switch loop either.
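
One alternative to a while-switch loop, offered here only as a hedged sketch and not something this issue proposes, is to skip per-element state machines and track which children have been seen in a bitset. The sketch assumes that all three children named above are required and may occur at most once; the function and array names are hypothetical.

```cpp
#include <bitset>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Child names taken from the examples above; treating all of them as
// required is an assumption of this sketch.
constexpr std::size_t NUM_ALL = 3;
const char *const all_names[NUM_ALL] = {"sizing", "connection_block", "default_fc"};

// Accept the children in any order, rejecting unknown names, duplicates
// and missing elements.
void check_all_group(const std::vector<std::string> &children) {
    std::bitset<NUM_ALL> seen;
    for (const auto &name : children) {
        std::size_t i = 0;
        while (i < NUM_ALL && name != all_names[i]) i++;
        if (i == NUM_ALL) throw std::runtime_error("unexpected element " + name);
        if (seen[i]) throw std::runtime_error("duplicate element " + name);
        seen.set(i);
    }
    if (!seen.all()) throw std::runtime_error("missing required element");
}
```

The state is N bits instead of N state variables, and the acceptance check is a single test on the bitset. Optional children could be handled by comparing against a mask of required bits instead of calling all().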

Support using libxml SAX mode as the XML parser

As you showed previously, using a SAX parser has significant advantages in memory usage. Now that we are auto-generating the parser, it would be good to have the option between pugixml (for the highest speed) and a SAX parser for the lowest memory usage.

Unstable output order

The code generator generates functions in an order that depends on Python's hash order, which is unstable. This causes unneeded diffs when re-running the generator.

In general, function output order should be constant when possible.

Make uxsdcxx.py pep8 clean

Make all your Python code "pep8 clean". Running pep8 on your code should produce no errors.

https://pep8.org/
Black is an auto formatter that might be useful - https://black.readthedocs.io/en/stable/
yapf is another auto formatter that is an alternative - https://github.com/google/yapf

Write a useful README

Add support for reading / writing compressed files

See https://github.com/mithro/duck2-gsoc/issues/16

Add license checking github action

https://github.com/SymbiFlow/actions/tree/main/checks

See if PugiXML's XPath support can be used to implement identity constraints

PugiXML supports a subset of XPath 1.0.

XSD 1.0 has very useful identity constraint tags which work with XPath 1.0. Check whether they are compatible and, if so, cobble together at least some form of support for <xs:key>, <xs:keyref> and <xs:unique>.

Use sorted(set) when you need ordered output

https://github.com/duck2/uxsdcxx/blob/65c17a2b626e4d2429256db89b5496e2788ffa99/uxsdcxx.py#L457

https://github.com/duck2/uxsdcxx/blob/65c17a2b626e4d2429256db89b5496e2788ffa99/uxsdcxx.py#L456-L467

The Python code you want is:

for i, dfa in enumerate(sorted(t.group_dfas)):
    pass

Add memory freeing functions

Currently, the generated code allocates arenas for elements and strdups some strings, but there is no function to free the memory. Add a freeing function so that people can release the generated structures.

Support true/false for boolean values

Currently, boolean values are read in using std::strtol. Write a runtime function to accept true or false.
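
A runtime function for this could be as small as the sketch below. XSD's xs:boolean allows the literals true, false, 1 and 0, so all four are accepted; the function name load_bool and the wording of the error message are assumptions.

```cpp
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical runtime helper: parse an xs:boolean literal.
inline bool load_bool(const char *in) {
    if (std::strcmp(in, "true") == 0 || std::strcmp(in, "1") == 0) return true;
    if (std::strcmp(in, "false") == 0 || std::strcmp(in, "0") == 0) return false;
    throw std::runtime_error("Invalid value `" + std::string(in) + "` for type bool.");
}
```

Unlike a plain std::strtol call, this rejects inputs such as "2" or "yes" instead of silently converting them.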

Create a boost PropertyTree generator

boost is a popular C++ library and has a sub-library called PropertyTrees. It would be good to have a generator which supports this system.

See the following links for more information,

Optimize the strcmp chains in lex functions

Currently you use the following structure a lot:

enum_switch_type lex_switch_type(const char *in){
	if(strcmp(in, "mux") == 0){
		return enum_switch_type::MUX;
	}
	else if(strcmp(in, "tristate") == 0){
		return enum_switch_type::TRISTATE;
	}
	else if(strcmp(in, "pass_gate") == 0){
		return enum_switch_type::PASS_GATE;
	}
	else if(strcmp(in, "short") == 0){
		return enum_switch_type::SHORT;
	}
	else if(strcmp(in, "buffer") == 0){
		return enum_switch_type::BUFFER;
	}
	else throw std::runtime_error("Found unrecognized enum value " + std::string(in) + " of enum_switch_type.");
}

Do some profiling and figure out if there is a better way to do this. strcmp might actually turn out to be pretty fast here...

Doing this shouldn't change the external interface, so it is something we can leave until later.
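
One candidate worth profiling against the strcmp chain is to switch on the string length first, since the five values above all have different lengths, and then do a single comparison. This is a sketch of that idea under the assumption that the enum also carries the UXSD_INVALID member; whether it is actually faster is exactly what the profiling would need to show.

```cpp
#include <cstring>
#include <stdexcept>
#include <string>

enum class enum_switch_type {UXSD_INVALID = 0, MUX, TRISTATE, PASS_GATE, SHORT, BUFFER};

// Length-first dispatch: at most one memcmp per call, because every
// candidate string has a distinct length.
inline enum_switch_type lex_switch_type(const char *in) {
    switch (std::strlen(in)) {
    case 3:
        if (std::memcmp(in, "mux", 3) == 0) return enum_switch_type::MUX;
        break;
    case 5:
        if (std::memcmp(in, "short", 5) == 0) return enum_switch_type::SHORT;
        break;
    case 6:
        if (std::memcmp(in, "buffer", 6) == 0) return enum_switch_type::BUFFER;
        break;
    case 8:
        if (std::memcmp(in, "tristate", 8) == 0) return enum_switch_type::TRISTATE;
        break;
    case 9:
        if (std::memcmp(in, "pass_gate", 9) == 0) return enum_switch_type::PASS_GATE;
        break;
    }
    throw std::runtime_error("Found unrecognized enum value " + std::string(in) + " of enum_switch_type.");
}
```

The signature matches the existing function, so swapping implementations would not change the external interface.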

Make generated code match a style guide

It would be good if the generated code matched an existing style guide. I have a lot of experience with the Google C++ style guide at https://google.github.io/styleguide/cppguide.html. As you are using exceptions, which the Google C++ style guide forbids, maybe that isn't a good match.

Since this code is supposed to end up in VtR, generating something which matches the clang-format at https://github.com/SymbiFlow/vtr-verilog-to-routing/blob/master%2Bwip/.clang-format might be a good idea too.

Look at whether a generated regex parser can do a better job than pugixml (performance-wise)

pugixml is a generic XML parser. However, now that we are generating a parser, we should see if we can do better by creating a parser that only parses files in the exact given XML formats. Using something like Google's re2 would be a good option for that.

Generate a parser from current rr_graph.xsd

To generate a parser from the current rr_graph.xsd, we need:

Support for <xs:attribute>
Support for loading enumerations in the form of <xs:restriction>s of <xs:string>.

Provide a default driver

Provide a default driver, preferably something which reads a file into memory and prints all fields of all structs. I think this is a nice utility to have.

Generate a parser from current arch.xsd

To generate a parser from the current arch.xml schema, we need:

Support for <xs:attribute>
Support for <xs:enumeration>
Support for <xs:all>
Support for loading <xs:union>s

Create a Cap'n Proto schema generator

From the XML schema, generate a Cap'n Proto schema.

Load xml into capnp

We can generate Cap'n Proto schemas and load XML files, but we cannot load XML into the Cap'n Proto generated structures. Implementing it would be really helpful since a single interface would be enough for handling both Cap'n Proto and XML files.

Preferably, from a foo.xsd file, we generate a foo_uxsdcap.capnp, foo_uxsdcap.h and foo_uxsdcap.cpp.

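#include <exception>
#include <iostream>
#include "foo_uxsdcap.capnp.h"
#include "foo_uxsdcap.h"

int main(){
    try {
        /* Proposed interface: parse XML from stdin into a Cap'n Proto reader. */
        ucap::Foo::Reader foo = ucap::get_root_xml(std::cin);
        /* Write the same data back out as XML. */
        ucap::put_root_xml(foo, std::cout);
        /* std::cout << foo? */
    } catch (const std::exception &e) {
        std::cout << "PugiXML parse failure\n";
    }
}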

Add support for xi:includes

People might want to break their XML files into independent pieces, so add minimal support (local files only) for XML Inclusions.