symbiflow / uxsdcxx

generate C++ reader/writer from XSD schema

License: Apache License 2.0

Languages: Python 93.00%, C++ 6.53%, Makefile 0.47%
Topics: xsd, pugixml, xml, xsd-schema, xml-schema, xml-serialization, symbiflow

uxsdcxx's Introduction

uxsdcxx

Disclaimer: Pre-1.0.0 software. Support for anything might break.

uxsdcxx is a tool that generates a PugiXML-based C++ reader, validator and writer from an XSD schema. It can generate code for a subset of XSD 1.0.

It currently supports:

  • Simple types, with the following exceptions:
    • xs:lists are just read into a string.
    • Only enumerations are supported as xs:restrictions of simple types.
    • Restricted string types such as IDREF, NCName etc. aren't validated.
  • Complex types.
  • Model groups (all, sequence and choice).
  • Elements.
  • Attributes except xs:anyAttribute.
    • Default values are supported.
    • When writing, non-zero default values are always written out.

It currently does not support:

  • Anything that PugiXML can't read:
    • XML namespaces

Getting started

Install with pip install uxsdcxx, then run uxsdcxx.py foo.xsd. Two files, foo_uxsdcxx.h and foo_uxsdcxx.cpp, will be created.

API

All uxsdcxx functions live in a namespace uxsd.

1. Root element class

For every root element in the schema, a class is generated with load and write functions. For instance,

<xs:element name="foo">
  <xs:complexType>
  ...
  </xs:complexType>
</xs:element>

results in this C++ code:

class foo : public t_foo {
public:
    pugi::xml_parse_result load(std::istream &is);
    void write(std::ostream &os);
};

load() loads from an input stream into this root element's structs and write() writes its content to a given output stream.

Note that root elements with simple types are not supported.

2. Pools

uxsdcxx generates global pools to store multiply-occurring types.

<xs:complexType name="foo">
  <xs:sequence>
    <xs:element name="bar" type="bar" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

generates:

extern std::vector<t_bar> bar_pool;
[...]
struct t_foo {
    collapsed_vec<t_bar, bar_pool> bars;
};

A collapsed_vec is a size and an offset pointing into a pool. It provides contiguous memory while being able to store an unbounded number of elements. The main limitation of a collapsed_vec is that it's insertable only when its end points to the end of the pool.
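
To make the size-plus-offset idea concrete, here is a minimal sketch of what such a container could look like. It is not the code uxsdcxx generates: the member names and the push_back precondition check are assumptions made for illustration, and t_bar is a made-up element type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a view of `size_` elements starting at `offset_`
// inside a shared pool. The pool is passed as a template argument.
template <class T, std::vector<T> &pool>
class collapsed_vec {
    std::size_t size_ = 0;
    std::size_t offset_ = 0;
public:
    std::size_t size() const { return size_; }
    T &operator[](std::size_t i) {
        assert(i < size_);
        return pool[offset_ + i];
    }
    // Appending is only valid while this vec's end is the end of the pool.
    void push_back(const T &x) {
        if (size_ == 0) offset_ = pool.size();
        assert(offset_ + size_ == pool.size());
        pool.push_back(x);
        ++size_;
    }
};

struct t_bar { int value; };   // made-up element type
std::vector<t_bar> bar_pool;   // one global pool per multiply-occurring type
```

Two collapsed_vec<t_bar, bar_pool> objects filled one after the other end up as adjacent ranges of bar_pool. Once the second one has started, the first can no longer grow, which is the limitation described above.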

Strings constitute a special case: a char_pool is generated for them to prevent many small allocations.

<xs:complexType name="foo">
  <xs:sequence>
    <xs:element name="bar" type="string" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

generates:

extern char_pool_impl char_pool;
struct t_foo {
    const char * bar;
};

The pools are freed with the utility functions uxsd::clear_pools() and uxsd::clear_strings(). clear_strings() is provided separately because it can be useful to keep the strings around after freeing the generated structures.
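
As an illustration of why a dedicated string pool avoids many small allocations, the sketch below copies strings into large chunks and hands out stable const char * pointers. It is an assumption about how such a pool might work, not the actual char_pool_impl; the names string_arena, kChunkSize, add, clear and chunk_count are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

constexpr std::size_t kChunkSize = 4096;  // arbitrary chunk size for the sketch

// Hypothetical chunked arena for NUL-terminated strings.
class string_arena {
    std::vector<std::unique_ptr<char[]>> chunks_;
    std::size_t used_ = kChunkSize;  // "full", so the first add() allocates
public:
    // Copy `s` into the arena; the pointer stays valid until clear().
    const char *add(const char *s) {
        std::size_t n = std::strlen(s) + 1;  // include the terminating NUL
        if (n > kChunkSize - used_) {
            // Start a new chunk; oversized strings get a chunk of their own.
            std::size_t cap = n > kChunkSize ? n : kChunkSize;
            chunks_.push_back(std::unique_ptr<char[]>(new char[cap]));
            used_ = 0;
        }
        char *dst = chunks_.back().get() + used_;
        std::memcpy(dst, s, n);
        used_ = n > kChunkSize ? kChunkSize : used_ + n;
        return dst;
    }
    // Release every chunk at once, invalidating all returned pointers.
    void clear() {
        chunks_.clear();
        used_ = kChunkSize;
    }
    std::size_t chunk_count() const { return chunks_.size(); }
};
```

Keeping the string arena separate from the struct pools is what makes it possible to free one and keep the other.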

3. Data types

You can find the generated types for your schema in the output header file foo_uxsdcxx.h. XSD types map to C++ types as follows:

  • <xs:complexType> definitions correspond to C++ structs t_{name}. For complexTypes in global scope, name refers to the name attribute of the type. For complexTypes defined inside elements, name refers to the name attribute of the parent element.

    • An <xs:attribute> generates a struct field with a C++ type corresponding to its <xs:simpleType> as defined below.
    • A model group such as <xs:choice>, <xs:sequence> or <xs:all> generates struct fields with C++ types corresponding to the types of the elements inside.
      • If an element can occur more than once, a collapsed_vec<T, T_pool> is generated.
      • If an element can occur zero times, another field bool has_T is generated to indicate whether the element is found.
  • <xs:simpleType> can take many forms:

    • <xs:union> corresponds to a tagged union type, such as:

struct union_foo {
    type_tag tag;
    union {
        double as_double;
        int as_int;
    };
};
    • <xs:list> generates a const char *.
    • Atomic builtins, such as xs:string or xs:int, generate a field of the corresponding C++ type (const char *, int, ...).
    • <xs:restriction>s of simple types are not supported, except for the one case where xs:string is restricted to <xs:enumeration> values. C++ enums are generated for such constructs. As an example, the following XSD:
<xs:simpleType name="filler">
  <xs:restriction base="xs:string">
    <xs:enumeration value="FOO"/>
    <xs:enumeration value="BAR"/>
    <xs:enumeration value="BAZ"/>
  </xs:restriction>
</xs:simpleType>

generates a C++ enum:

enum class enum_filler {UXSD_INVALID = 0, FOO, BAR, BAZ};
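
Returning to the tagged union generated for <xs:union>: the sketch below shows how such a struct might be filled and read back. The tag names in type_tag and both helper functions are hypothetical, since the README does not show them, and trying the int member before the double member is an assumption rather than documented behaviour.

```cpp
#include <cassert>
#include <cstdlib>

enum class type_tag { DOUBLE, INT };  // hypothetical tag names

struct union_foo {
    type_tag tag;
    union {
        double as_double;
        int as_int;
    };
};

// Hypothetical loader: try the int member first, then fall back to double.
inline union_foo load_union_foo(const char *in) {
    union_foo out;
    char *end = nullptr;
    long i = std::strtol(in, &end, 10);
    if (end != in && *end == '\0') {
        out.tag = type_tag::INT;
        out.as_int = static_cast<int>(i);
    } else {
        out.tag = type_tag::DOUBLE;
        out.as_double = std::strtod(in, nullptr);
    }
    return out;
}

// Readers must check the tag before touching a union member.
inline double as_number(const union_foo &u) {
    switch (u.tag) {
    case type_tag::INT: return u.as_int;
    case type_tag::DOUBLE: return u.as_double;
    }
    return 0.0;
}
```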

uxsdcxx's People

Contributors

duck2, kmurray, litghost

Forkers

litghost kmurray

uxsdcxx's Issues

Sort struct declarations in dependency order

Currently, structs are emitted in the order their corresponding complexTypes are found in the .xsd file.

Order them such that the root element occurs last and every complexType which is included in another type occurs before it (phew). This is required to declare structs as direct children of other structs. It also helps when reading the output: to find a child type, just scroll up.
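
The requested ordering is a post-order traversal of the "contains" graph. The sketch below shows one way to compute it, written in C++ for consistency with the generated code even though the generator itself is Python. The dep_graph shape and the function names are assumptions for illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// deps[t] lists the types that struct t contains directly (hypothetical shape).
using dep_graph = std::map<std::string, std::vector<std::string>>;

static void visit(const dep_graph &deps, const std::string &t,
                  std::set<std::string> &done, std::vector<std::string> &order) {
    // Skip types already handled; this also stops recursion on cyclic input.
    if (!done.insert(t).second) return;
    auto it = deps.find(t);
    if (it != deps.end())
        for (const auto &child : it->second) visit(deps, child, done, order);
    order.push_back(t);  // emit a type only after everything it contains
}

// Returns an emission order in which children precede parents and the
// root comes last.
std::vector<std::string> emit_order(const dep_graph &deps, const std::string &root) {
    std::set<std::string> done;
    std::vector<std::string> order;
    for (const auto &entry : deps)
        if (entry.first != root) visit(deps, entry.first, done, order);
    visit(deps, root, done, order);
    return order;
}
```

Iterating a std::map also makes the result deterministic for a given schema.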

Include comments in generated output

I think it would be useful to include comments in your generated parser output.

Suggested comments are:

  • Header which warns this file is generated and should not be modified directly.
  • Links and information (md5sum etc) to the xsd the file was generated from.
  • Before each structure, a copy of the xsd section which was used to generate the structure.
  • Before each parsing function, an example of the XML that is parsed by this function.
  • More?

Make lexing ARM-friendly

The current trie lexer makes use of the fast unaligned 32/64 bit access found in amd64 architectures:

inline enum_pin_type lex_pin_type(const char *in){
	unsigned int len = strlen(in);
	switch(len){
	case 4:
		switch(*((triehash_uu32*)&in[0])){
		case onechar('O', 0, 32) | onechar('P', 8, 32) | onechar('E', 16, 32) | onechar('N', 24, 32):
			return enum_pin_type::OPEN;
[...]
}

This might run slower on other architectures. Add a knob to uxsdcxx to generate "flat" tries.
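
One reading of a "flat" trie is a lexer that branches on one character at a time and never issues a multi-byte load. The sketch below illustrates that idea only. Apart from OPEN, the enum values are made up, and the exact shape the generator would emit is an assumption.

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>
#include <string>

// Only OPEN appears in the issue; OUTPUT and INPUT are made-up values.
enum class enum_pin_type { UXSD_INVALID = 0, OPEN, OUTPUT, INPUT };

// Branch on single characters until one candidate remains, then compare
// the rest of the string. No unaligned 32/64-bit reads are involved.
inline enum_pin_type lex_pin_type(const char *in) {
    switch (in[0]) {
    case 'O':
        switch (in[1]) {
        case 'P':
            if (std::strcmp(in + 2, "EN") == 0) return enum_pin_type::OPEN;
            break;
        case 'U':
            if (std::strcmp(in + 2, "TPUT") == 0) return enum_pin_type::OUTPUT;
            break;
        }
        break;
    case 'I':
        if (std::strcmp(in + 1, "NPUT") == 0) return enum_pin_type::INPUT;
        break;
    }
    throw std::runtime_error("Found unrecognized enum value " + std::string(in) +
                             " of enum_pin_type.");
}
```

Whether this is actually faster or slower than the word-sized comparison on a given ARM core would need to be measured.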

Make an XML writer

We can read XML, but can't write it. It would be good to generate functions which build an XML tree out of the exposed structs.

Generate sane default values

I don't think our structs are zeroing out properly. Inspect and fix this.

Furthermore, default values are used in some optional attributes. It would be good to address this.

sequences accept the wrong inputs

  <xsd:sequence>
     <xsd:element name="isbn" type="isbn"/>
     <xsd:element name="title" type="title"/>
     <xsd:element name="genre" type="genre"/>
     <xsd:element name="author" type="author" maxOccurs="unbounded"/>
   </xsd:sequence>

generates this state machine:

enum class gtok_t_book {ISBN, TITLE, GENRE, AUTHOR};
[...]
int gstate_t_book[NUM_T_BOOK_STATES][NUM_T_BOOK_INPUTS] = {
	{0, -1, -1, -1},
	{0, -1, -1, -1},
	{-1, 1, -1, -1},
	{-1, -1, -1, 2},
	{-1, -1, 3, -1},
};

where the initial state is 4 and the accept state is 0.
This implies an expected input of "author genre title isbn isbn isbn ..." which is the complete reverse of the described DFA.
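
A quick way to check which inputs the table accepts is to run it. The sketch below copies the table exactly as printed above and walks it from state 4. The accepts() helper and the two array-size constants are additions for illustration.

```cpp
#include <cassert>
#include <vector>

enum class gtok_t_book { ISBN, TITLE, GENRE, AUTHOR };

constexpr int NUM_T_BOOK_STATES = 5;
constexpr int NUM_T_BOOK_INPUTS = 4;
const int gstate_t_book[NUM_T_BOOK_STATES][NUM_T_BOOK_INPUTS] = {
    {0, -1, -1, -1},
    {0, -1, -1, -1},
    {-1, 1, -1, -1},
    {-1, -1, -1, 2},
    {-1, -1, 3, -1},
};

// Run the table from the initial state 4; accept only if we end in state 0.
inline bool accepts(const std::vector<gtok_t_book> &input) {
    int state = 4;
    for (gtok_t_book tok : input) {
        state = gstate_t_book[state][static_cast<int>(tok)];
        if (state < 0) return false;  // -1 marks a missing transition
    }
    return state == 0;
}
```

With the table as printed, the schema order (isbn, title, genre, author) is rejected on its first token, while an input that ends in repeated isbn elements, such as genre, author, title, isbn, isbn, is accepted.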

Figure out if we can use POD data structures

Using Plain Old Data (POD) data structures makes the allocation / deallocation of them significantly faster.

As we are using a generator, it is significantly easier to confirm correctness around POD allocation / deallocation.
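
One way a generator could confirm this is to emit compile-time checks next to every struct. The sketch below assumes a generated struct for the person type used elsewhere on this page. The field layout and the copy_person helper are hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical generated struct for a "person" complexType.
struct t_person {
    const char *name;
    const char *born;
    bool has_died;
    const char *died;
};

// These fail the build if a later generator change adds a constructor,
// destructor, or non-trivial member to the struct.
static_assert(std::is_trivial<t_person>::value, "t_person must be trivial");
static_assert(std::is_standard_layout<t_person>::value,
              "t_person must be standard layout");

// Trivially copyable structs can be moved around in a pool with memcpy.
inline t_person copy_person(const t_person &src) {
    t_person dst;
    std::memcpy(&dst, &src, sizeof dst);
    return dst;
}
```

Note that a struct holding a std::string or std::vector member would fail the first assertion, so the string and element pools described in the README fit this approach.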

Cull unused complex types

Currently, if a schema like

  <xsd:complexType name="person">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="born" type="xsd:string"/>
      <xsd:element name="died" type="xsd:string" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>


  <xsd:complexType name="author">
    <xsd:complexContent>
      <xsd:extension base="person">
        <xsd:attribute name="recommends" type="xsd:IDREF"/> <!-- Book -->
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

is given and only author is referred from the root element, we still generate code for both person and author. This is unnecessary.
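
Culling amounts to a reachability pass from the root element. The sketch below, in C++ for consistency with the rest of the page, assumes a reference graph that records only element and attribute type references. It also assumes that an extension base such as person is inlined into the derived type and therefore does not count as a reference. Both are assumptions about the generator, not statements about its current behaviour.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// refs[t] lists the types used by elements and attributes of t
// (hypothetical shape; extension bases are assumed to be inlined).
using type_refs = std::map<std::string, std::vector<std::string>>;

// Depth-first walk from the root; anything not returned can be culled.
std::set<std::string> reachable_types(const type_refs &refs, const std::string &root) {
    std::set<std::string> seen;
    std::vector<std::string> stack{root};
    while (!stack.empty()) {
        std::string t = stack.back();
        stack.pop_back();
        if (!seen.insert(t).second) continue;  // already visited
        auto it = refs.find(t);
        if (it == refs.end()) continue;
        for (const auto &next : it->second) stack.push_back(next);
    }
    return seen;
}
```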

Finish shaving the yak for <xs:all> elements

This has really turned into a snake story.

What we are seeking is a solution to the state explosion problem when we want to accept N independent inputs, in any order.

The solution is running N parallel state machines:
[Figure: fsm]

The problem with the solution is knowing how to run parallel state machines. When generating a single state machine, it is OK to generate something like this:

while(1){
    switch(state){
    case 0:
        if(inp == "sizing"){
            load(current_element);
            state = 1;
        } else {
            error("expected sizing, found %s" % inp);
        }
        break;
    case 1:
        if(inp == "end of input"){
            goto accept;
        } else {
            error("expected end of input, found %s" % inp);
        }
        break;
    }
    if(next_element) inp = next_element.name;
    else inp = "end of input";
}
accept:
    return;

When generating several state machines, we can:

  1. Make state machines which silently accept inputs unrelated to them:

[Figure: modifiedfsm]

  2. Add dispatching code which first checks the input and passes it only to the relevant state machine(s).

We also need to know when to accept the input. For that, we would need to check if each of the parallel state machines ended up in an accepting state.

The code for both is very inelegant. If we make silently accepting state machines, it's something like this:

while(1){
    switch(state1){
    case 0:
        if(inp == "sizing"){
            load(current_element);
            state1 = 1;
        } else if(inp == "connection_block" || inp == "default_fc") {
            /* do nothing */
        } else {
            error("expected sizing or connection_block or default_fc, found %s" % inp);
        }
        break;
    case 1:
        if(inp == "end of input"){
            accept[0] = 1;
        } else if(inp == "connection_block" || inp == "default_fc") {
            /* do nothing */
        } else {
            error("expected end of input or connection_block or default_fc, found %s" % inp);
        }
        break;
    }
    switch(state2){
    case 0:
        if(inp == "connection_block"){
            load(current_element);
            state2 = 1;
        } else if(inp == "sizing" || inp == "default_fc") {
            /* do nothing */
    [...]
    if(all_accepted(accept)) break;
    if(next_element) inp = next_element.name;
    else inp = "end of input";
}

The code for the second would look like this:

while(1){
    if(inp == "sizing" || inp == "end of input"){
        switch(state1){
        case 0:
            if(inp == "sizing"){
                load(current_element);
                state1 = 1;
            } else {
                error("expected sizing, found %s" % inp);
            }
            break;
        case 1:
            if(inp == "end of input"){
                accept[0] = 1;
            } else {
                error("expected end of input, found %s" % inp);
            }
        break;
        }
    } else if(inp == "connection_block" || inp == "end of input") {
        switch(state2){
        case 0:
            if(inp == "connection_block"){
                load(current_element);
                state2 = 1;
            } else {
    [...]
    if(all_accepted(accept)) break;
    if(next_element) inp = next_element.name;
    else inp = "end of input";
}

which are definitely not OK to generate, since the code is too long and it's not easy to see what it's doing. On top of that, error messages and checking for accepted states are not complete. I see no way to elegantly implement them in a while-switch loop either.
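
For comparison, here is a table-driven sketch of the dispatching variant. It assumes every child of the <xs:all> group occurs at most once, so each parallel machine has only two states (not seen, seen) and collapses to one flag. The element names come from the example above. The required flags, the all_member struct and check_all() are assumptions, and this is only one possible shape, not a settled design.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

struct all_member { const char *name; bool required; };

// Names are from the issue's example; the `required` flags are assumptions.
const all_member members[] = {
    {"sizing", true}, {"connection_block", true}, {"default_fc", false},
};
constexpr int NUM_MEMBERS = sizeof(members) / sizeof(members[0]);

// seen[i] is the state of the i-th two-state machine.
inline void check_all(const std::vector<std::string> &input) {
    bool seen[NUM_MEMBERS] = {};
    for (const std::string &inp : input) {
        int i = 0;
        while (i < NUM_MEMBERS && inp != members[i].name) ++i;  // dispatch
        if (i == NUM_MEMBERS) throw std::runtime_error("unexpected element " + inp);
        if (seen[i]) throw std::runtime_error("duplicate element " + inp);
        seen[i] = true;  // a real loader would load the current element here
    }
    // End of input: every required machine must be in its accepting state.
    for (int i = 0; i < NUM_MEMBERS; ++i)
        if (members[i].required && !seen[i])
            throw std::runtime_error(std::string("missing element ") + members[i].name);
}
```

The generated code then grows with the number of members only in the table, and the error messages name the offending element directly.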

Support using libxml SAX mode as the XML parser

As you showed previously, using a SAX parser has significant advantages in memory usage. Now that we are auto-generating the parser, it would be good to have the option between pugixml (for the highest speed) and a SAX parser for the lowest memory usage.

Unstable output order

The code generator generates functions in an order that depends on Python's hash order, which is unstable. This causes unneeded diffs when re-running the generator.

In general, function output order should be constant when possible.

Add memory freeing functions

Currently, the generated code allocates arenas for elements and strdups some strings, but there is no function to free the memory. Add a freeing function so that people can release the generated structures.

Optimize the strcmp chains in lex functions

Currently you use the following structure a lot:

enum_switch_type lex_switch_type(const char *in){
	if(strcmp(in, "mux") == 0){
		return enum_switch_type::MUX;
	}
	else if(strcmp(in, "tristate") == 0){
		return enum_switch_type::TRISTATE;
	}
	else if(strcmp(in, "pass_gate") == 0){
		return enum_switch_type::PASS_GATE;
	}
	else if(strcmp(in, "short") == 0){
		return enum_switch_type::SHORT;
	}
	else if(strcmp(in, "buffer") == 0){
		return enum_switch_type::BUFFER;
	}
	else throw std::runtime_error("Found unrecognized enum value " + std::string(in) + " of enum_switch_type.");
}

Do some profiling and figure out if there is a better way to do this. strcmp might actually turn out to be pretty fast here...

Doing this shouldn't change the external interface, so it is something we can leave until later.
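
As one candidate to profile against the strcmp chain, the sketch below compares the length first and only calls memcmp on entries of matching length. The table layout and the switch_type_entry name are assumptions. No claim is made that this is faster, which is exactly what the profiling would need to show.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

enum class enum_switch_type { UXSD_INVALID = 0, MUX, TRISTATE, PASS_GATE, SHORT, BUFFER };

struct switch_type_entry {
    const char *name;
    std::size_t len;
    enum_switch_type value;
};

// Hypothetical generated table: one row per enumeration value.
const switch_type_entry switch_type_table[] = {
    {"mux", 3, enum_switch_type::MUX},
    {"tristate", 8, enum_switch_type::TRISTATE},
    {"pass_gate", 9, enum_switch_type::PASS_GATE},
    {"short", 5, enum_switch_type::SHORT},
    {"buffer", 6, enum_switch_type::BUFFER},
};

// Same signature as the strcmp version, so callers are unaffected.
inline enum_switch_type lex_switch_type(const char *in) {
    std::size_t len = std::strlen(in);
    for (const auto &e : switch_type_table)
        if (e.len == len && std::memcmp(in, e.name, len) == 0) return e.value;
    throw std::runtime_error("Found unrecognized enum value " + std::string(in) +
                             " of enum_switch_type.");
}
```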

Make generated code match a style guide

It would be good if the generated code matched an existing style guide. I have a lot of experience with the Google C++ style guide at https://google.github.io/styleguide/cppguide.html. As you are using exceptions, which the Google C++ style guide forbids, maybe that isn't a good match.

Since this code is supposed to end up in VtR, generating something which matches the clang-format at https://github.com/SymbiFlow/vtr-verilog-to-routing/blob/master%2Bwip/.clang-format might be a good idea too.

Provide a default driver

Provide a default driver, preferably something which reads a file into memory and prints all fields of all structs. I think this is a nice utility to have.

Generate a parser from current arch.xsd

To generate a parser from the current arch.xml schema, we need:

  • Support for <xs:attribute>
  • Support for <xs:enumeration>
  • Support for <xs:all>
  • Support for loading <xs:union>s

Load xml into capnp

We can generate Cap'n Proto schemas and load XML files, but we cannot load XML into the Cap'n Proto generated structures. Implementing this would be really helpful, since a single interface would then be enough to handle both Cap'n Proto and XML files.

Preferably, from a foo.xsd file, we generate a foo_uxsdcap.capnp, foo_uxsdcap.h and foo_uxsdcap.cpp.

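#include <exception>
#include <iostream>
#include "foo_uxsdcap.capnp.h"
#include "foo_uxsdcap.h"

int main(){
    try {
        /* Proposed API: parse XML from stdin into the Cap'n Proto structures. */
        ucap::Foo::Reader& foo = ucap::get_root_xml(std::cin);
        /* foo is only in scope here, so write it back out inside the try block. */
        ucap::put_root_xml(foo, std::cout);
        /* std::cout << foo? */
    } catch (const std::exception& e) {
        std::cout << "PugiXML parse failure: " << e.what() << "\n";
    }
}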
