tl;dr: requesting a tool that can give an authoritative answer to the question "does this plan conform to the Substrait spec?"
I've been studying the Substrait specification and protobuf files for a few days now, with the purpose of contributing more extensively to the consumer for Arrow's query engine. While I think I understand the basic ideas for most of it, I'm finding that the specification is not very precise about some of the nitty-gritty details. For example, is the `type_variation_reference` field of the various builtin type messages mandatory? If yes, what's the type variation of, for example, a literal (which doesn't define a reference)? Based on that I figure the answer is "no," but that means that `type_variation_reference == 0` (or possibly some other reserved value?) must mean "no variation," as protobuf just substitutes a default value for primitive types when a field isn't specified. Does that then also mean that anchor 0 is reserved for other such links? And so on. The most annoying part of these types of questions is that I don't know if I just glossed over some part of the spec. I don't like asking pedantic questions when I'm not 100% sure I'm not the problem myself.
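To make the ambiguity concrete, here is a small plain-Python stand-in for how a reader sees such a field. It does not use the real protobuf library. It assumes the field is a plain proto3 integer without explicit presence tracking, so a missing field reads back as 0. The `decode_type` helper and the dict layout are made up for illustration.

```python
def decode_type(wire_fields):
    """Mimic a proto3 reader: an absent plain integer field reads back as 0.

    `wire_fields` is a hypothetical dict of the fields that were actually
    present in the serialized message.
    """
    return {
        "type_variation_reference": wire_fields.get("type_variation_reference", 0)
    }


# The producer never set the field at all.
unset = decode_type({})

# The producer deliberately pointed at an extension declared with anchor 0.
explicit_zero = decode_type({"type_variation_reference": 0})

# The consumer cannot tell these two cases apart.
indistinguishable = unset == explicit_zero
print(indistinguishable)
```

Under that assumption, the only way for "no variation" to be expressible is for the spec to reserve 0 (or some other value), which is exactly the kind of detail I'd like a tool to settle.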
The core issue here, in my opinion, is that there is no objective, authoritative way to determine, for a given Substrait plan, A) whether it is valid and B) what it means. The only way to do that right now is to interpret a plan by hand, which, well... is open to interpretation. Even if a spec is completely precise about everything, it's still very easy for a human to make mistakes here. If, however, there were a tool that could do that, I could just feed it some corner cases to resolve the above question, or look through its source code.
What's worse, though, is how easy it is to make a plan that looks sensible and that protobuf gives a big thumbs-up to, but that actually isn't valid at all according to the spec. For example, protobuf won't complain when you make a `ReadRel` with no schema, as for instance the folks over at https://github.com/duckdblabs/duckdb-substrait-demo seem to have done (for now, anyway). Now, that one pretty obviously contradicts the "required" tag in the spec and they seem to be aware of it. My point is that relying solely on protobuf's validation makes it very easy to come up with some interpretation of what the Substrait protobuf messages mean that makes sense to you from your frame of reference, to the point where everything seems to work, as DuckDB has already done for the majority of TPC-H... until you try to use Substrait for what it's built for by connecting to some other engine/parser, and find out that nothing actually works together. Or worse (IMO), you find that it only works for a subset of queries, and/or only fails sometimes or for some versions of either tool due to undefined behavior, and it's not obvious why or whose fault it is. An issue like that could easily devolve into a finger-pointing contest between projects, especially if a third party finds the problem.
So, my suggestion is to make a Substrait consumer that tells you with authority whether a plan is:
- valid, ideally with some human-friendly representation of what it does (I have had a few ideas for this, but none of them strike me as particularly good; even if the tool just gives a thumbs up it's already very useful though);
- invalid, and why; or
- possibly valid, when validity depends on unknown context (like YAML files that aren't accessible by the validator), or, at least initially, when some check isn't implemented in the validator yet.
Note that when I say "with authority," I mean that once things stabilize, any disagreement between the tool and the spec for a given version should be resolved by patching/adding errata to the spec. If the tool doesn't provide the definitive answer, the answer again becomes based on interpretation, and the tool loses much of its value.
An initial version of the tool can be as simple as just traversing the message tree, verifying that all mandatory fields are present, and verifying that all the anchor/reference links match up. That can then at least tell you when something is obviously invalid or may be valid. After that it can be made smarter incrementally, with things like type checking and cross-validating YAML extension files.
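To give an idea of how small such a first version could be, here is a rough Python sketch. It is not a real implementation: the plan is modelled as nested dicts and lists rather than actual Substrait protobuf messages, and the `REQUIRED` table, the `kind` key, and the `yaml_available` flag are all assumptions made up for the example. It only shows the three outcomes, the mandatory-field check, and the anchor/reference check.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    VALID = "valid"
    MAYBE_VALID = "possibly valid"
    INVALID = "invalid"


@dataclass
class Report:
    errors: list = field(default_factory=list)    # definite violations
    warnings: list = field(default_factory=list)  # checks that could not be completed

    @property
    def verdict(self):
        if self.errors:
            return Verdict.INVALID
        if self.warnings:
            return Verdict.MAYBE_VALID
        return Verdict.VALID


# Hypothetical table of mandatory fields per message kind.
# A real tool would have to derive this from the spec.
REQUIRED = {
    "read": ["base_schema"],
    "function_call": ["function_reference"],
}


def validate(plan):
    """Walk a dict-based stand-in for a plan and collect errors and warnings."""
    report = Report()
    anchors = set()
    for ext in plan.get("extensions", []):
        anchors.add(ext["anchor"])
        # Unknown context (e.g. an inaccessible YAML file) downgrades the verdict.
        if not ext.get("yaml_available", True):
            report.warnings.append(f"extension {ext['anchor']}: YAML not accessible")
    _walk(plan.get("root", {}), "root", anchors, report)
    return report


def _walk(node, path, anchors, report):
    if isinstance(node, dict):
        kind = node.get("kind")
        # Check that every mandatory field for this kind of message is present.
        for name in REQUIRED.get(kind, []):
            if name not in node:
                report.errors.append(f"{path}: {kind} is missing mandatory field {name!r}")
        # Check that a reference points at a declared anchor.
        ref = node.get("function_reference")
        if ref is not None and ref not in anchors:
            report.errors.append(f"{path}: reference {ref} has no matching anchor")
        for key, value in node.items():
            _walk(value, f"{path}.{key}", anchors, report)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            _walk(item, f"{path}[{index}]", anchors, report)
```

For example, `validate({"extensions": [], "root": {"kind": "read"}})` reports the read without a schema as invalid, while a plan whose only problem is an extension with `yaml_available` set to `False` comes back as possibly valid.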
I can try to start this effort, but I don't know if I'd want to maintain it long-term, so I don't know if I'm the right person for this. If I am to start it though, I'm on the fence about using Python or C++, so let me know if there's a preference.