substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.

Home Page: https://substrait.io

License: Apache License 2.0

CSS 0.42% HTML 11.85% Shell 6.92% Python 80.81%

execution-plan query-plan relational-algebra specification

substrait's People

Contributors

andygrove westonpace jacques-n emkornfield monad-one nastra rymurr cpcloud ives9638 nobigo kbendick mbrobbel felixybw intel-bigdata hannes rui-mo laizhou gforsyth bkietz kexianda blaze-init zzcclp songjx010 thisisnic johanpel jvanstraten georgeap pdet saulpw zhangchunyang19890123 jamesthesnake marin-ma kkxiaotikk sanjibansg yutiansut parth-brahmbhatt puneetjaiswal jcamachor zeroshade paliwalashish jdye64 beettlle zjie1 singhasdev pitrou jinfengni jamesrtaylor timelystream jkself ashvina curino sfc-gh-dakang bzhaoopenstack weijietong vibhatha rok hbutani guykhazma yma11 rtpsw richtia ianmcook penghuijiao almann drin terry1504 yaooqinn nealrichardson spevenhe bishwajitdey nseekhao shimingfei zanmato1984 wjones127 lviiii aaneja eedalong zhixingheyi-tian donghekang loneylee philo-he yanghua danepitkin aziemchawdhary-gs moyun longjiquan vbarua itay-nb chenqin tokoko youngsofun davisusanibar rshura ranaalotaibims piercexiao413 synnada-ai neuroblade lingo-xp icexelloss escanda

substrait's Issues

Introduce the ability to express a correlated subquery

We should introduce a way to express a correlated subquery. We need to come up with the best way to represent this. Some prior art:

Calcite - A filter relation that defines a dynamic variable of its data, which is used in a new expression type that contains a subtree. The subquery subtree references the fields of this dynamic variable to connect the sub and main trees ([dynamic variable].[field name], e.g. $cor0.P_PARTKEY).
Trino - A relational node type that has a special set of symbols from the outer tree, which are then referenced via special expression assignments from the inner tree. (I think; this is based on just a quick review of the code.)
Spark - Looks like an expression that contains a subtree, a set of outer expressions and an exprId (that I believe is used inside the inner subtree). This looks/feels a bit like Calcite, although Calcite also has the agg functions & groupings.

Are there other examples that people think should provide inspiration?

Exception field not defined in documents for scalar functions

The scalar function yaml file has an exception field, which isn't mentioned in the scalar function definition.

Create a way to embed associated protobuf/human readable/etc IDL in documentation pages.

We should define a way to embed the relevant portions of serialization and examples in the specification pages to help people understand how things are expressed.

Proposal: Make Extensions.Mapping.FunctionMapping.index optional

This is continuing the discussion from #65.

Rationale:

Given the input types and the output type then the variant for a given function expression could be automatically detected by a validator (or the call could be found to be non conformant because no variant exists which satisfies the criteria).
It's hard to compute (see #65)
Function variants should not change the semantic meaning of the function. Therefore, if producer and consumer agree on input types, output type, and (extension namespaced) function name then they agree on a common semantic meaning. Adding the variant index does not help in this process.
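The first rationale point can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory registry (`VARIANTS`) standing in for the YAML function definitions; the `Variant` tuple and `resolve_variant` are names made up for illustration, not part of Substrait.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Variant(NamedTuple):
    name: str
    arg_types: Tuple[str, ...]
    output_type: str


# Hypothetical registry; a real one would be loaded from the extension YAML.
VARIANTS = [
    Variant("add", ("i32", "i32"), "i32"),
    Variant("add", ("i64", "i64"), "i64"),
    Variant("add", ("decimal", "decimal"), "decimal"),
]


def resolve_variant(name: str, arg_types: Sequence[str], output_type: str) -> Optional[int]:
    """Return the index of the single matching variant, or None if the call is non-conformant."""
    matches = [
        index
        for index, variant in enumerate(VARIANTS)
        if variant.name == name
        and variant.arg_types == tuple(arg_types)
        and variant.output_type == output_type
    ]
    return matches[0] if len(matches) == 1 else None
```

With this kind of lookup, a validator can recover the variant from the name and types alone, and a call such as `add(i32, i64) -> i64` is reported as having no matching variant.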

Lastly, I would add that I think most systems will either: declare static mappings between internal operations and semantic definitions or do build-time dynamic binding (as opposed to doing so at runtime). The changes here also make that more straightforward, explicit and reliable.
@jacques-n (quote pulled from discussion on #65)

This is exactly what I am doing. Here is my current implementation:

            var func = new Expression.Types.ScalarFunction
            {
                // FunctionIdFromName is (more or less) a static mapping from function name to a function id
                Id = new Extensions.Types.FunctionId {Id = SubstraitUtil.FunctionIdFromName(functionName)},
                OutputType = SubstraitUtil.SubstraitTypeFromCSharpType(expr.Type)
            };
            func.Args.Add(leftArg);
            func.Args.Add(rightArg);

If I have to add a variant index using the old yaml definition the inner function creation expands with hard-coded variant ids:

            var func = new Expression.Types.ScalarFunction
            {
                OutputType = SubstraitUtil.SubstraitTypeFromCSharpType(expr.Type)
            };
            if (expr.Type.IsAssignableTo(typeof(decimal)))
            {
                func.Id = SubstraitUtil.FunctionIdFromName(functionName, /*variant_idx=*/1);
            }
            else
            {
                func.Id = SubstraitUtil.FunctionIdFromName(functionName, /*variant_idx=*/0);
            }
            func.Args.Add(leftArg);
            func.Args.Add(rightArg);

Under the new YAML proposed it gets even more onerous:

            var func = new Expression.Types.ScalarFunction
            {
                OutputType = SubstraitUtil.SubstraitTypeFromCSharpType(expr.Type)
            };
            int variantIndex = 0;
            switch (func.OutputType.KindCase)
            {
                case Type.KindOneofCase.I8:
                    variantIndex = 1;
                    break;
                case Type.KindOneofCase.I16:
                    variantIndex = 2;
                    break;
                case Type.KindOneofCase.I32:
                    variantIndex = 3;
                    break;
                case Type.KindOneofCase.I64:
                    variantIndex = 4;
                    break;
                case Type.KindOneofCase.Fp32:
                    variantIndex = 5;
                    break;
                case Type.KindOneofCase.Fp64:
                    variantIndex = 6;
                    break;
                case Type.KindOneofCase.Decimal:
                    variantIndex = 7;
                    break;
                default:
                    throw new Exception(
                        $"Illegal arithmetic variant: functionName={functionName} outputType={expr.Type}");
            }
            func.Id = SubstraitUtil.FunctionIdFromName(functionName, variantIndex);
            func.Args.Add(leftArg);
            func.Args.Add(rightArg);

And this is assuming the variant ordering is consistent between all the different arithmetic functions. Indeed, at this point, this seems brittle enough and error-prone enough that I would probably go through the bother of doing the output type derivation myself so I could compute the variant index that way.

In summary, I'm doing a lot more work, but I don't see what I'm gaining.

Proposal: Make the system consumable function yaml easier to integrate

TLDR: @cpcloud and @jacques-n have both been working on building substrait producers and the current yaml functions are too difficult to work with. The yaml needs to be more discrete (which results in something less concise).

A key part of plan translation to/from substrait is the mapping of Substrait function definitions to producer or consumer function definitions. Currently, at least the following translators are underway:

@jacques-n is working on converting a Calcite plan to Substrait. In Calcite, functions are mapped to very generic complex object hierarchies with embedded type resolution logic (e.g. add(ANY, ANY)).
@cpcloud is working on converting Ibis plans to Substrait. In Ibis, functions are mapped to very specific function definitions add(int32,int32).

This means the two efforts have touched different ends of the abstraction specification. Both efforts have found that this mapping process is overly complex using the current representation. In both cases, people have tried to create some kind of additional abstraction to make the mapping process more straightforward. Since it is critical to facilitate ease of Substrait integration in many tools, this suggests that the level of abstraction/complexity is not right for easy integrations.

The current function signature yaml docs were primarily optimized to express things concisely. This includes the use of aliases, anchors and merge nodes as well as the capability to express type bounds. While the current form may be a relatively concise representation for human consumption, concision has come at the cost of integration effort. Part of the integration complexity comes from the fact that the functions are neither fully generic nor fully discrete, and part comes from the current pattern for offset-based resolution. As we were discussing things, we concluded that supporting fully discrete functions would make things easiest to reason about.

Recommendation: Move to a representation that is much simpler.

Remove type bounds (only allow concrete types or a pure wildcard type)
Remove yaml-only features

This simpler representation would require less sophistication from integrators trying to comprehend it. It also allows us to simplify the referencing scheme for binding to a specific function. Our current thinking is that we should update the referencing scheme to use a string pattern for referencing the specific discrete function we're binding to. As an example, add(i32,i32) might be referenced using one of add<i32,i32>, add_ii, add:ii or something similar. While a simple string representation might be somewhat less fancy than other systems, it allows easier human consumption.
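As a rough sketch of what a string-based reference could look like, the Python below builds and parses names of the form `add:i32_i32`. The separator choice and the helper names are assumptions made for illustration; the proposal above leaves the exact pattern open.

```python
from typing import List, Tuple


def compound_name(name: str, arg_types: List[str]) -> str:
    """Build a string reference such as 'add:i32_i32' (hypothetical format)."""
    return f"{name}:{'_'.join(arg_types)}"


def parse_compound_name(reference: str) -> Tuple[str, List[str]]:
    """Split a reference back into the function name and its concrete argument types."""
    name, _, args = reference.partition(":")
    return name, (args.split("_") if args else [])
```

A producer with a static mapping only needs to store such strings, for example `{"plus_int": "add:i32_i32"}`, instead of computing offsets into a YAML document.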

To make things less repetitive on the document side for things like description/name, we're currently thinking that we add an additional level of structure to the yaml document to share common attributes.

I'll look to post a patch proposing the changes.

Proposal: make function names non-symbolic

Most functions do not have symbolic representations, so I think for consistency we should avoid special casing arithmetic or any other commonly-symbolic operator and give them names like add, subtract, etc.

Slack invite link on website requires `@substrait.io` domain

I would love to join though =)

Format: allow arbitrary expressions for grouping keys in AggregateRel

Currently, AggregateRel requires its grouping field to reference all fields by name:

message AggregateRel {
  // ...
  repeated Grouping groupings = 3;
  repeated Measure measures = 4;
  // ...
  message Grouping { repeated int32 input_fields = 1; }
  // ...
}

This forces producers to insert a projection that contains:

all grouping expressions
the unfurled arguments of every aggregation function call

The second is extremely onerous for producers. Any tree-like producer will have to be able to reconstitute every aggregate expression.

This leads to a huge difference in the amount of code needed for producing AggregateRels versus that needed for producing every other Rel variant: https://github.com/cpcloud/ibis/blob/substrait/ibis/backends/substrait/compiler.py#L687-L771 and makes producer code for AggregateRels extremely fragile.

To me, this suggests that the AggregateRel grouping keys are likely at the wrong level of abstraction for the goal of producing a would-be-logical-plan (I understand the line is blurry between logical and physical, perhaps we can call this a "level 0 plan"?).

I propose that we allow arbitrary expressions for grouping.
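To make the cost of the current design concrete, here is a minimal Python sketch of the lowering a producer has to perform today: each grouping expression is appended to a projection so that the grouping can refer to it by index. Expressions are modelled as plain tuples, which is purely an assumption for illustration and not the Substrait representation.

```python
def lower_grouping(input_width, grouping_exprs):
    """Return (projection, input_fields) so that grouping can use field indices only.

    Expressions are hypothetical tuples, e.g. ("field", 0) or
    ("call", "add", ("field", 0), ("field", 1)).
    """
    # Start by passing through every input column unchanged.
    projection = [("field", i) for i in range(input_width)]
    input_fields = []
    for expr in grouping_exprs:
        if expr not in projection:
            # Computed expressions must be materialised as extra projected columns.
            projection.append(expr)
        input_fields.append(projection.index(expr))
    return projection, input_fields
```

If grouping keys could be arbitrary expressions, the producer could emit `grouping_exprs` directly and skip this step, along with the matching rewrite of every aggregate argument.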

Question / request for Join operation

The join operation currently supports inner, outer, left and right joins. I believe it also supports a cross join by allowing a literal True for the join expression. I wonder how semi and anti joins are supported.

An extensible ReadRel::LocalFiles::FileFormat

Right now, if a producer and consumer wanted to support some unsupported (potentially internal/proprietary) file format, they would need to create an entirely new relation extension or bury it in the hints, both of which seem a little heavyweight.

Making FileFormat a string would be something of a compromise / hack for extensibility.

A string will probably fall short in some cases (e.g. with compressed CSV you would want to know what kind of compression has been applied) though an enum would equally fall short.

Add Insert/Update/Delete basic functionality to specification

This would be really useful for Substrait backends that want to function like a traditional RDBMS rather than query-only/for analytics.

Dev: provide some kind of environment for validating changes

It would be very useful to not have to run CI to test local changes.

Perhaps it's time to set up some kind of testing environment so someone can test their changes in a standardized way.

Question / request for Aggregate operation

We'd like to be able to specify masks for individual aggregations and a boolean ignoreNullKeys for a grouping set.

Masks are boolean input columns that mask out rows for individual aggregations, e.g. SELECT count(1) filter (where a > 10) FROM t.

The ignoreNullKeys boolean flag avoids unnecessary processing when an aggregation is followed by an inner join on the grouping keys. In this case, rows with nulls in the grouping keys cannot possibly match the join condition, so we'd like to skip aggregation for such groups.
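The two requested features can be sketched in a few lines of Python. The function below is a toy grouped count over a list of dict rows; `mask` and `ignore_null_keys` are hypothetical parameter names mirroring the request, not existing Substrait fields.

```python
def grouped_count(rows, key, mask=None, ignore_null_keys=False):
    """Count rows per group, optionally masking rows and skipping null grouping keys."""
    counts = {}
    for row in rows:
        group = row[key]
        if ignore_null_keys and group is None:
            # A null key can never match a later inner join, so skip the work entirely.
            continue
        counts.setdefault(group, 0)  # the group exists even if every row is masked out
        if mask is None or row[mask]:
            counts[group] += 1
    return counts
```

The mask only affects the individual measure (a group whose rows are all masked still appears with a count of 0), whereas `ignore_null_keys` removes the whole group.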

CC: @jacques-n

Add DistributeRel Message

#121

Build on (or take inspiration from) MLIR?

I just saw the announcement email on the arrow mailing list. I read the project vision, and I understand "Why not use SQL?" & "Why not just do this within an existing OSS project?".

But I was wondering if relation to / compatibility with MLIR was considered at all already (which is license-compatible with arrow)? I can understand the desire to start a-fresh, and MLIR is probably also overkill in many ways, but I was thinking that if the IR is designed in an MLIR-compatible way from the start, it might unlock unexpected future benefits, like being able to leverage all the compiler-optimizations available in LLVM.

A quick google search showed that someone (CC @mboehm7) already had a similar idea.

We should make extensions more generic

Right now we list all extensions used at the top of the plan. However, we group items into three different categories of extensions: types, type variations, and functions.

So we have something like:

{
  "functions": ["custom_add", "magnitude"],
  "types": ["complex"],
  "type_variations": ["unsigned"]
}

However, these three categories are insufficient. We already have examples of additional categories (e.g. file formats à la #138, ReadRel::ExtensionTable, ExchangeRel::ExchangeTarget, etc.) and it seems likely we will have more in the future (data serialization formats, metadata catalogs, etc.).

We could get rid of these categories entirely: westonpace@a9ea46f

This gives us:

{ "extensions": [ "custom_add", "magnitude", "complex", "unsigned" ] }

We could also make "extension type" a first class concept with its own fully qualified name: westonpace@9882d04

This gives us:

{
  "extensions": [
    { "extension_type": "function", "name": "custom_add" },
    { "extension_type": "function", "name": "magnitude" },
    { "extension_type": "type", "name": "complex" },
    { "extension_type": "type_variation", "name": "unsigned" }
  ]
}
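For comparison, the Python sketch below converts the categorised form into the typed-list form shown above. The function name and the singular-name table are assumptions for illustration; the point is only that a new category becomes a new `extension_type` string rather than a new top-level field.

```python
# Maps today's plural category keys to a hypothetical "extension_type" value.
SINGULAR = {
    "functions": "function",
    "types": "type",
    "type_variations": "type_variation",
}


def to_typed_extensions(categorised):
    """Flatten {"functions": [...], "types": [...], ...} into a single typed list."""
    return {
        "extensions": [
            {"extension_type": SINGULAR[category], "name": name}
            for category, names in categorised.items()
            for name in names
        ]
    }
```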

I don't have a strong preference on approach and am happy to consider alternatives. Whatever we decide on I'm also happy to help write up the spec. However, I expect this is going to have some backwards incompatible changes so I think it would be better to figure this out sooner rather than later.

Add a window join operation with as-of support

I was talking with @cpcloud about supporting asof. During that discussion it generally felt as if it shouldn't be implemented/conceived of as a traditional join operator because the condition cannot be expressed as a scalar condition. Instead, it really requires a window-function-like partitioned analysis that returns a set of matching rows for a given input row.

As we discussed this, Phillip pointed out that asof is an example of this but not the only one (and there are several variations of as-of). As such, we think creating a new "window join" operation with a new kind of "window join function" or similar probably makes the most sense (with as-of variations being the first window join function to define). For reference, here is the window join operation in KDB: https://code.kx.com/q/ref/wj/
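To show why as-of does not fit a scalar join condition, here is a minimal Python sketch of one as-of variation: for each left row, pick the most recent right row at or before its timestamp. The row layout (`(time, value)` tuples, right side sorted by time) is an assumption for illustration.

```python
from bisect import bisect_right


def asof_join(left, right):
    """For each (time, value) in left, attach the latest right value with time <= left time.

    `right` must be sorted by time. Unmatched left rows get None.
    """
    right_times = [time for time, _ in right]
    joined = []
    for time, value in left:
        # The match depends on the ordering of the whole right side, not on a per-pair predicate.
        position = bisect_right(right_times, time)
        match = right[position - 1][1] if position else None
        joined.append((time, value, match))
    return joined
```

Other window join functions would replace the "latest row at or before" rule with a different selection over the same ordered window.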

Question: How to convert a function expression to substrait IR

Hi @jacques-n,
As defined in function.proto, each function has a property "Extensions.FunctionId", which should map to a unique implementation of a specific function, and also an "Implementation" location. So the implementation will be available at the Implementation:uri, right? When converting a plan to Substrait IR, will the workflow be as follows?

Analyze the expressions of the PlanNode, such as gt(a, 8), and validate their implementation status by retrieving info from Implementation:uri, which returns an Extensions.FunctionId?
Fill in the other fields of the expression, such as the fieldReference "a", the constant "8" and its output_type "boolean".
Is the op name "gt" missing from Expression:ScalarFunction? For example, should it be:
message ScalarFunction {
  Extensions.FunctionId id = 1;
  string name = 2;
  repeated Expression args = 3;
  Type output_type = 4;
}

I would like to try converting a Presto physical plan into Substrait IR. Is there more info I can refer to?
Thanks.

Move the Partition Index out of the message 'FileOrFiles'

It would be better to move the Partition Index out of the message 'FileOrFiles'. In general, one partition includes several files to be operated on. For example, in Spark, one 'FilePartition' includes one partition index and several 'PartitionedFile' entries, where each 'PartitionedFile' describes one file to be scanned.

Add a github workflow step that verifies correctness of the .proto files before merge

Right now, there is no pre-commit step that verifies the proto files are valid. We should add something that verifies this (even though we don't release any artifacts based on the proto files).

Plan validator/consumer reference implementation

tl;dr: requesting a tool that can give an authoritative answer to the question "does this plan conform to the Substrait spec?"

I've been studying the Substrait specification and protobuf files for a few days now, with the purpose of contributing more extensively to the consumer for Arrow's query engine. While I think I understand the basic ideas for most of it, I'm finding that the specification is not very precise about some of the nitty-gritty details. For example, is the type_variation_reference field of the various builtin type messages mandatory? If yes, what's the type variation of, for example, a literal (which doesn't define a reference)? Based on that I figure the answer is "no," but that means that type_variation_reference == 0 (or possibly some other reserved value?) must mean "no variation," as protobuf just substitutes a default value for primitive types when a field isn't specified. Does that then also mean that anchor 0 is reserved for other such links? And so on. The most annoying part of these types of questions is that I don't know if I just glossed over some part of the spec. I don't like asking pedantic questions when I'm not 100% sure I'm not the problem myself.

The core issue here, in my opinion, is that there is no objective, authoritative way to determine for some given Substrait plan whether A) it is valid and B) what it means. The only way to do that right now is to interpret a plan by hand, which, well... is open to interpretation. Even if a spec is completely precise about everything, it's still very easy for a human to make mistakes here. If, however, there was a way to do that, I could just feed it some corner cases to resolve the above questions, or look through its source code.

What's worse though, is how easy it is to make a plan that looks sensible and that protobuf gives a big thumbs-up to, but actually isn't valid at all according to the spec. For example, protobuf won't complain when you make a ReadRel with no schema, as for instance the folks over at https://github.com/duckdblabs/duckdb-substrait-demo seem to have done (for now, anyway). Now, that one pretty obviously contradicts the "required" tag in the spec and they seem to be aware of it, but my point is that relying solely on protobuf's validation makes it very easy to come up with some interpretation of what the Substrait protobuf messages mean that makes sense to you from your frame of reference, to the point where everything seems to work, as DuckDB has already done for the majority of TPC-H... until you try to use Substrait for what it's built for by connecting to some other engine/parser, and find out that nothing actually works together. Or worse (IMO), that it only works for a subset of queries and/or only fails sometimes or for some versions of either tool due to UB, and it's not obvious why or whose fault it is. An issue like that could easily devolve into a finger-pointing contest between projects, especially if a third party finds the problem.

So, my suggestion is to make a Substrait consumer that tells you with authority whether a plan is:

valid, ideally with some human-friendly representation of what it does (I have had a few ideas for this, but none of them strike me as particularly good; even if the tool just gives a thumbs up it's already very useful though);
invalid, and why; or
possibly valid, when validity depends on unknown context (like YAML files that aren't accessible by the validator), or, at least initially, when some check isn't implemented in the validator yet.

Note that when I say "with authority," I mean that once things stabilize, any disagreement between the tool and the spec for a given version should be resolved by patching/adding errata to the spec. If the tool doesn't provide the definitive answer, the answer again becomes based on interpretation, and the tool loses much of its value.

An initial version of the tool can be as simple as just traversing the message tree, verifying that all mandatory fields are present, and verifying that all the anchor/reference links match up. That can then at least tell you when something is obviously invalid or may be valid. After that it can be made smarter incrementally, with things like type checking and cross-validating YAML extension files.
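The initial version described above could look roughly like the Python sketch below. It assumes a plan already decoded into nested dicts and lists with made-up keys (`kind`, `base_schema`, `function_reference`, `anchor`, `uri`); these do not mirror the real protobuf layout and only stand in for "mandatory field" and "anchor/reference" checks.

```python
VALID, INVALID, POSSIBLY_VALID = "valid", "invalid", "possibly valid"


def validate(plan, resolvable_uris=frozenset()):
    """Return (verdict, messages) for a plan given as nested dicts/lists (hypothetical layout)."""
    errors, unknowns = [], []
    anchors = set()
    for ext in plan.get("extensions", []):
        anchors.add(ext["anchor"])
        if ext["uri"] not in resolvable_uris:
            # Validity depends on a YAML file the validator cannot see.
            unknowns.append(f"extension {ext['anchor']}: cannot load {ext['uri']}")

    def walk(node, path):
        if isinstance(node, dict):
            if node.get("kind") == "read" and "base_schema" not in node:
                errors.append(f"{path}: ReadRel is missing base_schema")
            reference = node.get("function_reference")
            if reference is not None and reference not in anchors:
                errors.append(f"{path}: function_reference {reference} has no matching anchor")
            for key, value in node.items():
                walk(value, f"{path}.{key}")
        elif isinstance(node, list):
            for index, item in enumerate(node):
                walk(item, f"{path}[{index}]")

    walk(plan.get("relations", []), "relations")
    if errors:
        return INVALID, errors
    if unknowns:
        return POSSIBLY_VALID, unknowns
    return VALID, []
```

The three return values correspond to the three outcomes listed earlier; smarter checks such as type checking would add to `errors` or `unknowns` without changing the interface.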

I can try to start this effort, but I don't know if I'd want to maintain it long-term, so I don't know if I'm the right person for this. If I am to start it though, I'm on the fence about using Python or C++, so let me know if there's a preference.

Clarify longer term governance structure?

This might be premature, but does it pay to clarify what the vision for longer term governance of this effort is? (is the goal to incubate this in the ASF?)

Question: Should we cover some distributed communication operators?

I am not sure whether Substrait should provide communication operators such as shuffle/exchange relations. The question is mostly about how Substrait is used with a distributed execution backend.

Remove unsigned literal definitions in protobuf

These were removed elsewhere but not protobuf.

Implement Calcite native library that parses sql and returns a binary substrait representation

The idea here is to provide a reasonable way for people to give users immediate access to the high quality SQL parsing of Calcite with minimal effort. We'd use GraalVM for AOT compilation and start with a fairly simple function similar to substrait parse(string)

It would be nice if part of this effort was to expose this library with a command line tool that could be piped to other future tools. (For example, create an additional cli that will take a plan and return the results with Datafusion.)

A big question is what catalog to expose in an example cli. Some ideas:

Require a user to provide --table = for Parquet files (similar to how local paths are explored in Docker)
Pass in table declarations --table t1=(int c1, int c2) and treat the read objects as opaque (for later binding use by an execution system)?
Other ideas?

A second fun thing to add would be a separate library that takes a plan in and returns a plan out, applying a list of optimization rules using one of the existing Calcite optimizers. This is lower priority than the SQL parser initially, but it could be interesting for evaluating different optimization patterns and for starting to expose nice Calcite interfaces for things like python/rust/etc.

Setup automated releases

@cpcloud and I have been discussing introducing automated releases/release notes and starting to version things. We should get this going since other projects would be much better off depending on a released set of substrait proto definitions, etc as opposed to using a git hash.

Format: clarify the units of date literals

date literals are defined as fixed32 values, but no information about what the units of those values are is documented.

A 32-bit integer doesn't allow for more than about 9 decimal digits, so seconds-since-epoch seems like the only possible interpretation of units here if my arithmetic is correct.

I guess this raises another question of whether we should move to int64 and use milliseconds-since-epoch?
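Since the question turns on arithmetic, a quick Python check of what an unsigned 32-bit value can span under a few candidate units may help. The candidate units are assumptions for comparison only; nothing here settles which unit the format intends.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.2425  # mean Gregorian year

max_u32 = 2**32 - 1
decimal_digits = len(str(max_u32))

# Range covered by a 32-bit unsigned value under each candidate unit.
years_if_seconds = max_u32 / SECONDS_PER_DAY / DAYS_PER_YEAR
days_if_milliseconds = max_u32 / 1000 / SECONDS_PER_DAY
years_if_days = max_u32 / DAYS_PER_YEAR

print(f"digits={decimal_digits}")
print(f"seconds: about {years_if_seconds:.1f} years")
print(f"milliseconds: about {days_if_milliseconds:.1f} days")
print(f"days: about {years_if_days:,.0f} years")
```

Under these assumptions, seconds cover a little over a century, milliseconds cover only a few weeks, and days cover millions of years, so both seconds and days fit in 32 bits while milliseconds would need a wider integer.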

Question: What is the element type of an empty list or map literal?

It's not clear if or how an empty collection type (list or map) can be represented right now. The current Literal.List type allows for an empty sequence, but it's not clear how the type information of such a literal would be known.

First release of Substrait

I've had several people ask me about starting to have releases. I think it would be a good idea to do a first release in the next month or so. This would allow people to start working with some prebuilt artifacts and read release notes, etc between releases.

Add github workflow action to ensure doc changes build on PRs

Docs: Add decimal to the logical types docs

It looks like https://substrait.io/types/simple_logical_types/ doesn't have any information about decimal types.

Question: why are IfThen and SwitchExpression messages near exact copies of each other?

What is the use case for having IfThen and SwitchExpression messages that are effectively exact copies of each other?

  message IfThen {
    repeated IfClause ifs = 1;
    Expression else = 2;

    message IfClause {
      Expression if = 1;
      Expression then = 2;
    }
  }

  message SwitchExpression {
    repeated IfValue ifs = 1;
    Expression else = 2;

    message IfValue {
      Expression if = 1;
      Expression then = 2;
    }
  }

I understand the difference between the two different types of CASE expressions in SQL.

Can't we represent SwitchExpression in terms of an IfThen with an equality condition?
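The rewrite being asked about can be sketched in Python as follows. Expressions are plain tuples and the result is a dict shaped like the IfThen message above; the `("equal", a, b)` call form is a hypothetical stand-in for an equality function expression.

```python
def switch_to_if_then(match, cases, default):
    """Desugar a switch over `match` into an IfThen-shaped dict.

    `cases` is a list of (value, result) pairs; each becomes an equality test.
    """
    return {
        "ifs": [
            {"if": ("equal", match, value), "then": result}
            for value, result in cases
        ],
        "else": default,
    }
```

One difference worth noting is that the desugared form repeats the `match` expression in every clause, which matters if it is expensive or not deterministic.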

If we still need to distinguish between the two ... cases (pun intended), can we refactor this to use an enum for differentiating instead of two effectively identical constructs?

C++ project for substrait

Do we have a plan to write a project named substrait-cpp, just like substrait-java?

Add FAQ on how substrait relates to Beam's work on relational modelling?

A reply to the announcement on the Beam dev mailing list had some substantive questions buried in it. It might pay to write up the differences and overlap between Substrait and Beam's efforts.

Question / request for Project operation

The project operation will produce one or more additional expressions based on the inputs of the dataset.

Would it be possible to allow project operation to optionally drop columns, e.g. have it defined as adding zero or more expressions and dropping zero or more columns?

Request for Unnest operation

Would it be possible to add an Unnest operation?

https://prestodb.io/docs/current/sql/select.html#unnest

Style: rename Type.NamedStruct to toplevel Schema?

Currently Type.NamedStruct is not part of type system and is only being used to represent schemas in read relations. I think we should move this out and rename it to Schema until there's another concrete, documented use case.

How should we express pushdown filtering?

The ReadRel has a filter property defined (in the markdown) as:

A boolean Substrait expression that describes the filter of a iceberg dataset. TBD: define how field referencing works.

I assume this is intended for pushing filtering down into the read. However, it isn't clear to me if it is expected that the consumer is able to fully satisfy the filter or not. There are many cases where filters cannot be pushed down, and some of those cases won't be known at plan creation time.

For example, if the source files are CSV then no filter pushdown is possible. The same applies if the source files are Parquet but no column statistics or bloom filters are enabled for a particular source file.

As a result, in Arrow, we typically define plans with both a pushdown filter and a filter node. The pushdown filter is basically a best-effort hint.

Is this the same intent in Substrait? Or, if a consumer can't satisfy the filter in the ReadRel, is it expected that the consumer will do the additional filtering in-memory?
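The "best-effort hint plus filter node" pattern described above can be sketched in Python like this. Files are dicts with `rows` and optional per-column `stats`; this layout and the fixed `column > threshold` predicate are assumptions for illustration, not how Arrow or Substrait model a read.

```python
def read_with_pushdown(files, column, threshold):
    """Return (rows where row[column] > threshold, number of files skipped via statistics)."""
    rows, skipped = [], 0
    for file in files:
        stats = file.get("stats", {}).get(column)
        if stats is not None and stats["max"] <= threshold:
            # Pushdown: statistics prove no row can match, so the file is never scanned.
            skipped += 1
            continue
        # Residual filter: still required, because the pushdown is only a hint.
        rows.extend(row for row in file["rows"] if row[column] > threshold)
    return rows, skipped
```

A file without statistics (such as CSV) simply falls through to the residual filter, which is why the result stays correct whether or not the hint can be honoured.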

Evaluate an alternative for physical properties that reduces implicit requirements

Based on this slack thread there has been discussion about moving to a more explicit form of physical properties than the current, relatively implicit formulation. This could reduce behavior requirements for executors when they are not needed, reducing execution cost. Additionally, explicit properties would be easier for new users to understand/work with.

This task will be to evaluate alternatives and see if we can find something better than the current system.

Save the Partition Infos in ReadRel

It would be very useful to save the Partition Infos (including the partition index, file path, start and length) in ReadRel.

Proposal: split options out of the arguments list in extensions functions YAML

It's a nuisance to have to search for a mapping with a specific field called options when processing arguments. I think we should split options into its own field and keep arguments just for positional arguments.

Question: why does ReadRel include a schema?

It's not clear to me why ReadRel includes the base schema of the table to be read. That seems like something the producer would configure in a catalog and that the consumer would look up in the same catalog.

Types: Decimal literals should specify the meaning of their bytes

From apache/arrow#11707
There's no documentation about how the bytes of a decimal literal should be interpreted.

I suspect the intent is 2's complement, little-endian. We should write that as a documentation comment in the proto.
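If the suspected intent is right, the encoding would behave like the Python sketch below. The 16-byte width and the helper names are assumptions for illustration; the two's complement, little-endian interpretation is the guess stated above, not something confirmed by the proto.

```python
from decimal import Decimal


def encode_decimal(unscaled: int, width: int = 16) -> bytes:
    """Encode the unscaled integer as two's complement, little-endian bytes (assumed layout)."""
    return unscaled.to_bytes(width, byteorder="little", signed=True)


def decode_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode bytes produced by encode_decimal and apply the scale."""
    unscaled = int.from_bytes(raw, byteorder="little", signed=True)
    return Decimal(unscaled).scaleb(-scale)
```

For example, 234.234 with scale 3 is stored as the unscaled integer 234234, and -1 becomes sixteen 0xff bytes.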

SubQuery Support in Substrait

Subqueries are common in SQL. Typically, they can appear in two forms.
1) In an expression, for example:
SELECT * FROM lineitem WHERE orderkey IN (SELECT orderkey FROM orders WHERE (orderkey + custkey) % 2 = 0)
2) As part of the FROM clause, such as:
SELECT distinct orderkey FROM (SELECT orderkey FROM lineitem WHERE linenumber = 5)

For a subquery in an expression, database engines usually use a special node type in the expression tree (e.g. expSubQuery) to represent it.

For a subquery in the FROM clause, database engines have two options to represent it:
Option 1: Introduce a subquery node (e.g. RelSubquery) in the plan tree that points to the plan tree of the subquery.
Option 2: Simply flatten the subquery plan tree as the input of the main query.

Should Substrait also have representations for subqueries?

Types: Do decimal literals need to carry around their precision and scale?

It's not clear to me whether decimal literals should carry around their type.

I can imagine a trivial SQL query like SELECT decimal('234.234'), and it's not clear what the type of that expression would be.

Should decimal literals also carry around their precision and scale information?

Evaluate how Substrait should/could work with ONNX

Some people have asked about supporting less-sql operations in Substrait. Ideally, we would be able to support a broad range of non-sql patterns. How we do this is up for debate. One question was what kind of integration of ONNX might make sense. It would be good to formalize a perspective about this.

Does ReadRel::base_schema include columns hidden by projection?

The ReadRel relation states:

Direct Schema | Defines the schema of the output of the read (before any emit remapping/hiding).
Projection | A masked complex expression describing the portions of the content that should be read

Does the Direct Schema include fields hidden by Projection (which is not technically the same thing as emit remapping / hiding)?

Bug: Docs link on the community page is 404ing

The Docs link on https://substrait.io/community/ is broken. We should fix it.

Add license header to classes

I think we should add a license header to the classes.
Example:

 * The Substrait Project licenses this file to you under the Apache License,
 * version 2.0 (the "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at:
 *
 *   https://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations
 * under the License.

Is there already a plan for this? If not, I can work on it.

Nullable, valued literals are not currently expressible

Right now, there is no way to express a nullable literal for non-null values. I think we need to introduce a "nullable" property on all literals.

For example, I want to create an i8? literal that is defined as 5. Right now, this is impossible. I can only declare an i8 of 5 or an i8? that is null. In a 2x2, the problem is the lower right below:

Category	No Value	With Value
Non-nullable	Typed Null Literal	Value Literal
Nullable	Typed Null Literal	Not Possible

While a bit arcane, this is important to express in strongly typed systems when trying to do a roundtrip.
