Comments (8)
I've opened a draft PR: #10473 and will try to wrap it up in the following days.
from arrow-datafusion.
> Here is one example API that I would love to implement with such a tree-node api: #10505
Thanks for sharing this @alamb. It's good to know that there are other possible use cases for this new API. #10473 seems to pass all tests now. I will extract its first commit into a separate PR today or tomorrow to add the new TreeNode API.
from arrow-datafusion.
I'm happy to take this.
from arrow-datafusion.
Are there any potential issues with simply using the existing `Hash` implementation of `Expr` to create `HashSet`s?
Several other optimization passes use string names as keys for expressions in data structures. I am wondering if any of these could also be refactored to simply use `HashSet<Expr>` or `HashSet<&Expr>`:
- synthetic group by expressions for aggregates: datafusion/datafusion/expr/src/logical_plan/builder.rs, lines 1246 to 1270 in accce97
- functional dependencies heavily uses `display_name` to represent group by exprs: https://github.com/apache/datafusion/blob/main/datafusion/common/src/functional_dependencies.rs
- decorrelate:
- push down filter for aggregates: datafusion/datafusion/optimizer/src/push_down_filter.rs, lines 788 to 837 in accce97
- single distinct to group by: datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs, lines 69 to 96 in accce97
from arrow-datafusion.
> Are there any potential issues with simply using the existing `Hash` implementation of `Expr` to create `HashSet`s?
>
> Several other optimization passes use string names as keys for expressions in data structures. I am wondering if any of these could also be refactored to simply use `HashSet<Expr>` or `HashSet<&Expr>`
Thanks for these references @erratic-pattern.
Background and general thoughts:
I'm only familiar with the CSE code, and in its case non-unique stringified expressions were unfortunately used as keys of the map that stores the occurrence counts. This bug was introduced in #9871 and reverted in #10396. The issue with these colliding string keys is explained in detail here: #10333 (comment).
Some thoughts about CSE:
After #10396 we still use stringified expressions as keys (`Identifier`), but the strings we use encode whole expression subtrees. This is far from optimal, and this ticket / my work in progress change would like to help with that.
In the case of CSE we could use `Expr` as keys of the `ExprStats` map, but then we would need to clone `Expr`s when we fill up the `ExprStats` map during the first traversal. This would be particularly costly in CSE because we need to store not only the counts for all top level expressions, but also the counts of all their descendant subexpressions.
We could also use `&Expr` as keys (so we wouldn't need to clone the expressions), but there is a problem here. The current `TreeNode::apply()` / `TreeNode::visit()` APIs can't be used to fill up such a `HashMap<&Expr, ...>` map. This is because of the restricted `TreeNode` reference lifetimes used in closures / `TreeNodeVisitor` methods.
I.e. this currently doesn't work:
```rust
let e = sum((col("a") * (lit(1) - col("b"))) * (lit(1) + col("c")));
let mut m = HashMap::new();
e.apply(|e| {
    *m.entry(e).or_insert(0) += 1;
    Ok(TreeNodeRecursion::Continue)
});
println!("m: {:#?}", m);
```
This issue can be solved by adding new `TreeNode` APIs or fixing the current ones.
I have a WIP commit here: peter-toth@e844799 that adds `TreeNode::apply_ref()` / `TreeNode::visit_ref()`.
Using `apply_ref()` in the above example would make it work, but I haven't opened a PR yet as there are a few things to consider:
a. We don't really want to add any more new APIs (especially if their purpose is similar to existing ones).
b. We can't easily change the lifetimes of references in the current `apply()` / `visit()`. This is because some `TreeNode` implementations are not compatible with that. (E.g. `DynTreeNode` doesn't have a method to get references to its children, `LogicalPlan` creates temporary objects in its `map_subqueries()`, ...).
Despite the fact that my WIP commit adds new APIs, I would prefer option b (changing the current `apply()` / `visit()`). But since I'm only aware of this ticket that requires this change to the APIs, I haven't opened the PR yet.
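To make the lifetime problem concrete, here is a minimal sketch on a toy tree. It assumes nothing about DataFusion's real signatures: `Node`, `apply`, `apply_ref`, `count_nodes` and `count_subtrees` are hypothetical stand-ins. The only difference between the two traversals is whether the closure receives a reference with an anonymous (short) lifetime or one tied to the lifetime of the tree.

```rust
use std::collections::HashMap;

// Toy expression tree standing in for DataFusion's `Expr` (not the real type).
#[derive(Debug, PartialEq, Eq, Hash)]
enum Node {
    Col(&'static str),
    Lit(i64),
    Add(Box<Node>, Box<Node>),
    Mul(Box<Node>, Box<Node>),
}

impl Node {
    fn children(&self) -> Vec<&Node> {
        match self {
            Node::Col(_) | Node::Lit(_) => vec![],
            Node::Add(l, r) | Node::Mul(l, r) => vec![l.as_ref(), r.as_ref()],
        }
    }

    // `apply`-like traversal: `F: FnMut(&Node)` must accept a reference of
    // *any* lifetime, so the closure cannot store the references it is given.
    fn apply<F: FnMut(&Node)>(&self, f: &mut F) {
        f(self);
        for child in self.children() {
            child.apply(f);
        }
    }

    // Hypothetical `apply_ref`-like traversal: the closure receives `&'n Node`,
    // a reference that lives as long as the tree, so it can be used as a map key.
    fn apply_ref<'n, F: FnMut(&'n Node)>(&'n self, f: &mut F) {
        f(self);
        for child in self.children() {
            child.apply_ref(f);
        }
    }
}

// Works with `apply` because nothing borrowed from the tree escapes the closure.
fn count_nodes(root: &Node) -> usize {
    let mut n = 0;
    root.apply(&mut |_: &Node| n += 1);
    n
}

// Needs `apply_ref`: the keys of the returned map borrow from the tree.
// Replacing `apply_ref` with `apply` here is rejected by the borrow checker.
fn count_subtrees(root: &Node) -> HashMap<&Node, usize> {
    let mut counts = HashMap::new();
    root.apply_ref(&mut |n| *counts.entry(n).or_insert(0) += 1);
    counts
}
```

For a tree such as `(a * 1) + (a * 1)`, `count_subtrees` yields four distinct keys, with the `a * 1` subtree counted twice. Note that every lookup hashes the whole subtree behind the reference, which is the cost discussed further below.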
Now there is another thing to consider if we want to use `&Expr` as keys of `ExprStats`. The current CSE algorithm, which was added by the original author of CSE in DataFusion (and not myself), is very clever and does the following:
In the first traversal it:
- Creates a mapping for each top level expression (this is called `IdArray`) that maps the preorder visit index of a node to an `Identifier` (of a subexpression tree).
- And also creates a map (this is called `ExprStats`) that contains the `Identifier` -> count stats gathered for all top level expressions and their subexpressions.
This is very nice, because the second, rewriting traversal can use the preorder visit index again to look up the identifier first and then the count from the `ExprStats` map. Provided that an identifier is small, this can be much faster than using `&Expr` as keys because:
- Computing `hash()` of an `&Expr` (instead of using preorder index) in the second traversal is costly if the expression is deep and contains lots of indirections (`Box`es).
- When we generate the identifiers in the first traversal we can use the traversal's bottom-up phase to build up identifiers from the current node and the identifiers of the node's children very effectively.
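The two traversals described above can be sketched roughly as follows. This is an illustration on a toy tree, not the DataFusion implementation: `Node`, `collect`, `report` and `find_repeated` are hypothetical names, identifiers are plain strings built bottom-up, and the second pass only reports repeated non-leaf subexpressions instead of rewriting them.

```rust
use std::collections::HashMap;

// Toy expression tree standing in for `Expr` (not the real type).
enum Node {
    Col(&'static str),
    Lit(i64),
    Add(Box<Node>, Box<Node>),
}

type Identifier = String;

// First traversal: a slot in `id_array` is reserved on the way down (preorder
// index) and filled on the way back up, where the identifier is built from the
// current node and the identifiers of its children.
fn collect(
    node: &Node,
    id_array: &mut Vec<Identifier>,
    stats: &mut HashMap<Identifier, usize>,
) -> Identifier {
    let idx = id_array.len(); // preorder visit index of this node
    id_array.push(String::new());
    let id = match node {
        Node::Col(name) => format!("col:{}", name),
        Node::Lit(value) => format!("lit:{}", value),
        Node::Add(l, r) => {
            let l_id = collect(l, id_array, stats);
            let r_id = collect(r, id_array, stats);
            format!("({} + {})", l_id, r_id)
        }
    };
    *stats.entry(id.clone()).or_insert(0) += 1;
    id_array[idx] = id.clone();
    id
}

// Second traversal: walks the tree in the same preorder and uses the visit
// index to find the identifier, so nothing is recomputed from the node itself.
fn report(
    node: &Node,
    next_idx: &mut usize,
    id_array: &[Identifier],
    stats: &HashMap<Identifier, usize>,
    out: &mut Vec<Identifier>,
) {
    let id = &id_array[*next_idx];
    *next_idx += 1;
    if let Node::Add(l, r) = node {
        if stats[id] > 1 && !out.contains(id) {
            out.push(id.clone());
        }
        report(l, next_idx, id_array, stats, out);
        report(r, next_idx, id_array, stats, out);
    }
}

// Runs both traversals and returns the identifiers of repeated `Add` subtrees.
fn find_repeated(root: &Node) -> Vec<Identifier> {
    let (mut id_array, mut stats) = (Vec::new(), HashMap::new());
    collect(root, &mut id_array, &mut stats);
    let mut out = Vec::new();
    report(root, &mut 0, &id_array, &stats, &mut out);
    out
}
```

For `(a + 1) + (a + 1)` the preorder `id_array` has seven entries, and indices 1 and 4 hold the same identifier, which is the only one `find_repeated` returns. The string identifiers grow with the size of the subtree, which is the inefficiency this ticket wants to remove.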
In my work in progress change for this issue I would like to:
- finalize the `TreeNode` API changes required (maybe open a separate PR for it)
- and replace the current `String` based identifier with a `(u64, &Expr)` like tuple/struct.
The first item contains a precomputed hash of the identifier. (As I mentioned, we can use the bottom-up phase of the first traversal to compute that effectively since this logic is already implemented in the CSE algorithm.) The overridden `hash()` of the struct should return this precomputed hash.
The second item is a `&Expr` that can be used in the struct's `eq()` implementation in case of hash collision.
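A minimal sketch of such a `(u64, &Expr)` key, again on a toy tree. The names `Node`, `Identifier` and `collect_stats` are assumptions made for illustration and the eventual PR may look quite different. The hash is combined bottom-up from the hashes of the children, `hash()` only feeds the precomputed value to the hasher, and `eq()` falls back to comparing the referenced expressions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Toy expression tree standing in for `Expr` (not the real type).
#[derive(Debug, PartialEq, Eq)]
enum Node {
    Col(&'static str),
    Lit(i64),
    Add(Box<Node>, Box<Node>),
}

// Hypothetical identifier: a precomputed hash plus a reference to the subtree.
#[derive(Debug, Clone, Copy)]
struct Identifier<'n> {
    hash: u64,
    expr: &'n Node,
}

impl Hash for Identifier<'_> {
    // Only the precomputed value is hashed; the subtree is never walked again.
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash);
    }
}

impl PartialEq for Identifier<'_> {
    // The structural comparison only runs when the hashes match, i.e. for
    // true duplicates or for hash collisions.
    fn eq(&self, other: &Self) -> bool {
        self.hash == other.hash && self.expr == other.expr
    }
}

impl Eq for Identifier<'_> {}

// Bottom-up pass: the hash of a node is derived from a variant tag, its own
// payload and the already computed hashes of its children. Structurally equal
// subtrees therefore always get equal hashes, which keeps `Hash` consistent
// with `Eq`.
fn collect_stats<'n>(node: &'n Node, stats: &mut HashMap<Identifier<'n>, usize>) -> u64 {
    let mut hasher = DefaultHasher::new();
    match node {
        Node::Col(name) => {
            0u8.hash(&mut hasher);
            name.hash(&mut hasher);
        }
        Node::Lit(value) => {
            1u8.hash(&mut hasher);
            value.hash(&mut hasher);
        }
        Node::Add(l, r) => {
            2u8.hash(&mut hasher);
            collect_stats(l, stats).hash(&mut hasher);
            collect_stats(r, stats).hash(&mut hasher);
        }
    }
    let hash = hasher.finish();
    *stats.entry(Identifier { hash, expr: node }).or_insert(0) += 1;
    hash
}
```

Because the key is `Copy` and only borrows the expression, filling the map clones nothing. The trade-off is that the key carries the lifetime of the tree, so the map cannot outlive the expressions it refers to.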
Back to the original question of using `HashSet<Expr, ...>` or `HashSet<&Expr, ...>`:
I think both are acceptable, but CSE is special as the maps need to store all the descendant subexpressions as well, and the implemented CSE algorithm seems to offer a way to implement a better identifier than just a simple `&Expr`.
I don't know the other referenced use cases, but if collisions of string names can happen there then we should definitely fix them.
from arrow-datafusion.
Thanks for the detailed write up @peter-toth. Though I did mention `HashSet<Expr>` specifically, my suggestion more generally goes along the lines of using the `Hash` implementation in some way to produce the identifiers. After looking at the code a bit more, I do see the cloning/lifetime issues with using `Expr` or `&Expr` as keys directly. I also did not consider the cost of re-computing hashes. I do think in that case it does make sense to pre-compute the hash instead.
I like the idea of generalizing the `(u64, &Expr)` struct into something reusable across optimizations, as it seems to be a common pattern where we need to:
- produce some unique identifier for an expression that can be stored in a data structure
- use that identifier to generate aliases for newly generated expressions, or create a new `Column`/`Field` somewhere with that expression as a name. This can be done thanks to the `&Expr` in the struct, which would allow us to call `display_name`
- do so in a way that doesn't conflict with ownership/borrowing semantics. We might still run into borrowing issues because of the `&Expr` reference, but it's hard to say without trying to adapt this solution to other optimizers. `Rc` or `Arc` is a potential option as well. The struct could potentially be generic over `Borrow` to support any of these.
- avoid recomputing the hash/key on every insert/lookup operation
Anyway, I don't want to over-abstract just yet, so for now just build something that works for CSE and then we can take it and see if it can be applied to any of the other optimizations.
I am curious if overriding `hash()` in this way will conflict with the `Hash`/`Eq` property in some unforeseen way. I think as long as we're constructing it such that the `&Expr` is always a reference to the `Expr` that produced the hash, it should be fine.
from arrow-datafusion.
> I like the idea of generalizing the `(u64, &Expr)` struct into something reusable across optimizations.
Honestly, I don't know those referenced use cases, but I feel `(u64, &Expr)` (and any `Identifier` in general) makes sense only for CSE (2 traversals, we can build up a preorder visit cache of `Identifier`s in the first traversal and the second traversal is top-down), and I'm not sure the others have the same characteristics. If they don't, then it doesn't make sense to use `Identifier`s instead of `Expr`/`&Expr`s.
Anyways, I will try to open the PR with it next week and then feel free to generalize the idea for other use cases if it makes sense.
from arrow-datafusion.
> I have a WIP commit here: peter-toth@e844799 that adds `TreeNode::apply_ref()` / `TreeNode::visit_ref()`.
> Using `apply_ref()` in the above example would make it work, but I haven't opened a PR yet as there are a few things to consider:
> ...
> But since I'm only aware of this ticket that requires this change to the APIs, I haven't opened the PR yet.
Here is one example API that I would love to implement with such a tree-node api: #10505
I also ran into an example when trying to find embedded `Subquery`s in an `Expr` in datafusion/datafusion/optimizer/src/scalar_subquery_to_join.rs, lines 54 to 68 in 424757f
from arrow-datafusion.