Is your feature request related to a problem or challenge? Part of

I've opened a draft PR: <a class="issue-link js-issue-link" data-error-text="Failed to

Are there any potential issues with simply using the existing <code class="notranslate

Thanks for the detailed write up <a class="user-mention notranslate" data-hovercard-ty

I have a WIP commit here: <a class="commit-link" data-hovercard-type="com

Make `CommonSubexprEliminate` faster by avoiding the use of strings,about apache/arrow-datafusion

Comments (8)

peter-toth commented on June 9, 2024 2

I've opened a draft PR: #10473 and will try to wrap it up in the following days.

from arrow-datafusion.

peter-toth commented on June 9, 2024 2

Here is one example API that I would love to implement with such a tree-node api: #10505

Thanks for sharing this @alamb. It's good to know that there are other possible usecases for this new API. #10473 seems to pass all tests now. I will extract the first commit of it into a separate PR today or tomorrow to add the new TreeNode API.

from arrow-datafusion.

peter-toth commented on June 9, 2024 1

I'm happy to take this.

from arrow-datafusion.

erratic-pattern commented on June 9, 2024

Are there any potential issues with simply using the existing Hash implementation of Expr to create HashSets?

Serveral other optimization passes use string names as keys for expressions in data structures. I am wondering if any of these could also be refactored to simply use HashSet<Expr> or HashSet<&Expr>

synthetic group by expressions for aggregates:

datafusion/datafusion/expr/src/logical_plan/builder.rs

Lines 1246 to 1270 in accce97

 fn add_group_by_exprs_from_dependencies( 

 mut group_expr: Vec<Expr>, 

 schema: &DFSchemaRef, 

 ) -> Result<Vec<Expr>> { 

 // Names of the fields produced by the GROUP BY exprs for example, `GROUP BY 

 // c1 + 1` produces an output field named `"c1 + 1"` 

 let mut group_by_field_names = group_expr 

 .iter() 

 .map(|e| e.display_name()) 

 .collect::<Result<Vec<_>>>()?; 

 if let Some(target_indices) = 

 get_target_functional_dependencies(schema, &group_by_field_names) 

 { 

 for idx in target_indices { 

 let expr = Expr::Column(Column::from(schema.qualified_field(idx))); 

 let expr_name = expr.display_name()?; 

 if !group_by_field_names.contains(&expr_name) { 

 group_by_field_names.push(expr_name); 

 group_expr.push(expr); 

 } 

 } 

 } 

 Ok(group_expr) 

 }

functional dependencies heavily uses display_name to represent group by exprs:
https://github.com/apache/datafusion/blob/main/datafusion/common/src/functional_dependencies.rs

decorrelate:

datafusion/datafusion/optimizer/src/decorrelate.rs

Line 65 in accce97

pub type ExprResultMap = HashMap<String, Expr>;

push down filter for aggregates:

datafusion/datafusion/optimizer/src/push_down_filter.rs

Lines 788 to 837 in accce97

 // We can push down Predicate which in groupby_expr. 

 let group_expr_columns = agg 

 .group_expr 

 .iter() 

 .map(|e| Ok(Column::from_qualified_name(e.display_name()?))) 

 .collect::<Result<HashSet<_>>>()?; 

 let predicates = split_conjunction_owned(filter.predicate.clone()); 

 let mut keep_predicates = vec![]; 

 let mut push_predicates = vec![]; 

 for expr in predicates { 

 let cols = expr.to_columns()?; 

 if cols.iter().all(|c| group_expr_columns.contains(c)) { 

 push_predicates.push(expr); 

 } else { 

 keep_predicates.push(expr); 

 } 

 } 

 // As for plan Filter: Column(a+b) > 0 -- Agg: groupby:[Column(a)+Column(b)] 

 // After push, we need to replace `a+b` with Column(a)+Column(b) 

 // So we need create a replace_map, add {`a+b` --> Expr(Column(a)+Column(b))} 

 let mut replace_map = HashMap::new(); 

 for expr in &agg.group_expr { 

 replace_map.insert(expr.display_name()?, expr.clone()); 

 } 

 let replaced_push_predicates = push_predicates 

 .iter() 

 .map(|expr| replace_cols_by_name(expr.clone(), &replace_map)) 

 .collect::<Result<Vec<_>>>()?; 

 let child = match conjunction(replaced_push_predicates) { 

 Some(predicate) => LogicalPlan::Filter(Filter::try_new( 

 predicate, 

 agg.input.clone(), 

 )?), 

 None => (*agg.input).clone(), 

 }; 

 let new_agg = filter 

 .input 

 .with_new_exprs(filter.input.expressions(), vec![child])?; 

 match conjunction(keep_predicates) { 

 Some(predicate) => LogicalPlan::Filter(Filter::try_new( 

 predicate, 

 Arc::new(new_agg), 

 )?), 

 None => new_agg, 

 } 

 }

single distinct to group by:

datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs

Lines 69 to 96 in accce97

 LogicalPlan::Aggregate(Aggregate { aggr_expr, .. }) => { 

 let mut fields_set = HashSet::new(); 

 let mut aggregate_count = 0; 

 for expr in aggr_expr { 

 if let Expr::AggregateFunction(AggregateFunction { 

 func_def: AggregateFunctionDefinition::BuiltIn(fun), 

 distinct, 

 args, 

 filter, 

 order_by, 

 null_treatment: _, 

 }) = expr 

 { 

 if filter.is_some() || order_by.is_some() { 

 return Ok(false); 

 } 

 aggregate_count += 1; 

 if *distinct { 

 for e in args { 

 fields_set.insert(e.canonical_name()); 

 } 

 } else if !matches!(fun, Sum | Min | Max) { 

 return Ok(false); 

 } 

 } 

 } 

 Ok(aggregate_count == aggr_expr.len() && fields_set.len() == 1) 

 }

datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs

Line 185 in accce97

&& group_fields_set.insert(args[0].display_name()?)

from arrow-datafusion.

peter-toth commented on June 9, 2024

Are there any potential issues with simply using the existing Hash implementation of Expr to create HashSets?

Serveral other optimization passes use string names as keys for expressions in data structures. I am wondering if any of these could also be refactored to simply use HashSet<Expr> or HashSet<&Expr>

Thanks for these references @erratic-pattern.

Background and general thoughts:

I'm only familiar with CSE code and in its case unfortunately non-unique stringified expression were used as keys of the map that stores the occurrance counts. This bug was introduced in #9871 and reverted in #10396. The issue with these colliding string keys are explained here in details: #10333 (comment).

Some thougths about CSE:

After #10396 we still use stringified expressions as keys (Identifier), but the strings we use encode whole expression subtrees. This is far from optimal and this ticket / my work in progress change would like to help with that.

In case of CSE we could use Expr as keys of the ExprStats map, but then we would need to clone Exprs when we fill up the ExprStats map during the first traversal. This would be particulary costly in CSE because we need to store not only the counts for all top level expressions, but the counts of all their descendant subexpressions.

We could also use &Expr as keys (and so we didn't need to clone the expressions), but there is a problem here. The current TreeNode::apply() / TreeNode::visit() APIs aren't capable to fill up such a HashMap<&Expr, ...> map. This is because of restricted TreeNode reference lifetimes used in closures / TreeNodeVisitor methods.
I.e. this currently doesn't work:

let e = sum((col("a") * (lit(1) - col("b"))) * (lit(1) + col("c")));
let mut m = HashMap::new();
e.apply(|e| {
    *m.entry(e).or_insert(0) += 1;
    Ok(TreeNodeRecursion::Continue)
});

println!("m: {:#?}", m);

This issue can be solved by adding new TreeNode APIs or fixing the current ones.
I have a WIP commit here: peter-toth@e844799 that adds TreeNode::apply_ref() / TreeNode::visit_ref().
Using apply_ref() in the above example would make it work, but I haven't opened a PR yet as there are a few things to consider:
a. We don't really want to add any more new APIs (especially if their puspose is similar to existing ones).
b. We can't change the lifetimes of references in the current apply() / visit() easily. This is because some TreeNode implementations are not compatible with that. (E.g. DynTreeNode doesn't have a method to get references to its children, LogicalPlan creates temprorary objects in its map_subqueries(), ...).

Despite the fact that my WIP commit adds new APIs, I would prefer and lean towards option b.. But since I'm only aware of this ticket that requires this change to the APIs, I haven't opened the PR yet.

Now there is another thing to consider if we want use &Expr as keys of ExprStats. The current CSE algorithm, that was added by the original author of CSE in DataFusion (and not myself), is very clever and does the following:
In the first traversal it:

Creates a mapping for each top level expression (this is called IdArray) that stores the preorder visit index of a node to an Identifier (of a subexpression tree).
And also creates a map (this is called ExprStats) that contains the Identifier -> count stats gathered for all top level expressions and their subexpressions.

This is very nice, because the second, rewriting traversal can use the preorder visit index again to look up the identifier first and then the count from the ExprStats map. Providing that an identifier is small, this can be much faster then using &Expr as keys because:

Computing hash() of an &Expr (instead of using preorder index) in the second traversal is costly if the expression is deep and contains lots of indirections (Boxes).
When we generate the identifiers in the first traversal we can use the traversal's bottom-up phase to build up identifiers from the current node and the identifiers of the node's children very effectively.

In my work in progress change for this issue I would like to finalize the:

TreeNode API changes required (maybe open a separate PR for it)
and replce the current String based identifier to a (u64, &Expr) like tuple/struct.
The first item contains a precomputed hash of the identifier. (As I mentioned, we can use the bottom-up phase of the first traversal to compute that effectively since this logic is already implmeneted in the CSE algorithm.) The overriden hash() of the struct should return this precomputed hash.
The second item is a &Expr that can be used in the struct's eq() implementation in case of hash collision.

Back to the original question of using `HashSet<Expr, ...>` or `HashSet<&Expr, ...>`:

I think both are accepable but CSE is special as the maps need to store all the descendant subexpressions as well and the impemented CSE algorithm seems to offer a way to implement a better identifier than just a simple &Expr.
I don't know the other referenced usecases but if collision of string names can happen there then we should definitely fix it.

from arrow-datafusion.

erratic-pattern commented on June 9, 2024

Thanks for the detailed write up @peter-toth . Though I did mention HashSet<Expr> specifically, my suggestion more generally goes along the lines of using the Hash implementation in some way to produce the identifiers. After looking at the code a bit more, I do see the cloning/lifetime issues with using Expr or &Expr as keys directly. I also did not consider the cost of re-computing hashes. I do think in that case it does make sense to pre-compute the hash instead.

I like the idea of generalizing the (u64, &Expr) struct into something reuseable across optimizations, as it seems to be a common pattern where we need to:

produce some unique identifier for an expression that can be stored in a data structure
use that identifier to generate aliases for newly generated expressions, or create a new Column/Field somewhere with that expression as a name. this can be done thanks to the &Expr in the struct which would allow us to call display_name
do so in a way that doesn't conflict with ownership/borrowing semantics. we might still run into borrowing issues because of the &Expr reference, but it's hard to say without trying to adapt this solution to other optimizers. Rc or Arc is a potential option as well. The struct could potentially be generic over Borrow to support any of these.
avoid recomputing the hash/key on every insert/lookup operation

Anyway, I don't want to over-abstract just yet, so for now just build something that works for CSE and then we can take it and see if it can be applied to any of the other optimizations.

I am curious if overriding hash() in this way will conflict with the Hash Eq property in some unforseen way. I think as long as we're constructing it such that the &Expr is always a reference to the Expr that produced the hash, it should be fine.

from arrow-datafusion.

peter-toth commented on June 9, 2024

I like the idea of generalizing the (u64, &Expr) struct into something reuseable across optimizations.

Honestly, I don't know those referenced usecases, but I feel (u64, &Expr) (and any Identifier in general) makes sense only for CSE (2 traversals, we can build up a preorder visit cache of Identifiers in the first traversal and second traversal is top-down) and not sure the others have the same characteristics... If that's not the case then it doesn't make sense to use Identifiers instead of Expr/&Exprs.

Anyways, I will try to open the PR with it next week and then feel free to generalize the idea for other usecases if it makes sense.

from arrow-datafusion.

alamb commented on June 9, 2024

I have a WIP commit here: peter-toth@e844799 that adds TreeNode::apply_ref() / TreeNode::visit_ref().
Using apply_ref() in the above example would make it work, but I haven't opened a PR yet as there are a few things to consider:
...
But since I'm only aware of this ticket that requires this change to the APIs, I haven't opened the PR yet.

Here is one example API that I would love to implement with such a tree-node api: #10505

I also ran into an example when trying to find embedded Subquerys in an Expr in

datafusion/datafusion/optimizer/src/scalar_subquery_to_join.rs

Lines 54 to 68 in 424757f

 fn extract_subquery_exprs( 

 &self, 

 predicate: &Expr, 

 alias_gen: Arc<AliasGenerator>, 

 ) -> Result<(Vec<(Subquery, String)>, Expr)> { 

 let mut extract = ExtractScalarSubQuery { 

 sub_query_info: vec![], 

 alias_gen, 

 }; 

 predicate 

 .clone() 

 .rewrite(&mut extract) 

 .data() 

 .map(|new_expr| (extract.sub_query_info, new_expr)) 

 }

from arrow-datafusion.

Make `CommonSubexprEliminate` faster by avoiding the use of strings about arrow-datafusion HOT 8 OPEN

Comments (8)

Background and general thoughts:

Some thougths about CSE:

Back to the original question of using `HashSet<Expr, ...>` or `HashSet<&Expr, ...>`:

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs

	fn add_group_by_exprs_from_dependencies(
	mut group_expr: Vec<Expr>,
	schema: &DFSchemaRef,
	) -> Result<Vec<Expr>> {
	// Names of the fields produced by the GROUP BY exprs for example, `GROUP BY
	// c1 + 1` produces an output field named `"c1 + 1"`
	let mut group_by_field_names = group_expr
	.iter()
	.map(\|e\| e.display_name())
	.collect::<Result<Vec<_>>>()?;

	if let Some(target_indices) =
	get_target_functional_dependencies(schema, &group_by_field_names)
	{
	for idx in target_indices {
	let expr = Expr::Column(Column::from(schema.qualified_field(idx)));
	let expr_name = expr.display_name()?;
	if !group_by_field_names.contains(&expr_name) {
	group_by_field_names.push(expr_name);
	group_expr.push(expr);
	}
	}
	}
	Ok(group_expr)
	}

	// We can push down Predicate which in groupby_expr.
	let group_expr_columns = agg
	.group_expr
	.iter()
	.map(\|e\| Ok(Column::from_qualified_name(e.display_name()?)))
	.collect::<Result<HashSet<_>>>()?;

	let predicates = split_conjunction_owned(filter.predicate.clone());

	let mut keep_predicates = vec![];
	let mut push_predicates = vec![];
	for expr in predicates {
	let cols = expr.to_columns()?;
	if cols.iter().all(\|c\| group_expr_columns.contains(c)) {
	push_predicates.push(expr);
	} else {
	keep_predicates.push(expr);
	}
	}

	// As for plan Filter: Column(a+b) > 0 -- Agg: groupby:[Column(a)+Column(b)]
	// After push, we need to replace `a+b` with Column(a)+Column(b)
	// So we need create a replace_map, add {`a+b` --> Expr(Column(a)+Column(b))}
	let mut replace_map = HashMap::new();
	for expr in &agg.group_expr {
	replace_map.insert(expr.display_name()?, expr.clone());
	}
	let replaced_push_predicates = push_predicates
	.iter()
	.map(\|expr\| replace_cols_by_name(expr.clone(), &replace_map))
	.collect::<Result<Vec<_>>>()?;

	let child = match conjunction(replaced_push_predicates) {
	Some(predicate) => LogicalPlan::Filter(Filter::try_new(
	predicate,
	agg.input.clone(),
	)?),
	None => (*agg.input).clone(),
	};
	let new_agg = filter
	.input
	.with_new_exprs(filter.input.expressions(), vec![child])?;
	match conjunction(keep_predicates) {
	Some(predicate) => LogicalPlan::Filter(Filter::try_new(
	predicate,
	Arc::new(new_agg),
	)?),
	None => new_agg,
	}
	}

	LogicalPlan::Aggregate(Aggregate { aggr_expr, .. }) => {
	let mut fields_set = HashSet::new();
	let mut aggregate_count = 0;
	for expr in aggr_expr {
	if let Expr::AggregateFunction(AggregateFunction {
	func_def: AggregateFunctionDefinition::BuiltIn(fun),
	distinct,
	args,
	filter,
	order_by,
	null_treatment: _,
	}) = expr
	{
	if filter.is_some() \|\| order_by.is_some() {
	return Ok(false);
	}
	aggregate_count += 1;
	if *distinct {
	for e in args {
	fields_set.insert(e.canonical_name());
	}
	} else if !matches!(fun, Sum \| Min \| Max) {
	return Ok(false);
	}
	}
	}
	Ok(aggregate_count == aggr_expr.len() && fields_set.len() == 1)
	}

	fn extract_subquery_exprs(
	&self,
	predicate: &Expr,
	alias_gen: Arc<AliasGenerator>,
	) -> Result<(Vec<(Subquery, String)>, Expr)> {
	let mut extract = ExtractScalarSubQuery {
	sub_query_info: vec![],
	alias_gen,
	};
	predicate
	.clone()
	.rewrite(&mut extract)
	.data()
	.map(\|new_expr\| (extract.sub_query_info, new_expr))
	}

Comments (8)

Background and general thoughts:

Some thougths about CSE:

Back to the original question of using HashSet<Expr, ...> or HashSet<&Expr, ...>:

Related Issues (20)

Recommend Projects

Recommend Topics

Recommend Org

Jobs

Back to the original question of using `HashSet<Expr, ...>` or `HashSet<&Expr, ...>`: