Comments (17)
Hi @asuhan, does the starter tag imply that anyone just taking a look at the code base could look into fixing this?
from heavydb.
Yes, it does. Ideally, people working on a task should announce their intention to avoid duplication of effort.
Here's what this task involves:
- Changes to the `ApproxCountDistinct` operator (in the `MapDSqlOperatorTable` class) to make it accept an additional, optional parameter for the desired relative error. Check `PgILike` or `RegexpLike` for an example of how to achieve that (the escape character for these two is optional).
- Add the desired relative error as an additional field of `Analyzer::AggExpr` and use it in `GroupByAndAggregate::initCountDistinctDescriptors` instead of `HLL_MASK_WIDTH`.
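For the `HLL_MASK_WIDTH` replacement, the standard HyperLogLog error bound (relative error of roughly 1.04 / sqrt(m) with m = 2^b registers) pins down the bitmap width for a requested error. A minimal sketch, with a hypothetical helper name rather than HeavyDB's actual code:

```cpp
#include <cassert>
#include <cmath>

// Sketch only (not HeavyDB's actual code): derive the HyperLogLog bitmap
// width b from a desired relative error given in percent. Since the HLL
// standard error is ~1.04 / sqrt(m) with m = 2^b registers, solving for b:
//   b = ceil(log2((1.04 / e)^2))   where e is the error rate (e.g. 0.02)
inline int hll_bits_for_error(double error_percent) {
  const double e = error_percent / 100.0;    // 2 (%) -> 0.02 error rate
  const double m = std::pow(1.04 / e, 2.0);  // required register count
  return static_cast<int>(std::ceil(std::log2(m)));
}
```

For a 2% target this gives 12 bits (4096 registers), which is why a fixed mask width only makes sense as a default, not as the answer for every requested precision.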
May I take this one?
@Smyatkin-Maxim Yes, that’d be great. Let me know if you need help.
yes, this is good.
@asuhan
Tried to understand it today. Either I'm missing some easy way to do it, or it's a bit more work than just updating code in the three places, because there are no other aggregates that work with multiple arguments.
The changes in Calcite are straightforward; the changes in C++, not so much. If I understand it correctly, I should find a way to parse the JSON generated by Calcite in `parse_aggregate_expr` (`RelAlgAbstractInterpreter.cpp`) to get the optional parameter into `RexAgg` and then forward it to `AggExpr`? I also see there are other code paths creating `AggExpr` instances, which I should also handle.
Btw, should the optional argument be an integer (e.g., 2%) or a decimal (e.g., a 0.02 error rate)?
Will try to implement it tomorrow.
Yes, you have to start from `parse_aggregate_expr` and relax the operand requirements a little bit to account for the fact that we have two parameters now. You should only allow an additional literal parameter for `APPROX_COUNT_DISTINCT` and store it in `RexAgg` and `AggExpr`. `RelAlgTranslator::translateAggregateRex` is the only place which really instantiates `AggExpr`; the other occurrences either just copy it (`ExpressionRewrite.cpp`, `CalciteAdapter.cpp`) or synthesize a `COUNT` aggregate for a cardinality query (in `RelAlgExecutor.cpp`). The same additional information needs to be stored in the `TargetInfo` structure, where it is populated by `target_info` in `ResultRows.h`.
Regarding the format of the argument, let's make it a percentage (2%).
Ok, thanks
As far as I understand, by design I should not (and currently cannot) access the literal in `parse_aggregate_expr`, because it's not part of the aggregate expression; it's part of the input arguments. I should resolve it only in `RelAlgTranslator::translateAggregateRex` from `scalar_sources`. But for some reason the optimizer decides that I don't need this literal and removes it in `eliminate_dead_columns`, which means I won't see it later in `scalar_sources`.
So, while I could parse it in `parse_aggregate_expr` into `RexAgg`, it feels like a dirty hack, because that's not the place where operands should be resolved. So I probably have to come up with a fix to `eliminate_dead_columns`, I suppose?
Or maybe this little hack is better than updating the optimizer for the corner case of an aggregate with two arguments. @asuhan, which way would you advise?
@Smyatkin-Maxim I don't understand your comment about `parse_aggregate_expr`. It returns a full aggregate expression which carries all the information: the type of the aggregate and the arguments. For `APPROX_COUNT_DISTINCT`, the first argument is handled just like we do for all aggregates; you can leave it alone. The second argument becomes an additional integer / float (depending on your choice) field of `RexAgg`; unlike the first argument, it's not a scalar source. You don't need to change `eliminate_dead_columns`.
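A minimal sketch of that shape, with class and field names as hypothetical stand-ins rather than the real `RexAgg` definition: the literal rides along as a plain integer field instead of a scalar source.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative sketch (not HeavyDB's actual class): an aggregate expression
// whose optional APPROX_COUNT_DISTINCT error argument is stored directly as
// a field, so the optimizer never sees it as a column reference.
class RexAggSketch {
 public:
  RexAggSketch(std::vector<size_t> operands, int error_percent)
      : operands_(std::move(operands)), error_percent_(error_percent) {}

  // The first operand stays a scalar source, like any other aggregate input.
  size_t getOperand(size_t i) const { return operands_.at(i); }
  size_t size() const { return operands_.size(); }

  // Hypothetical accessor: -1 would mean the user supplied no error
  // argument and the default precision applies.
  int getErrorPercent() const { return error_percent_; }

 private:
  std::vector<size_t> operands_;
  int error_percent_;
};
```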
Ok, perhaps the optimizer works as it should; it's not something that can be learned in two days :)
I understand what `parse_aggregate_expr` does. I mean that Calcite doesn't make this literal part of the aggregate expression, so `parse_aggregate_expr` can't see it.
For example, if we consider this query:
select approx_count_distinct(tree_dbh, 2) from nyc_trees_2015_683k;
it sees only this part of the input JSON string:
```json
{
  "id": "2",
  "relOp": "LogicalAggregate",
  "fields": ["EXPR$0"],
  "group": [],
  "aggs": [
    {
      "agg": "APPROX_COUNT_DISTINCT",
      "type": {
        "type": "BIGINT",
        "nullable": false
      },
      "distinct": false,
      "operands": [0, 1]
    }
  ]
}
```
While I need to access this part:
```json
{
  "id": "1",
  "relOp": "LogicalProject",
  "fields": ["tree_dbh", "$f1"],
  "exprs": [
    {
      "input": 4
    },
    {
      "literal": 2,
      "type": "DECIMAL",
      "target_type": "INTEGER",
      "scale": 0,
      "precision": 1,
      "type_scale": 0,
      "type_precision": 10
    }
  ]
}
```
For the first argument this is resolved later in `RelAlgTranslator::translateAggregateRex`, but the new argument has already been optimized out by that point. So I'm asking if it's fine to let `parse_aggregate_expr` access this part of the query. I'll have to change its signature to something like this:

```cpp
std::unique_ptr<const RexAgg> parse_aggregate_expr(
    const rapidjson::Value& expr,
    const std::vector<std::shared_ptr<const RelAlgNode>>& inputs);
```
Ok, I now see the problem.The dead column elimination doesn't look at the second argument because it cannot -- RexAgg::getOperand()
doesn't take a position argument and only returns the first operand. Fortunately, we only call it from two places in the optimizer, so you can break the signature and then fix the call sites in get_live_ins
and renumber_rex_aggs
. Both are straightforward: the first one is purely additive (just add the second argument to the live set), the second doesn't really matter (no need for renumbering a literal operand).
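The additive fix to the live-set computation could look roughly like this, with hypothetical names standing in for `RexAgg::getOperand` and `get_live_ins`:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Sketch only, not HeavyDB's real code. The old getOperand() returned just
// the first operand; generalizing it to take a position lets the optimizer
// see every operand of the aggregate.
struct AggSketch {
  std::vector<size_t> operands;
  size_t getOperand(size_t i) const { return operands.at(i); }
  size_t size() const { return operands.size(); }
};

// Dead-column elimination, additive fix: every operand (including the new
// literal argument of APPROX_COUNT_DISTINCT) joins the live set, so the
// column feeding the error argument is no longer dropped.
std::unordered_set<size_t> live_ins(const std::vector<AggSketch>& aggs) {
  std::unordered_set<size_t> live;
  for (const auto& agg : aggs) {
    for (size_t i = 0; i < agg.size(); ++i) {
      live.insert(agg.getOperand(i));
    }
  }
  return live;
}
```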
@asuhan Ok, just to make sure that I got you right:
- I don't change the `parse_aggregate_expr` signature, just add support for a second operand.
- I do change the signature of `RexAgg::getOperand()`.
- I do change the optimizer a little bit in the places you've just named to handle multiple arguments.
Yup, I think this is the best course of action -- it should work smoothly; if it doesn't, we'll have a closer look.
I've created a pull request for review. I'd like you to have a closer look at these things:
- Did I pick the correct formula for calculating the bitmap size from the relative error? I'm not 100% sure about it.
- I see there is `FunctionRef::analyze`, which also creates an `AggExpr` object, but it doesn't consider `kAPPROXIMATE_COUNT_DISTINCT` aggregates at all, so I didn't update it for approx count distinct. As far as I understand, it's from some old parser that predates Calcite and isn't used anymore? Or am I wrong here?
Thanks, I'll have a look tomorrow. Yes, `FunctionRef::analyze` is effectively dead code we should delete; don't worry about it.