Add a second, optional parameter to APPROX_COUNT_DISTINCT</c

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Make `APPROX_COUNT_DISTINCT` error configurable about heavydb HOT 17 CLOSED

heavyai commented on July 20, 2024

Make `APPROX_COUNT_DISTINCT` error configurable

from heavydb.

Comments (17)

shahsyed-pcln commented on July 20, 2024

Hi @asuhan, does a starter tag imply that anyone just taking a look at the code base could possibly look into fixing this?

from heavydb.

asuhan commented on July 20, 2024

Yes, it does. Ideally, people working on a task should announce their intention to avoid duplication of effort.

Here's what this task involves:

Changes to the ApproxCountDistinct operator (in the MapDSqlOperatorTable class) to make it accept an additional, optional parameter for the desired relative error. Check PgILike or RegexpLike for an example of how to achieve that (the escape character for these two is optional).
Add the desired relative error as an additional field of Analyzer::AggExpr and use it in GroupByAndAggregate::initCountDistinctDescriptors instead of HLL_MASK_WIDTH.

from heavydb.

Smyatkin-Maxim commented on July 20, 2024

May I take this one?

from heavydb.

asuhan commented on July 20, 2024

@Smyatkin-Maxim Yes, that’d be great. Let me know if you need help.

from heavydb.

billmaimone commented on July 20, 2024

yes, this is good.

from heavydb.

Smyatkin-Maxim commented on July 20, 2024

@asuhan
Tried to understand it today. Either I'm missing some easy way to do it, or it's a bit more work than just updating code in the three places, because there are no other aggregates working with multiple arguments.
The changes in calcite are straightforward. The changes in C++ not that much. If I understand it correctly, I should find a way to parse json generated by calcite in parse_aggregate_expr (RelAlgAbstractInterpreter.cpp) to get the optional parameter into RexAgg and then forward it to AggExpr? I also see there are other code paths creating AggExpr instances, which I also should handle.

Btw, should the optional argument be Integer (e.g., 2%) or Decimal (e.g., 0.02 error rate)?
Will try to implement it tomorrow.

from heavydb.

asuhan commented on July 20, 2024

Yes, you have to start from parse_aggregate_expr and relax the operands requirements a little bit to account for the fact that we have two parameters now. You should only allow for an additional literal parameter for APPROX_COUNT_DISTINCT and store it in RexAgg and AggExpr. RelAlgTranslator::translateAggregateRex is the only place which really instantiates AggExpr, the other occurrences either just copy it (ExpressionRewrite.cpp, CalciteAdapter.cpp) or synthesize a COUNT aggregate for a cardinality query (in RelAlgExecutor.cpp). The same additional information needs to be stored in the TargetInfo structure and stored there by target_info in ResultRows.h.

Regarding the format of the argument, let's make it percent (2%).

from heavydb.

Smyatkin-Maxim commented on July 20, 2024

Ok, thanks

from heavydb.

Smyatkin-Maxim commented on July 20, 2024

As far as I understand, by design I should not (and can not currently) access the literal at parse_aggregate_expr, because it's not part of an aggregate expression - it's part of input arguments. I should resolve it only in RelAlgTranslator::translateAggregateRex from scalar_sources. But for some reason optimizer decides that I don't need this literal and removes it in eliminate_dead_columns. Which means that I won't see it later in scalar_sources.
So, while I could parse it in parse_aggregate_expr into AggRex, it feels like a dirty hack, because it's not the place where operands should be resolved. So I probably have to come up with a fix to eliminate_dead_columns I suppose?

from heavydb.

Smyatkin-Maxim commented on July 20, 2024

Or maybe this little hack is better than updating optimizer for the corner case of aggregate with two arguments. @asuhan , which way would you advice?

from heavydb.

asuhan commented on July 20, 2024

@Smyatkin-Maxim I don't understand your comment about parse_aggregate_expr. It returns a full aggregate expression which carries all the information: the type of the aggregate and the arguments. For APPROX_COUNT_DISTINCT, the first argument is handled just like we do for all aggregates, you can leave it alone. The second argument becomes an additional integer / float (depends on your choice) field of RexAgg; unlike the first argument, it's not a scalar source. You don't need to change eliminate_dead_columns.

from heavydb.

Smyatkin-Maxim commented on July 20, 2024

Ok, perhaps optimizer works as it should - it's not something that can be learned in two days :)
I understand what parse_aggregate_expr does. I mean that Calcite doesn't make this literal part of aggregate expression, so parse_aggregate_expr can't see it.
For example, if we consider this query:

select approx_count_distinct(tree_dbh, 2) from nyc_trees_2015_683k;

it sees only this part of input JSON string:

    {
      "id": "2",
      "relOp": "LogicalAggregate",
      "fields": [
        "EXPR$0"
      ],
      "group": [],
      "aggs": [
        {
          "agg": "APPROX_COUNT_DISTINCT",
          "type": {
            "type": "BIGINT",
            "nullable": false
          },
          "distinct": false,
          "operands": [
            0,
            1
          ]
        }
      ]
    }

While I need to access this part:

    {
      "id": "1",
      "relOp": "LogicalProject",
      "fields": [
        "tree_dbh",
        "$f1"
      ],
      "exprs": [
        {
          "input": 4
        },
        {
          "literal": 2,
          "type": "DECIMAL",
          "target_type": "INTEGER",
          "scale": 0,
          "precision": 1,
          "type_scale": 0,
          "type_precision": 10
        }
      ]
    }

For the first argument this is being resolved later in RelAlgTranslator::translateAggregateRex, but the new argument is getting optimized out by the moment. So I'm asking if it's fine if I let parse_aggregate_expr access this part of query. I'll have to change it's signature to something like this:

std::unique_ptr<const RexAgg> parse_aggregate_expr(const rapidjson::Value& expr,
                                                   const std::vector<std::shared_ptr<const RelAlgNode>>& inputs) {

from heavydb.

asuhan commented on July 20, 2024

Ok, I now see the problem.The dead column elimination doesn't look at the second argument because it cannot -- RexAgg::getOperand() doesn't take a position argument and only returns the first operand. Fortunately, we only call it from two places in the optimizer, so you can break the signature and then fix the call sites in get_live_ins and renumber_rex_aggs. Both are straightforward: the first one is purely additive (just add the second argument to the live set), the second doesn't really matter (no need for renumbering a literal operand).

from heavydb.

Smyatkin-Maxim commented on July 20, 2024

@asuhan Ok, just to make sure that I got you right:

I don't change parse_aggregate_expr signature, just add the second operand support.
I do change signature of RexAgg::getOperand()
I do change optimizer a little bit in places you've just named to manage multiple arguments.

from heavydb.

asuhan commented on July 20, 2024

Yup, I think this is the best course of action -- should work smoothly, if it doesn't we'll have a closer look.

from heavydb.

Smyatkin-Maxim commented on July 20, 2024

I've created a pull request for review. I'd like you to have a closer look into these things:

Did I pick the correct formula for bitmap size calculation from relative error? I'm not 100% sure about it.
I see there is FunctionRef::analyze which also creates an AggExpr object. But it doesn't consider kAPPROXIMATE_COUNT_DISTINCT aggregates at all. So, I didn't update it for approx_count. As far as I understand, it's from some old parser before Calcite and isn't being used anymore? Or am I wrong here?

from heavydb.

asuhan commented on July 20, 2024

Thanks, I'll have a look tomorrow. Yes, FunctionRef::analyze is effectively dead code we should delete, don't worry about it.

from heavydb.

Make `APPROX_COUNT_DISTINCT` error configurable about heavydb HOT 17 CLOSED

Comments (17)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs