
dqe_idx_pg

Dalmatiner query engine indexer module for Postgres.

Setting up the schema

Creating a new database

It is easiest to create a new database with the up-to-date schema.

$ createdb metric_metadata;
$ psql metric_metadata < priv/schema.sql

Migrating schema from version <= 0.3.6

After version 0.3.6 a significant schema change was introduced that requires a manual SQL migration.

You need to prepare the database to use the new column and copy the data over. Before running the migration, stop any processes writing to the database (processes that only read data can keep running).

CREATE EXTENSION hstore;
ALTER TABLE metrics ADD COLUMN dimensions hstore;
UPDATE metrics AS m SET dimensions = (
  SELECT hstore(array_agg(d.namespace || ':' || d.name), array_agg(d.value))
    FROM dimensions AS d
    WHERE d.metric_id = m.id
    GROUP BY d.metric_id)
  WHERE dimensions IS NULL;
ALTER INDEX metrics_idx RENAME TO metrics_collection_metric_bucket_key_idx;
CREATE INDEX CONCURRENTLY ON metrics USING btree(collection, akeys(dimensions));
CREATE INDEX CONCURRENTLY ON metrics USING btree(collection, metric, akeys(dimensions));
CREATE INDEX CONCURRENTLY ON metrics USING GIST (dimensions);
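
Before moving on, it can be worth sanity-checking that the copy succeeded. The query below is only a sketch: it counts metrics that still have no hstore dimensions even though rows for them exist in the old dimensions table, which should be zero after a complete migration.

-- Optional sanity check (sketch): metrics whose dimensions were not copied
-- although entries exist in the old dimensions table. Expect a count of 0.
SELECT count(*)
  FROM metrics AS m
  WHERE m.dimensions IS NULL
    AND EXISTS (SELECT 1 FROM dimensions AS d WHERE d.metric_id = m.id);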

Now upgrade all applications to the most recent version. Start the processes that write to the database as soon as they are upgraded. You will also need to restart all remaining processes so that they start using the new version of the library.

Once you have upgraded your code and made sure everything is working, you can clean up the parts of the old schema that are no longer used.

DROP TABLE dimensions;
ALTER TABLE metrics DROP COLUMN id;
DROP INDEX metrics_idx_collection;
DROP INDEX metrics_idx_collection_metric;
DROP INDEX metrics_idx_id_collection_metric;
DROP INDEX metrics_idx_metric;

Build

$ rebar3 compile

Running EQC tests

Tests are included with this application that verify the syntactic correctness of all SQL statements used with the index. In addition, these SQL statements are verified against the latest schema included in priv/schema.sql.

A working Postgres installation, version 9.1 or above, is required in order to run the tests. The scripts included in the priv directory set up an isolated instance of Postgres in the datadir directory.

./priv/setup_test_db.sh # This sets up an installation of Postgres in datadir/

rebar3 as eqc eqc # Runs the tests

./priv/stop_test_db.sh # This stops the instance of Postgres in datadir/

Alternatively, use make to run all of the above:

make eqc-test # Runs setup, EQC and teardown scripts


dqe_idx_pg's Issues

Multi-part expand query is very slow

When we run a query with multiple globs, expanding the metrics takes very long, about 100 times longer than for queries expanding only one glob.

For example, the Dalmatiner query

SELECT 
  sum(sum('5c7df5ed-5ba9-4d11-bbfe-b99a63e0054e'.'stats'.'counters'.'web'.'login'.*.'successful'.'count' BUCKET '5c'), 5m) AS 'Successful Logins',
  sum(sum('5c7df5ed-5ba9-4d11-bbfe-b99a63e0054e'.'stats'.'counters'.'web'.'register'.*.'successful'.'count' BUCKET '5c'), 5m) AS 'Successful Registers'
  BETWEEN 1468837800 AND now

will trigger the following SQL query to expand the wildcards:

SELECT DISTINCT key FROM metrics WHERE bucket = '5c' AND
((id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'key_length' AND value = '8') AND
     (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'ddb' AND value = '5c7df5ed-5ba9-4d11-bbfe-b99a63e0054e') AND
         (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_2' AND value = 'stats') AND
             (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_3' AND value = 'counters') AND
                 (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_4' AND value = 'web') AND
                     (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_5' AND value = 'register') AND
                         (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_7' AND value = 'successful') AND
                             id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_8' AND value = 'count'))))))))
 OR
 (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'key_length' AND value = '8') AND
     (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_1' AND value = '5c7df5ed-5ba9-4d11-bbfe-b99a63e0054e') AND
         (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_2' AND value = 'stats') AND
             (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_3' AND value = 'counters') AND
                 (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_4' AND value ='web') AND
                     (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_5' AND value = 'login') AND
                         (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_7' AND value = 'successful') AND
                             id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_8' AND value = 'count')))))))));

This query takes over 4 seconds to run.

Strangely, if I run just half of this query, like:

SELECT DISTINCT key FROM metrics WHERE bucket = '5c' AND
  (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'key_length' AND value = '8') AND
     (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_1' AND value = '5c7df5ed-5ba9-4d11-bbfe-b99a63e0054e') AND
         (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_2' AND value = 'stats') AND
             (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_3' AND value = 'counters') AND
                 (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_4' AND value = 'web') AND
                     (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_5' AND value = 'register') AND
                         (id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_7' AND value = 'successful') AND
                             id IN (SELECT metric_id FROM dimensions WHERE  namespace = 'ddb' AND name = 'part_8' AND value = 'count'))))))));

it takes just 20 ms.
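
A possible workaround (only a sketch, not verified against this dataset, and shortened to two dimension predicates per glob) would be to express the two OR branches as a UNION, so the planner can optimize each branch independently and keep the fast single-branch timing; the remaining part_N predicates would be added to each branch in the same way.

-- Sketch only: two of the part_N predicates per glob are shown, the rest
-- follow the same pattern.
SELECT DISTINCT key FROM metrics WHERE bucket = '5c'
  AND id IN (SELECT metric_id FROM dimensions WHERE namespace = 'ddb' AND name = 'part_4' AND value = 'web')
  AND id IN (SELECT metric_id FROM dimensions WHERE namespace = 'ddb' AND name = 'part_5' AND value = 'login')
UNION
SELECT DISTINCT key FROM metrics WHERE bucket = '5c'
  AND id IN (SELECT metric_id FROM dimensions WHERE namespace = 'ddb' AND name = 'part_4' AND value = 'web')
  AND id IN (SELECT metric_id FROM dimensions WHERE namespace = 'ddb' AND name = 'part_5' AND value = 'register');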

Add a method to traverse the metric tree without fetching the whole list

Some collections may end up with a very big metric tree, containing tens or even hundreds of thousands of metrics. For those collections, fetching and sending all metrics over the network is very slow.

I think it would be useful to add indexer methods that allow fetching one level of metrics at a time. We tend to present metrics as a tree in the UI anyway, so having an API to pull the children of a single node would be a natural extension of that approach.
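
As a rough illustration of what such a method could translate to in SQL (a sketch only; it assumes the metric path is stored as a text[] column, and the collection name and prefix ['base','cpu'] are hypothetical), fetching the direct children of one node might look like:

-- Sketch: direct children of the hypothetical prefix ['base','cpu'],
-- assuming metric is a text[] column.
SELECT DISTINCT metric[3] AS child
  FROM metrics
  WHERE collection = 'my_collection'
    AND metric[1:2] = ARRAY['base','cpu'];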

`JOIN` and `AND`

The naive approach to building the query has a problem when it comes to 'AND'.

Generating a query from this:

{<<"fifo">>, dproto:metric_from_list([<<"action">>, <<"count">>]), {'and', {<<"host">>, <<"[email protected]">>}, {<<"service">>, <<"sniffle">>}}}

results in the following SQL statement:

SELECT DISTINCT bucket, key FROM metrics JOIN tags ON tags.metric_id = metrics.id WHERE collection = $1 AND metric = $2 AND ((name = $3 AND value = $4) AND (name = $5 AND value = $6))

This will always return an empty set, as the AND applies to the same record, which can never be true (we ask for name to be both $3 AND $5, which is a constant false).

Not yet sure how best to solve this; we might have to adjust how the query is generated.
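
One possible direction (a sketch, not the actual fix) is to require a separate match per tag instead of applying both name/value pairs to the same joined row, using one subquery per tag, in the same style as the dimensions subqueries above:

-- Sketch of one possible rewrite: each tag predicate gets its own subquery,
-- so the two conditions can be satisfied by different tag rows of the same metric.
SELECT DISTINCT bucket, key
  FROM metrics
  WHERE collection = $1 AND metric = $2
    AND id IN (SELECT metric_id FROM tags WHERE name = $3 AND value = $4)
    AND id IN (SELECT metric_id FROM tags WHERE name = $5 AND value = $6);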

metrics/1 needs to return the namespace

Currently the function only returns metrics. This is problematic, as it is impossible to figure out the namespace of the metrics that way:

metrics(Collection) when is_binary(Collection) ->

This also requires changes in the dqe_idx spec.

@szarsti since it was your addition, I'll assign this to you, as you probably need to update your internal code as well.

Consider using the Postgres ltree module for metric paths

I would like to suggest some research into the benefits of switching the metric path from an array to the 'label tree' data structure provided by an optional PostgreSQL module. That module provides extensive facilities for searching through label trees, which could be especially useful for pulling one level of metrics (metric tree traversal).

If that lands in the indexer, we could later easily build full-blown path filtering on top of the querying facilities provided by the ltree module.
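
For illustration, a minimal sketch of what that could look like (the column name path and the sample labels are hypothetical, not the project's current schema): pulling the direct children of base.cpu reduces to an ltree prefix match plus a level check.

-- Hypothetical sketch: "path" as an ltree column, prefix base.cpu.
CREATE EXTENSION IF NOT EXISTS ltree;

SELECT DISTINCT subpath(path, 2, 1) AS child
  FROM metrics
  WHERE path <@ 'base.cpu'::ltree
    AND nlevel(path) > 2;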
