
pydruid

pydruid exposes a simple API to create, execute, and analyze Druid queries. pydruid can parse query results into Pandas DataFrame objects for subsequent data analysis -- this offers a tight integration between Druid, the SciPy stack (for scientific computing) and scikit-learn (for machine learning). pydruid can export query results into TSV or JSON for further processing with your favorite tool, e.g., R, Julia, Matlab, Excel. It provides both synchronous and asynchronous clients.

Additionally, pydruid implements the Python DB API 2.0, a SQLAlchemy dialect, and provides a command line interface to interact with Druid.

To install:

pip install pydruid
# or, if you intend to use asynchronous client
pip install pydruid[async]
# or, if you intend to export query results into pandas
pip install pydruid[pandas]
# or, if you intend to do both
pip install pydruid[async, pandas]
# or, if you want to use the SQLAlchemy engine
pip install pydruid[sqlalchemy]
# or, if you want to use the CLI
pip install pydruid[cli]

Documentation: https://pythonhosted.org/pydruid/.

examples

The following examples show how to execute and analyze the results of three types of queries: timeseries, topN, and groupby. We will use these queries to ask simple questions about Twitter's public data set.

timeseries

What was the average tweet length, per day, surrounding the 2014 Sochi Olympics?

from pydruid.client import *
# doublesum, Dimension, and Field live in pydruid.utils and are not
# re-exported by pydruid.client, so import them explicitly:
from pydruid.utils.aggregators import doublesum
from pydruid.utils.filters import Dimension
from pydruid.utils.postaggregator import Field
from pylab import plt

query = PyDruid(druid_url_goes_here, 'druid/v2')

ts = query.timeseries(
    datasource='twitterstream',
    granularity='day',
    intervals='2014-02-02/p4w',
    aggregations={'length': doublesum('tweet_length'), 'count': doublesum('count')},
    post_aggregations={'avg_tweet_length': (Field('length') / Field('count'))},
    filter=Dimension('first_hashtag') == 'sochi2014'
)
df = query.export_pandas()
df['timestamp'] = df['timestamp'].map(lambda x: x.split('T')[0])
df.plot(x='timestamp', y='avg_tweet_length', ylim=(80, 140), rot=20,
        title='Sochi 2014')
plt.ylabel('avg tweet length (chars)')
plt.show()

[figure: average tweet length per day, Sochi 2014]

topN

Who were the top ten mentions (@user_name) during the 2014 Oscars?

top = query.topn(
    datasource='twitterstream',
    granularity='all',
    intervals='2014-03-03/p1d',  # utc time of 2014 oscars
    aggregations={'count': doublesum('count')},
    dimension='user_mention_name',
    filter=(Dimension('user_lang') == 'en') & (Dimension('first_hashtag') == 'oscars') &
           (Dimension('user_time_zone') == 'Pacific Time (US & Canada)') &
           ~(Dimension('user_mention_name') == 'No Mention'),
    metric='count',
    threshold=10
)

df = query.export_pandas()
print(df)

   count                 timestamp user_mention_name
0   1303  2014-03-03T00:00:00.000Z      TheEllenShow
1     44  2014-03-03T00:00:00.000Z        TheAcademy
2     21  2014-03-03T00:00:00.000Z               MTV
3     21  2014-03-03T00:00:00.000Z         peoplemag
4     17  2014-03-03T00:00:00.000Z               THR
5     16  2014-03-03T00:00:00.000Z      ItsQueenElsa
6     16  2014-03-03T00:00:00.000Z           eonline
7     15  2014-03-03T00:00:00.000Z       PerezHilton
8     14  2014-03-03T00:00:00.000Z     realjohngreen
9     12  2014-03-03T00:00:00.000Z       KevinSpacey

groupby

What does the social network of users replying to other users look like?

from igraph import *
from cairo import *
from pandas import concat

group = query.groupby(
    datasource='twitterstream',
    granularity='hour',
    intervals='2013-10-04/pt12h',
    dimensions=["user_name", "reply_to_name"],
    filter=(~(Dimension("reply_to_name") == "Not A Reply")) &
           (Dimension("user_location") == "California"),
    aggregations={"count": doublesum("count")}
)

df = query.export_pandas()

# map names to categorical variables with a lookup table
names = concat([df['user_name'], df['reply_to_name']]).unique()
nameLookup = dict([pair[::-1] for pair in enumerate(names)])
df['user_name_lookup'] = df['user_name'].map(nameLookup.get)
df['reply_to_name_lookup'] = df['reply_to_name'].map(nameLookup.get)

# create the graph with igraph
g = Graph(len(names), directed=False)
g.vs["name"] = names
# each (user, reply_to) pair is an edge in the reply network
edges = list(zip(df['user_name_lookup'], df['reply_to_name_lookup']))
g.add_edges(edges)
layout = g.layout_fruchterman_reingold()
plot(g, "tweets.png", layout=layout, vertex_size=2, bbox=(400, 400), margin=25, edge_width=1, vertex_color="blue")

[figure: reply-network graph rendered by igraph, saved as tweets.png]

asynchronous client

pydruid.async_client.AsyncPyDruid implements an asynchronous client. To achieve that, it uses the asynchronous HTTP client from the Tornado framework. The asynchronous client is suitable for use with async frameworks such as Tornado and provides much better performance at scale: it lets you serve multiple requests at the same time, without blocking on Druid while it executes your queries.

example

from tornado import gen
from pydruid.async_client import AsyncPyDruid
from pydruid.utils.aggregators import doublesum
from pydruid.utils.filters import Dimension

client = AsyncPyDruid(url_to_druid_broker, 'druid/v2')

@gen.coroutine
def your_asynchronous_method_serving_top10_mentions_for_day(day):
    top_mentions = yield client.topn(
        datasource='twitterstream',
        granularity='all',
        intervals="%s/p1d" % (day, ),
        aggregations={'count': doublesum('count')},
        dimension='user_mention_name',
        filter=(Dimension('user_lang') == 'en') & (Dimension('first_hashtag') == 'oscars') &
               (Dimension('user_time_zone') == 'Pacific Time (US & Canada)') &
               ~(Dimension('user_mention_name') == 'No Mention'),
        metric='count',
        threshold=10)

    # asynchronously return results
    # can be simply ```return top_mentions``` in python 3.x
    raise gen.Return(top_mentions)

thetaSketches

Theta sketch post aggregators are constructed slightly differently from normal post aggregators, as they support different operators. Note: you must have the druid-datasketches extension loaded into your Druid cluster in order to use these. See the Druid datasketches documentation for details.

from pydruid.client import *
from pydruid.utils import aggregators
from pydruid.utils import filters
from pydruid.utils import postaggregator

query = PyDruid(url_to_druid_broker, 'druid/v2')
ts = query.groupby(
    datasource='test_datasource',
    granularity='all',
    intervals='2016-09-01/P1M',
    filter=filters.Dimension('product').in_(['product_A', 'product_B']),
    aggregations={
        'product_A_users': aggregators.filtered(
            filters.Dimension('product') == 'product_A',
            aggregators.thetasketch('user_id')
            ),
        'product_B_users': aggregators.filtered(
            filters.Dimension('product') == 'product_B',
            aggregators.thetasketch('user_id')
            )
    },
    post_aggregations={
        'both_A_and_B': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketch('product_A_users') & postaggregator.ThetaSketch('product_B_users')
            )
    }
)

DB API

from pydruid.db import connect

conn = connect(host='localhost', port=8082, path='/druid/v2/sql/', scheme='http')
curs = conn.cursor()
curs.execute("""
    SELECT place,
           CAST(REGEXP_EXTRACT(place, '(.*),', 1) AS FLOAT) AS lat,
           CAST(REGEXP_EXTRACT(place, ',(.*)', 1) AS FLOAT) AS lon
      FROM places
     LIMIT 10
""")
for row in curs:
    print(row)

SQLAlchemy

from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *

engine = create_engine('druid://localhost:8082/druid/v2/sql/')  # uses HTTP by default :(
# engine = create_engine('druid+http://localhost:8082/druid/v2/sql/')
# engine = create_engine('druid+https://localhost:8082/druid/v2/sql/')

places = Table('places', MetaData(bind=engine), autoload=True)
print(select([func.count('*')], from_obj=places).scalar())

Column headers

In version 0.13.0, Druid SQL added support for including the column names in the response, which can be requested via the "header" field in the request. This helps ensure that the cursor description is defined (a requirement for SQLAlchemy query statements) regardless of whether the result set contains any rows. Historically this was problematic for result sets which contained no rows, as one could not infer the expected column names.

Enabling the header can be configured via the SQLAlchemy URI by using the query parameter, i.e.,

engine = create_engine('druid://localhost:8082/druid/v2/sql?header=true')

Note: the current default is false to ensure backwards compatibility, but it should be set to true for Druid versions >= 0.13.0.

Command line

$ pydruid http://localhost:8082/druid/v2/sql/
> SELECT COUNT(*) AS cnt FROM places
  cnt
-----
12345
> SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES;
TABLE_NAME
----------
test_table
COLUMNS
SCHEMATA
TABLES
> BYE;
GoodBye!

Contributing

Contributions are of course welcome. We like to use black and flake8.

pip install -r requirements-dev.txt  # installs useful dev deps
pre-commit install  # installs useful commit hooks

pydruid's Issues

Tests unittest vs py.test

Is there any reason py.test was chosen instead of the built-in unittest for testing? I would like to add some tests, and I honestly don't know py.test, but unittest seems compatible. Can I write them without using py.test?
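
For what it's worth, pytest collects and runs plain unittest.TestCase classes unchanged, so a test written as below runs under both runners (test names are illustrative; the .filter['filter'] layout follows the Filter internals shown in the tracebacks elsewhere on this page):

import unittest

from pydruid.utils.filters import Dimension


class TestFilters(unittest.TestCase):
    # pytest discovers unittest.TestCase subclasses automatically, so this
    # test runs under both `pytest` and `python -m unittest`.
    def test_selector_filter(self):
        flt = Dimension("lang") == "en"
        self.assertEqual(flt.filter["filter"]["type"], "selector")


if __name__ == "__main__":
    unittest.main()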

Add kerberos auth

Some Druid clusters run with Kerberos enabled; it would be nice to have pydruid work with these kerberized instances.
I just saw that you are using the requests library to call the Druid HTTP API.
I also saw that there is a requests-kerberos library to add Kerberos auth.
Would it be possible to integrate it into pydruid?

The only requirement would be to add an argument in the requests calls, like this example:

import requests
from requests_kerberos import HTTPKerberosAuth, REQUIRED
kerberos_auth = HTTPKerberosAuth(mutual_authentication=REQUIRED, sanitize_mutual_error_response=False)
r = requests.get("https://windows.example.org/wsman", auth=kerberos_auth)

Support for dimensionspecs

I have some string data which contains numerical values. I would like to convert this data from string to long, so that I can perform aggregations such as min, sum, etc. As the link below suggests, Druid allows you to perform this conversion by specifying a dimensionSpec. Is this functionality supported in pydruid?

http://druid.io/docs/latest/querying/dimensionspecs.html
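
Not built into pydruid at the time of writing; for reference, the Druid-native object the linked docs describe looks roughly like this, written as a Python dict (dimension names are made up, and outputType support depends on your Druid version):

# A Druid "default" dimensionSpec that reads a string dimension and emits
# it as a LONG, per the linked Druid docs:
dimension_spec = {
    "type": "default",
    "dimension": "price_as_string",
    "outputName": "price",
    "outputType": "LONG",
}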

ThetasketchEstimate not working on Python2.7

When unit tests are run on Python 2.7, TestPostAggregators.test_build_thetapostaggregator fails with:

E       AssertionError: assert [{'field': <p...tchEstimate'}] == [{'field': {'f...tchEstimate'}]
E         At index 0 diff: {'field': <pydruid.utils.postaggregator.ThetaSketch instance at 0x1116e9ea8>, 'type': 'thetaSketchEstimate', 'name': 'pag1'} != {'field': {'fieldName': 'theta1', 'type': 'fieldAccess'}, 'type': 'thetaSketchEstimate', 'name': 'pag1'}

It looks like the object is being used in the dictionary rather than the object's post_aggregator. Will drop a PR to fix.

Support for timeout customization in AsyncPyDruid

Currently AsyncPyDruid uses the default 20s timeout value of Tornado's HTTPRequest, which produces frequent timeout errors for me when running heavy queries. It would be great if we could customize this value.
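
For reference, the knob in question lives on Tornado's HTTPRequest (request_timeout, in seconds, 20.0 by default); whether AsyncPyDruid exposes it depends on your pydruid version, but at the Tornado layer a longer timeout looks like this (URL and body are illustrative):

from tornado.httpclient import HTTPRequest

# Tornado's HTTPRequest accepts request_timeout, which is the value the
# report wants to raise for heavy queries:
request = HTTPRequest(
    url="http://localhost:8082/druid/v2/",
    method="POST",
    headers={"Content-Type": "application/json"},
    body="{}",  # the serialized Druid query would go here
    request_timeout=120.0,
)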

Support for TopNMetricSpec

Looks like passing a TopNMetricSpec for a topN query is currently not supported, since the metric passed into topn is a string.

From topn docstring:

:param str metric: Metric over which to sort the specified dimension by
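
For reference, a non-string TopNMetricSpec in Druid's native JSON is a small object (written here as a Python dict); a bare string like "count" is shorthand for the numeric form:

# Druid's basic numeric topN metric spec; supporting TopNMetricSpec in
# pydruid would mean accepting dicts like this in addition to plain strings.
metric_spec = {"type": "numeric", "metric": "count"}
# Ascending order instead uses {"type": "inverted", "metric": ...}.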

Custom duration granularities?

Based on what I read in the docs, Druid supports custom granularities like this:

{"type": "duration", "duration": 7200000}

Is there support for that in PyDruid?
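
For reference, the object form from the report written as a Python dict; whether the granularity argument accepts a dict varies by pydruid version, so the usage line is hypothetical:

# The Druid "duration" granularity from the report; 7200000 ms = 2 hours:
duration_granularity = {"type": "duration", "duration": 7200000}

# Hypothetical usage, assuming granularity accepts a dict:
# ts = query.timeseries(..., granularity=duration_granularity, ...)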

Support for 'interval' filter.

Is there an 'interval' filter available?

I couldn't find "type": "interval" anywhere in the filter code.

Filtering on a set of ISO 8601 intervals:

{
    "type" : "interval",
    "dimension" : "__time",
    "intervals" : [
      "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
      "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
    ]
}

Can't load plugin sqlalchemy.dialects:druid

Hello all!
I'm trying to create a datasource in superset, using pydruid. I can use pydruid cli successfully, but when I try to create a superset datasource, pointing to druid I'm getting this error: "Can't load plugin sqlalchemy.dialects:druid".

Any idea how to solve this?

Best regards.

Superset version
0.15.0 integrated with hadoop

Expected results
Integrate Druid with SQL lab

Actual results
Unable to integrate Druid with SQL lab

Steps to reproduce
Install pydruid and try to create a database using this plugin

Have Filter implement __eq__

I'd like to write assertions to check that the filters my code generates are valid. Can Filter implement __eq__ to allow comparisons to work?

Eg.

>>> f1 = Filter(value=1, dimension="test")
>>> f2 = Filter(value=1, dimension="test")
>>> f1 == f2
False
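
A minimal sketch of the requested behavior, assuming (as pydruid does) that each Filter keeps a plain-dict representation of itself in self.filter, so structural equality is just dict equality:

class Filter:
    # Simplified stand-in for pydruid's Filter: selector type only, just
    # enough to show the __eq__ idea. (A real patch should also define
    # __hash__, since defining __eq__ disables the default hash.)
    def __init__(self, dimension, value):
        self.filter = {
            "filter": {"type": "selector", "dimension": dimension, "value": value}
        }

    def __eq__(self, other):
        return isinstance(other, Filter) and self.filter == other.filter


f1 = Filter(value=1, dimension="test")
f2 = Filter(value=1, dimension="test")
assert f1 == f2  # True with __eq__; False with the default identity check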

Readme out of date due to moving functions

Some of the functions/constructors like longsum and Field are not included when you import from pydruid.client import *, you need to go into utils, so I think the readme needs to be updated. I can do it myself if needed, but didn't want to give incorrect information since I'm unfamiliar with pydruid.

How to apply multiple filters to a group by query

How can I apply multiple filters to a groupby query? The following does not work:

filters = (pydruid.utils.filters.Dimension("unit") == '000721') & (pydruid.utils.filters.Dimension("val") > 0)
query1 = query.groupby(
    datasource=dataSource,
    granularity='minute',
    # intervals='2016-08-01/p12w',
    # intervals='2016-08-01/pt24h',
    intervals='2016-08-11/2016-08-15',
    dimensions=["GPN"],
    filter=filters,
    aggregations={"val_sum": ag.doublesum("val"), "val_count": ag.count("val")},
    post_aggregations={"Avg": (pag.Field("val_sum") / pag.Field("val_count"))},
    context={"timeout": 600000}
    # limit_spec={
    #     "type": "default",
    #     "limit": 50,
    #     "columns": ["sensor_val_sum", "sensor_val_count"]
    # }
)

This gives the following error:

Traceback (most recent call last):
  File "Test_Query_2Filters.py", line 41, in <module>
    context={"timeout": 600000}#,
  File "/usr/local/lib/python2.7/dist-packages/pydruid/client.py", line 191, in groupby
    query = self.query_builder.groupby(kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pydruid/query.py", line 316, in groupby
    return self.build_query(query_type, args)
  File "/usr/local/lib/python2.7/dist-packages/pydruid/query.py", line 250, in build_query
    query_dict[key] = Filter.build_filter(val)
  File "/usr/local/lib/python2.7/dist-packages/pydruid/utils/filters.py", line 90, in build_filter
    filter['fields'] = [Filter.build_filter(f) for f in filter['fields']]
  File "/usr/local/lib/python2.7/dist-packages/pydruid/utils/filters.py", line 87, in build_filter
    filter = filter_obj.filter['filter']
AttributeError: 'bool' object has no attribute 'filter'
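
The traceback is a symptom of the second comparison rather than of combining filters: Dimension only overloads ==, so (Dimension("val") > 0) evaluates to a plain Python bool, which Filter.build_filter then chokes on. A sketch of expressing val > 0 as a Druid bound filter instead, assuming your pydruid version ships filters.Bound (argument names follow Druid's bound filter and may differ by version):

from pydruid.utils.filters import Bound, Dimension

# val > 0 as a bound filter; combining it with the selector via & then
# produces a proper "and" filter instead of an AttributeError:
flt = (Dimension("unit") == "000721") & Bound(
    dimension="val", lower="0", upper=None, lowerStrict=True
)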

misleading error message

Dear Deep,

I was trying out a groupby query on metrics. pyDruid told me I had a malformed query, but when I ran the generated query through curl it worked.

import pydruid.client
import datetime

bard_url = 'http://x.x.x.x:8080/'
endpoint = 'druid/v2/?pretty'
query = pydruid.client.pyDruid(bard_url,endpoint)

dataSource = 'mmx_metrics'
filters = (pydruid.client.Dimension("metric") == "query/time") & (pydruid.client.Dimension("service") == "druid/prod/bard")
intervals = [datetime.datetime.utcnow().isoformat() + '/PT5M']

foo = query.groupBy(dataSource=dataSource, intervals=intervals, granularity="minute", dimensions=['host','service'], aggregations = {"count": pydruid.client.doubleSum("count")}, filter=filters)

Gives me:

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-11-9d231bdb8e44> in <module>()
----> 1 foo = query.groupBy(dataSource=dataSource, intervals=intervals, granularity="minute", dimensions=['host','service'], aggregations = {"count": pydruid.client.doubleSum("count")}, filter=filters)

/usr/lib/python2.7/site-packages/pyDruid-0.1.7-py2.7.egg/pydruid/client.pyc in groupBy(self, **args)
    157                 self.query_dict = query_dict
    158                 self.query_type = 'groupby'
--> 159                 return self.post(query_dict)
    160 
    161         def segmentMetadata(self, **args):

/usr/lib/python2.7/site-packages/pyDruid-0.1.7-py2.7.egg/pydruid/client.pyc in post(self, query)
     47                         res.close()
     48                 except urllib2.HTTPError, e:
---> 49                         raise IOError('Malformed query: \n {0}'.format(json.dumps(self.query_dict, indent = 4)))
     50                 else:
     51                         self.result = self.parse()

IOError: Malformed query: 
 {
    "dimensions": [
        "host",
        "service"
    ],
    "aggregations": [
        {
            "type": "doubleSum",
            "fieldName": "count",
            "name": "count"
        }
    ],
    "filter": {
        "fields": [
            {
                "type": "selector",
                "dimension": "metric",
                "value": "query/time"
            },
            {
                "type": "selector",
                "dimension": "service",
                "value": "druid/prod/bard"
            }
        ],
        "type": "and"
    },
    "intervals": [
        "2013-12-06T00:38:38.760172/PT5M"
    ],
    "dataSource": "mmx_metrics",
    "granularity": "minute",
    "queryType": "groupBy"
}

I put the generated query into /tmp/query.druid and ran the following:

curl -X POST "http://x.x.x.x:8080/druid/v2/?pretty" -H 'content-type: application/json' -d @/tmp/query.druid

It returned the results I expected.

I saw this with both the pip installed version and the git version.

-Jeff

SubQueries Support

Hello All,

How can I generate subqueries using pydruid, given that the datasource field only takes either a str or a list?
""" ValueError: Datasource definition not valid. Must be string or list of strings """

Below is a sample query, in which the output of the inner groupBy query is passed to the outer query as its datasource.

{
  "queryType": "groupBy",
  "dataSource":{
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "dataSource": "druid_source",
      "granularity": {"type": "period", "period": "P1M"},
      "dimensions": ["source_dim"],
      "aggregations": [
        { "type": "doubleMax", "name": "value", "fieldName": "stream_value" }
      ],
      "intervals": [ "2012-01-01T00:00:00.000/2020-01-03T00:00:00.000" ]
    }
  },
  "granularity": "hour",
  "dimensions": ["source_dim"],
  "aggregations": [
    { "type": "longSum", "name": "outerquerryvalue", "fieldName": "value" }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2020-01-03T00:00:00.000" ]
}

PyDruid Installation Error??

this comes up after running pip install pydruid:

Running setup.py egg_info for package pydruid
Traceback (most recent call last):
  File "<string>", line 14, in ?
  File "/home/ctsai/build/pydruid/setup.py", line 30, in ?
    tests_require=['pytest', 'six', 'mock'],
  File "/usr/lib64/python2.4/distutils/core.py", line 110, in setup
    _setup_distribution = dist = klass(attrs)
  File "/usr/lib/python2.4/site-packages/setuptools/dist.py", line 219, in __init__
    self.fetch_build_eggs(attrs.pop('setup_requires'))
  File "/usr/lib/python2.4/site-packages/setuptools/dist.py", line 242, in fetch_build_eggs
    for dist in working_set.resolve(
  File "/usr/lib/python2.4/site-packages/pkg_resources.py", line 481, in resolve
    dist = best[req.key] = env.best_match(req, self, installer)
  File "/usr/lib/python2.4/site-packages/pkg_resources.py", line 717, in best_match
    return self.obtain(req, installer) # try and download/install
  File "/usr/lib/python2.4/site-packages/pkg_resources.py", line 729, in obtain
    return installer(requirement)
  File "/usr/lib/python2.4/site-packages/setuptools/dist.py", line 286, in fetch_build_egg
    return cmd.easy_install(req)
  File "/usr/lib/python2.4/site-packages/setuptools/command/easy_install.py", line 446, in easy_install
    return self.install_item(spec, dist.location, tmpdir, deps)
  File "/usr/lib/python2.4/site-packages/setuptools/command/easy_install.py", line 471, in install_item
    dists = self.install_eggs(spec, download, tmpdir)
  File "/usr/lib/python2.4/site-packages/setuptools/command/easy_install.py", line 655, in install_eggs
    return self.build_and_install(setup_script, setup_base)
  File "/usr/lib/python2.4/site-packages/setuptools/command/easy_install.py", line 930, in build_and_install
    self.run_setup(setup_script, setup_base, args)
  File "/usr/lib/python2.4/site-packages/setuptools/command/easy_install.py", line 919, in run_setup
    run_setup(setup_script, args)
  File "/usr/lib/python2.4/site-packages/setuptools/sandbox.py", line 26, in run_setup
    DirectorySandbox(setup_dir).run(
  File "/usr/lib/python2.4/site-packages/setuptools/sandbox.py", line 63, in run
    return func()
  File "/usr/lib/python2.4/site-packages/setuptools/sandbox.py", line 29, in <lambda>
    {'__file__': setup_script, '__name__': '__main__'}
  File "setup.py", line 9
    with io.open('README.rst', encoding='utf-8') as readme:
       ^
SyntaxError: invalid syntax

Command python setup.py egg_info failed with error code 1

Broker Load-balancing

Add an option to query ZooKeeper for available brokers, for availability / load-balancing.

Bug with bound filter for negative value

The 'alphaNumeric' parameter in the bound filter is useless when the dimension is numeric and has negative values. I found that setting 'ordering' to 'numeric' instead of 'alphaNumeric' solves the problem, but this is not supported in the latest version. I hope the pydruid API can keep up with druid.io.
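
For reference, the Druid-native filter the report is after, written as a Python dict (dimension and bound values are illustrative):

# Bound filter with numeric ordering, which handles negative values
# correctly; the ask is for pydruid's Bound filter to expose "ordering":
bound_filter = {
    "type": "bound",
    "dimension": "temperature",
    "lower": "-10",
    "ordering": "numeric",
}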

Support for granularity spec

The current version only seems to support granularities that are defined as enums in Druid. It does not support the more generic JSON object scheme for granularity.
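
For reference, the object form the report refers to; e.g. Druid's "period" granularity as a Python dict (values are illustrative):

# A generic Druid granularity object, beyond the simple string enums:
period_granularity = {
    "type": "period",
    "period": "P1D",
    "timeZone": "America/Los_Angeles",
}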

Lack of description for empty result set causes SQLAlchemy exception

SQLAlchemy throws an AttributeError when fetching from a ResultProxy when the underlying Druid query returns an empty result set (i.e., no rows), as the proxy is prematurely closed:

>>> from sqlalchemy.engine import create_engine

>>> engine = create_engine('druid://localhost:8082/druid/v2/sql')
>>> result = engine.execute("SELECT SUM(x) FROM foo WHERE bar IS NULL")
>>> result.fetchall()
...
AttributeError: 'NoneType' object has no attribute 'fetchall'

The reason the ResultProxy is closed is because cursor.description is None. This results in result.cursor being None which is why the exception is thrown.

It seems that other dialects have a description irrespective of whether there are rows and hence don't suffer from the same issue. Note I'm uncertain whether there's even a viable solution, as the Druid SQL REST API doesn't provide this information for an empty result set:

> cat query.json
{"query": "SELECT SUM(x) FROM foo WHERE bar IS NULL"}

> curl -XPOST -H 'Content-Type: application/json' http://localhost:8082/druid/v2/sql/ -d @query.json
[]

Note I know one can circumvent this issue by using a raw connection, however this example illustrates the behavior that Pandas uses for reading SQL.

to: @betodealmeida @mistercrunch

Use union of data sources

Thanks for the nice package; it seems to be useful! 👍

It is possible to combine several data sources together, like so:

{
       "type": "union",
       "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]
}

Is there a way to consider this case when using pydruid?
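
Judging by the ValueError quoted in the SubQueries issue above ("Must be string or list of strings"), pydruid appears to build exactly this union form when you pass a list of datasource names; a sketch, with broker URL and datasource names illustrative:

from pydruid.client import PyDruid
from pydruid.utils.aggregators import doublesum

query = PyDruid('http://localhost:8082', 'druid/v2')
ts = query.timeseries(
    datasource=['ds1', 'ds2', 'ds3'],  # a list maps to Druid's union datasource
    granularity='day',
    intervals='2016-09-01/P1M',
    aggregations={'count': doublesum('count')},
)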

Query construction results in nested filters instead of an array

I'm trying to run the following query

selected_apps = ((dr.Dimension('appId') == 0) | (dr.Dimension('appId') == 1) | (dr.Dimension('appId') == 2))

query = druid.topn(datasource='sessions', granularity='all', intervals='2016-02-05/P7D', filter=selected_apps, aggregations={'sessions': dr.longsum('sessions')}, dimension='appId', metric='sessions')

Looking at the actual constructed query I get:

{
    "metric": "sessions",
    "aggregations": [
        {
            "fieldName": "sessions",
            "type": "longSum",
            "name": "sessions"
        }
    ],
    "dimension": "appId",
    "filter": {
        "fields": [
            {
                "fields": [
                    {
                        "type": "selector",
                        "dimension": "appId",
                        "value": 0
                    },
                    {
                        "type": "selector",
                        "dimension": "appId",
                        "value": 1
                    }
                ],
                "type": "or"
            },
            {
                "type": "selector",
                "dimension": "appId",
                "value": 2
            }
        ],
        "type": "or"
    },
    "intervals": "2016-02-05/P7D",
    "dataSource": "sessions",
    "granularity": "all",
    "queryType": "topN"
}

Inside the filter object I was expecting the fields array to contain all selectors at the same level and not nested.

select/search/generic queries

Hi,

As far as I can see, pydruid does not currently support the "new" query types, i.e. "select" and "search". It would be nice to have them :-).

Maybe it would also make sense to expose the __post method, so we are able to send queries directly? Might also be useful if somebody wants to use pydruid as a "middleware", with queries prepared somewhere else.

`Dimension` object has no attribute `in_`

I'm doing virtually the same thing as the last example on the GitHub page, where I try to create a filter with Dimension(something).in_([...]), but the error in the title shows up. Is this a problem with the Python version? I'm using 3.5.3. I've also checked that my pydruid version is 0.3.1.

Filter class selector type is not implemented

I installed pydruid with pip.
Doing Filter(type="selector", dimension="dim", value=val) returns:

'Filter type: {0} does not exist'.format(args['type']))
NotImplementedError: Filter type: selector does not exist

Why isn't a basic selector filter implemented yet?
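
In the pip release in question, selector filters are constructed through the Dimension shorthand rather than Filter(type="selector", ...); the equivalent working call is:

from pydruid.utils.filters import Dimension

# Builds the same selector filter the reporter was after:
flt = Dimension("dim") == "val"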

bad error for druid-sql not enabled

First time Druid user here - I started my Druid server and found that when I tried opening the CLI or anything with pydruid I kept getting:

  File "/opt/anaconda3/bin/pydruid", line 11, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.6/site-packages/pydruid/console.py", line 170, in main
    words = get_autocomplete(connection)
  File "/opt/anaconda3/lib/python3.6/site-packages/pydruid/console.py", line 155, in get_autocomplete
    get_tables(connection)
  File "/opt/anaconda3/lib/python3.6/site-packages/pydruid/console.py", line 143, in get_tables
    cursor.execute('SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES')
  File "/opt/anaconda3/lib/python3.6/site-packages/pydruid/db/api.py", line 41, in g
    return f(self, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.6/site-packages/pydruid/db/api.py", line 189, in execute
    first_row = next(results)
  File "/opt/anaconda3/lib/python3.6/site-packages/pydruid/db/api.py", line 269, in _stream_query
    payload = r.json()
  File "/opt/anaconda3/lib/python3.6/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/opt/anaconda3/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/opt/anaconda3/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/anaconda3/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I had to go through the source to realize that there is a .json() call used to surface the error message, and that call was what was failing. It would be a good idea to catch this and report it in a better way.
I realized my issue was that nothing was running on localhost:8082.
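
A sketch of the friendlier handling the report asks for, written as a standalone helper (not pydruid's actual code): surface the raw response body when the error payload isn't JSON, e.g. when Druid SQL is disabled or nothing is listening on the port. The error format string mirrors the one visible in the traceback above:

import requests


def post_sql(url, query):
    r = requests.post(url, json={"query": query})
    if r.status_code != 200:
        try:
            payload = r.json()
        except ValueError:
            # Non-JSON error body: report it with context instead of
            # letting a bare JSONDecodeError escape.
            raise RuntimeError(
                "Druid returned a non-JSON error (HTTP %d): %r"
                % (r.status_code, r.text[:200])
            )
        raise RuntimeError("{error} ({errorClass}): {errorMessage}".format(**payload))
    return r.json()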

PyDruid 0.3.2?

With the merge of #72 we now have the ability to perform theta sketch operations. This functionality is needed in Superset, but we need a new PyDruid pip package to be pushed so that we can use this new feature.

Is there a release process / timeline to get a new package pushed? Is there anything I can do to help? My understanding is that the setup.py script needs to be modified to list 0.3.2, and a git tag added for that version.

Query Filter Or Syntax?

What is the syntax for OR conditions in the filter argument?

eg. filter = (Dimension('A')=='val1') & ((Dimension('B')=='val2')# | (Dimension('C')=='val3'))

This doesn't seem to work and leads to this error:

File "./pydruid_query.py", line 150, in <module>
gfl2 = group_last2.export_tsv('group_last2_result.tsv')
File "/usr/lib/python2.6/site-packages/pydruid/query.py", line 99, in export_tsv
header = list(self.result[0]['event'].keys())
IndexError: list index out of range
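
With the stray '#' removed and the parentheses balanced, the combination itself is valid pydruid syntax; note the IndexError above comes from exporting an empty result set, which usually means the query matched no rows rather than that the filter failed to parse:

from pydruid.utils.filters import Dimension

# A AND (B OR C), using pydruid's overloaded boolean operators:
flt = (Dimension('A') == 'val1') & (
    (Dimension('B') == 'val2') | (Dimension('C') == 'val3')
)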

name 'Dimension' is not defined

Hi,

I am trying pydruid for the first time on Python 2.7, with the wikipedia data source.
However, I am getting the following error when trying to execute this query in Python:

top_langs = query.topn(
    datasource="wikipedia",
    granularity="all",
    intervals="2013-06-01T00:00/2020-01-01T00",
    dimension="channel",
    filter=Dimension("namespace") == "article",
    aggregations={"edit_count": longsum("count")},
    metric="edit_count",
    threshold=4
)

print top_langs  # Do this if you want to see the raw JSON

NameError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      4     intervals = "2013-06-01T00:00/2020-01-01T00",
      5     dimension = "channel",
----> 6     filter = Dimension("namespace") == "article",
      7     aggregations = {"edit_count": longsum("count")},
      8     metric = "edit_count",

NameError: name 'Dimension' is not defined
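
As noted in the "Readme out of date" issue above, Dimension and the aggregators are not re-exported by pydruid.client; importing them from pydruid.utils fixes the NameError:

from pydruid.utils.aggregators import longsum
from pydruid.utils.filters import Dimension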

Exception when `filter=None`

When building a query, if filter=None then an exception occurs:

File "/Users/se7entyse7en/Envs/viralize-web/lib/python2.7/site-packages/pydruid/utils/filters.py", line 61, in build_filter
    return filter_obj.filter['filter']
 AttributeError: 'NoneType' object has no attribute 'filter'
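
A minimal sketch of the guard the report implies, mirroring the build_filter entry point from the traceback (not pydruid's actual patch):

def build_filter(filter_obj):
    # Treat filter=None as "no filter" instead of dereferencing .filter on
    # None, which is what raises the AttributeError above:
    if filter_obj is None:
        return None
    return filter_obj.filter['filter']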

Re-upload version 0.2.1 to pypi

It looks like while releasing the new version, you removed 0.2.1 from PyPI.

I'd ask you to re-upload it, as we have a strict dependency upgrade policy, and our requirements are locked against specific versions of packages, including pydruid.

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I am trying to read a Druid database using Python 3.5 and pydruid's DB API; however, whenever I run the execute statement I get the error below.

Code I am running:
from pydruid.db import connect
conn = connect(host='XXXXXXX', port=8082, path='/druid/v2', scheme='http')
curs = conn.cursor()
curs.execute("""SELECT * FROM wikipedia LIMIT 10""")

Error:
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 curs.execute("""SELECT * FROM wikipedia LIMIT 10""")

/apps/cmor/anaconda3/lib/python3.5/site-packages/pydruid-0.3.1-py3.5.egg/pydruid/db/api.py in g(self, *args, **kwargs)
     39             raise exceptions.Error(
     40                 '{klass} already closed'.format(klass=self.__class__.__name__))
---> 41         return f(self, *args, **kwargs)
     42     return g
     43

/apps/cmor/anaconda3/lib/python3.5/site-packages/pydruid-0.3.1-py3.5.egg/pydruid/db/api.py in execute(self, operation, parameters)
    187         # let's consume it and insert it back.
    188         results = self._stream_query(query)
--> 189         first_row = next(results)
    190         self._results = itertools.chain([first_row], results)
    191

/apps/cmor/anaconda3/lib/python3.5/site-packages/pydruid-0.3.1-py3.5.egg/pydruid/db/api.py in _stream_query(self, query)
    267         # raise any error messages
    268         if r.status_code != 200:
--> 269             payload = r.json()
    270             msg = (
    271                 '{error} ({errorClass}): {errorMessage}'.format(**payload)

/apps/cmor/anaconda3/lib/python3.5/site-packages/requests/models.py in json(self, **kwargs)
    892             # used.
    893             pass
--> 894         return complexjson.loads(self.text, **kwargs)
    895
    896     @property

/apps/cmor/anaconda3/lib/python3.5/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    317         parse_int is None and parse_float is None and
    318         parse_constant is None and object_pairs_hook is None and not kw):
--> 319         return _default_decoder.decode(s)
    320     if cls is None:
    321         cls = JSONDecoder

/apps/cmor/anaconda3/lib/python3.5/json/decoder.py in decode(self, s, _w)
    337
    338         """
--> 339         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    340         end = _w(s, end).end()
    341         if end != len(s):

/apps/cmor/anaconda3/lib/python3.5/json/decoder.py in raw_decode(self, s, idx)
    355             obj, end = self.scan_once(s, idx)
    356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
    358         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
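
The likely culprit is the path: the DB API talks to Druid's SQL endpoint, /druid/v2/sql/ (as in the DB API example near the top of this page), while /druid/v2 is the native-query endpoint and answers SQL POSTs with a non-JSON error body, hence the JSONDecodeError:

from pydruid.db import connect

# Same code as the report, but pointed at the SQL endpoint:
conn = connect(host='XXXXXXX', port=8082, path='/druid/v2/sql/', scheme='http')
curs = conn.cursor()
curs.execute("SELECT * FROM wikipedia LIMIT 10")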

pydruid installation failing

 pip install pydruid 
Collecting pydruid
  Using cached https://files.pythonhosted.org/packages/32/91/4be6f902d50f22fc6b9e2eecffbef7d00989ba477e9c8e034074186cd10c/pydruid-0.4.2.tar.gz
    Complete output from command python setup.py egg_info:
    Download error on https://pypi.org/simple/pytest-runner/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:645) -- Some packages may not be found!
    Couldn't find index page for 'pytest-runner' (maybe misspelled?)
    Download error on https://pypi.org/simple/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:645) -- Some packages may not be found!
    No local packages or working download links found for pytest-runner
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/6g/xcjdd64j6clctps0_25h_n3h0000gp/T/pip-install-iatsyysf/pydruid/setup.py", line 44, in <module>
        include_package_data=True,
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/setuptools/__init__.py", line 128, in setup
        _install_setup_requires(attrs)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/setuptools/__init__.py", line 123, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/setuptools/dist.py", line 504, in fetch_build_eggs
        replace_conflicting=True,
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pkg_resources/__init__.py", line 774, in resolve
        replace_conflicting=replace_conflicting
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1057, in best_match
        return self.obtain(req, installer)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1069, in obtain
        return installer(requirement)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/setuptools/dist.py", line 571, in fetch_build_egg
        return cmd.easy_install(req)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/setuptools/command/easy_install.py", line 667, in easy_install
        raise DistutilsError(msg)
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('pytest-runner')
    ----------------------------------------
Hitting this on macOS.

Version bump

Is it possible to bump a new version (maybe 0.3.1) including the 'in' filter support?

Issues with .json

Please, I need help. I am new to Python and am following a training video from Udemy (Build 10 Real World Applications).
I am at the point of building the dictionary, but I am stuck with the errors below.
I don't know what else to do, and I will appreciate any help concerning how to fix this error.

Thanks

    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "C:\Users\learn\AppData\Local\Programs\Python\Python37-32\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "C:\Users\learn\AppData\Local\Programs\Python\Python37-32\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\learn\AppData\Local\Programs\Python\Python37-32\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Versions issues

It seems like the GitHub master branch corresponds to version 0.2, but on PyPI you can find version 0.2.1. Is this intended?
