googleapis / python-bigquery-pandas

Google BigQuery connector for pandas

Home Page: https://googleapis.dev/python/pandas-gbq/latest/index.html

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 88.83%, Shell 10.43%, Dockerfile 0.74%
Topics: pandas, bigquery, data

python-bigquery-pandas's Introduction

pandas-gbq


pandas-gbq is a package providing an interface to the Google BigQuery API from pandas.

Installation

Install latest release version via conda

$ conda install pandas-gbq --channel conda-forge

Install latest release version via pip

$ pip install pandas-gbq

Install latest development version

$ pip install git+https://github.com/googleapis/python-bigquery-pandas.git

Usage

Perform a query

import pandas_gbq

result_dataframe = pandas_gbq.read_gbq("SELECT column FROM dataset.table WHERE value = 'something'")

Upload a dataframe

import pandas_gbq

pandas_gbq.to_gbq(dataframe, "dataset.table")

More samples

See the pandas-gbq documentation for more details.

python-bigquery-pandas's People

Contributors

aktech, aribray, blose, bsolomon1124, bwanglzu, chalmerlowe, dependabot[bot], gcf-owl-bot[bot], google-cloud-policy-bot[bot], jasonqng, johnpaton, jreback, kiraksi, linchin, max-sixty, melissachang, meredithslota, mr-mcox, nicoa, parthea, pbudzyns, release-please[bot], renovate-bot, rhoboro, robertlacok, shantanukumar, tswast, tworec, vreyespue, yokomotod


python-bigquery-pandas's Issues

Google auth error - invalid_grant: Token has been expired or revoked

From #76, the unit test TestGBQConnectorIntegrationWithLocalUserAccountAuth.test_get_user_account_credentials_bad_file_returns_credentials was failing for @hagino3000 with invalid_grant: Token has been expired or revoked. I believe I've seen this error as well. I'll try to reproduce it.

========================================================================== FAILURES ===========================================================================
_____________________ TestGBQConnectorIntegrationWithLocalUserAccountAuth.test_get_user_account_credentials_bad_file_returns_credentials ______________________

self = <pandas_gbq.tests.test_gbq.TestGBQConnectorIntegrationWithLocalUserAccountAuth object at 0x11327ba90>

    def test_get_user_account_credentials_bad_file_returns_credentials(self):
        import mock
        from google.auth.credentials import Credentials
>       with mock.patch('__main__.open', side_effect=IOError()):

pandas_gbq/tests/test_gbq.py:231:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../env/lib/python3.6/site-packages/mock.py:1268: in __enter__
    original, local = self.get_original()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <mock._patch object at 0x112f1c978>

    def get_original(self):
        target = self.getter()
        name = self.attribute

        original = DEFAULT
        local = False

        try:
            original = target.__dict__[name]
        except (AttributeError, KeyError):
            original = getattr(target, name, DEFAULT)
        else:
            local = True

        if not self.create and original is DEFAULT:
            raise AttributeError(
>               "%s does not have the attribute %r" % (target, name)
            )
E           AttributeError: <module '__main__' from '/Users/t-nishibayashi/dev/workspace/BigQuery-Python-dev/env/bin/pytest'> does not have the attribute 'open
'

../../env/lib/python3.6/site-packages/mock.py:1242: AttributeError
__________________________ TestGBQConnectorIntegrationWithLocalUserAccountAuth.test_get_user_account_credentials_returns_credentials __________________________

self = <pandas_gbq.tests.test_gbq.TestGBQConnectorIntegrationWithLocalUserAccountAuth object at 0x11301d940>

    def test_get_user_account_credentials_returns_credentials(self):
        from google.auth.credentials import Credentials
>       credentials = self.sut.get_user_account_credentials()

pandas_gbq/tests/test_gbq.py:237:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas_gbq/gbq.py:340: in get_user_account_credentials
    credentials = self.load_user_account_credentials()
pandas_gbq/gbq.py:298: in load_user_account_credentials
    credentials.refresh(request)
../../env/lib/python3.6/site-packages/google/oauth2/credentials.py:126: in refresh
    self._client_secret))
../../env/lib/python3.6/site-packages/google/oauth2/_client.py:189: in refresh_grant
    response_data = _token_endpoint_request(request, token_uri, body)
../../env/lib/python3.6/site-packages/google/oauth2/_client.py:109: in _token_endpoint_request
    _handle_error_response(response_body)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

response_body = '{\n  "error" : "invalid_grant",\n  "error_description" : "Token has been expired or revoked."\n}'

    def _handle_error_response(response_body):
        """"Translates an error response into an exception.

        Args:
            response_body (str): The decoded response data.

        Raises:
            google.auth.exceptions.RefreshError
        """
        try:
            error_data = json.loads(response_body)
            error_details = '{}: {}'.format(
                error_data['error'],
                error_data.get('error_description'))
        # If no details could be extracted, use the response data.
        except (KeyError, ValueError):
            error_details = response_body

        raise exceptions.RefreshError(
>           error_details, response_body)
E       google.auth.exceptions.RefreshError: ('invalid_grant: Token has been expired or revoked.', '{\n  "error" : "invalid_grant",\n  "error_description" : "T
oken has been expired or revoked."\n}')

../../env/lib/python3.6/site-packages/google/oauth2/_client.py:59: RefreshError

TST: Improve unit tests for GoogleCredentials.get_application_default()

The test test_get_application_default_credentials_returns_credentials() (https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/tests/test_gbq.py#L297) is skipped if _check_if_can_get_correct_default_credentials() returns false, which happens when default credentials are not available. To improve test coverage, specifically on Travis, we could temporarily populate the default credentials during testing so that the code paths which use them are exercised, and unset them again after testing; see the sketch below.
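A minimal sketch of such a fixture, assuming a service-account key file is exposed to CI through a hypothetical GBQ_SERVICE_ACCOUNT_KEY_PATH secret (the fixture name and variable are illustrative, not part of pandas-gbq):

import os

import pytest


@pytest.fixture
def temporary_default_credentials(monkeypatch):
    # Hypothetical fixture: point GOOGLE_APPLICATION_CREDENTIALS at a key
    # file for the duration of a test so google.auth.default() succeeds.
    key_path = os.environ.get("GBQ_SERVICE_ACCOUNT_KEY_PATH")  # assumed CI secret
    if key_path is None:
        pytest.skip("no service account key available for default credentials")
    monkeypatch.setenv("GOOGLE_APPLICATION_CREDENTIALS", key_path)
    yield
    # monkeypatch automatically restores the original environment on teardown.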

Missing license headers

Related to #30, the LICENSE file includes a header that is supposed to be reproduced at the top of the code files, but I'm not seeing it there.

array_agg() fails because of TypeError

Version: 0.1.4

When doing array_agg() on a column of INTs, you get a TypeError. The same query works fine in the BigQuery UI.

q = """
SELECT a, array_agg(mean_rank) ranks, count(*) ct
FROM `table`
group by a 
"""
gbq.read_gbq(q, dialect='standard', verbose=False, project_id='project')

Another example query to reproduce:

select array_agg(a)
from
(select 1 a)

Error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-b8c08e788608> in <module>()
      5 
      6 """
----> 7 gbq.read_gbq(q, dialect='standard', verbose=False, project_id='project')
      8

/home/jasonng/anaconda2/lib/python2.7/site-packages/pandas_gbq/gbq.pyc in read_gbq(query, project_id, index_col, col_order, reauth, verbose, private_key, dialect, **kwargs)
    725     while len(pages) > 0:
    726         page = pages.pop()
--> 727         dataframe_list.append(_parse_data(schema, page))
    728 
    729     if len(dataframe_list) > 0:

/home/jasonng/anaconda2/lib/python2.7/site-packages/pandas_gbq/gbq.pyc in _parse_data(schema, rows)
    621         for col_num, field_type in enumerate(col_types):
    622             field_value = _parse_entry(entries[col_num].get('v', ''),
--> 623                                        field_type)
    624             page_array[row_num][col_num] = field_value
    625 

/home/jasonng/anaconda2/lib/python2.7/site-packages/pandas_gbq/gbq.pyc in _parse_entry(field_value, field_type)
    631         return None
    632     if field_type == 'INTEGER':
--> 633         return int(field_value)
    634     elif field_type == 'FLOAT':
    635         return float(field_value)

TypeError: int() argument must be a string or a number, not 'list'
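A hedged sketch of one possible fix in _parse_entry, assuming REPEATED field values arrive from the BigQuery API as lists of {'v': value} wrappers (this assumption would need to be verified against the actual response shape):

def _parse_entry(field_value, field_type):
    # Sketch only: handle repeated fields (e.g. array_agg results) by parsing
    # each wrapped element instead of calling int()/float() on the whole list.
    if field_value is None or field_value == '':
        return None
    if isinstance(field_value, list):
        return [_parse_entry(item.get('v'), field_type) for item in field_value]
    if field_type == 'INTEGER':
        return int(field_value)
    elif field_type == 'FLOAT':
        return float(field_value)
    return field_value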

Replace oauth2client by google-auth

oauth2client is now deprecated. No more features will be added to the libraries and the core team is turning down support. We recommend you use google-auth and oauthlib.
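For reference, a minimal sketch of obtaining Application Default Credentials with google-auth instead of oauth2client; the scope shown is the one pandas-gbq already requests:

import google.auth

# google-auth replacement for
# oauth2client.client.GoogleCredentials.get_application_default().
credentials, project_id = google.auth.default(
    scopes=['https://www.googleapis.com/auth/bigquery'])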

CLN: use wait_for_job rather than sleep

Code here: #24 (comment), to replace the sleep calls in tests.

import logging
import time

logger = logging.getLogger(__name__)


def wait_for_job(job):
    # from https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/cloud-client/snippets.py
    while True:
        job.reload()  # Refreshes the state via a GET request.
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        logger.info("Waiting for {} to complete".format(job))
        time.sleep(1)

Add support for DATETIME BigQuery type

Recently a question was posted on Stack Overflow regarding support for the DATETIME field type in BigQuery. I was wondering if it's possible to update the mapping that the to_gbq method does so that DATETIME values are also mapped.

Thanks!
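A rough sketch of what an extended dtype-to-BigQuery-type mapping could look like; the function name is illustrative rather than the actual pandas-gbq internals, and a real change would also need to distinguish tz-aware datetimes (TIMESTAMP) from naive ones (DATETIME):

def _bq_field_for_column(column_name, dtype):
    # Illustrative mapping from a numpy dtype kind to a BigQuery type.
    type_mapping = {
        'i': 'INTEGER',
        'b': 'BOOLEAN',
        'f': 'FLOAT',
        'O': 'STRING',
        'M': 'DATETIME',  # naive datetime64[ns] -> DATETIME instead of TIMESTAMP
    }
    return {'name': column_name, 'type': type_mapping.get(dtype.kind, 'STRING')}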

Possible performance issue when reading large datasets from BigQuery

An issue was reported on StackOverflow about downloading 1,000,000 rows from BigQuery. See https://stackoverflow.com/questions/44868111/failed-to-import-large-data-as-dataframe-from-google-bigquery-to-google-cloud-d

Secondly I use:

data = pd.read_gbq(query='SELECT {ABOUT_30_COLUMNS...} FROM TABLE_NAME LIMIT 1000000', dialect ='standard', project_id='PROJECT_ID')

It runs well at first, but when it goes to about 450,000 rows (calculate using percentage and total row count), it gets stuck at:

Got page: 32; 45.0% done. Elapsed 293.1 s.

We have an integration test for test_download_dataset_larger_than_200k_rows https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/tests/test_gbq.py#L709 . It may be helpful to also include a performance test (and increase the dataset size).

Request payload size exceeds the limit: 10485760 bytes.

I am writing a DataFrame to BigQuery from pandas and have received this error:

pandas_gbq.gbq.GenericGBQException: Reason: badRequest, Message: Request payload size exceeds the limit: 10485760 bytes

Strangely I see the message that streaming insert was 100% complete:

Streaming Insert is 100% Complete

Traceback (most recent call last):
  File "...lib/python2.7/site-packages/pandas/core/frame.py", line 957, in to_gbq
    if_exists=if_exists, private_key=private_key)
  File "...lib/python2.7/site-packages/pandas/io/gbq.py", line 109, in to_gbq
    if_exists=if_exists, private_key=private_key)
  File "...lib/python2.7/site-packages/pandas_gbq/gbq.py", line 1056, in to_gbq
    connector.load_data(dataframe, dataset_id, table_id, chunksize)
  File "...lib/python2.7/site-packages/pandas_gbq/gbq.py", line 643, in load_data
    self.process_http_error(ex)
  File "...lib/python2.7/site-packages/pandas_gbq/gbq.py", line 454, in process_http_error
    "Reason: {0}, Message: {1}".format(reason, message))
pandas_gbq.gbq.GenericGBQException: Reason: badRequest, Message: Request payload size exceeds the limit: 10485760 bytes.
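As a possible workaround, uploading in smaller chunks keeps each streaming-insert request under the 10485760-byte payload limit. A sketch, where dataframe stands for the DataFrame being uploaded and the chunk size shown is only a guess that depends on row width:

import pandas_gbq

# Smaller chunks mean more requests, but each request body stays well under
# the 10 MB streaming-insert payload limit.
pandas_gbq.to_gbq(dataframe, "dataset.table", project_id="project",
                  chunksize=2000, if_exists="append")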

read_gbq() unnecessarily waiting on getting default credentials from Google

When attempting to grant pandas access to my GBQ project, I am running into an issue where read_gbq is trying to get default credentials, failing / timing out, and then printing out a URL to visit in order to grant the credentials. Since I'm not running this on Google Cloud Platform, I do not expect to be able to get default credentials. In my case, I only want to run the CLI flow (without having OAuth call back to my local server).

Here's the code

>>> import pandas_gbq as gbq
>>> gbq.read_gbq('SELECT 1', project_id=<project_id>, auth_local_webserver=False)

Here's what I see when I trigger a SIGINT once the query is invoked:

  File "/usr/lib/python3.5/site-packages/pandas_gbq/gbq.py", line 214, in get_credentials
    credentials = self.get_application_default_credentials()
  File "/usr/lib/python3.5/site-packages/pandas_gbq/gbq.py", line 243, in get_application_default_credentials
    credentials, _ = google.auth.default(scopes=[self.scope])
  File "/usr/lib/python3.5/site-packages/google/auth/_default.py", line 277, in default
    credentials, project_id = checker()
  File "/usr/lib/python3.5/site-packages/google/auth/_default.py", line 274, in <lambda>
    lambda: _get_gce_credentials(request))
  File "/usr/lib/python3.5/site-packages/google/auth/_default.py", line 176, in _get_gce_credentials
    if _metadata.ping(request=request):
  File "/usr/lib/python3.5/site-packages/google/auth/compute_engine/_metadata.py", line 73, in ping
    timeout=timeout)
  File "/usr/lib/python3.5/site-packages/google/auth/transport/_http_client.py", line 103, in __call__
    method, path, body=body, headers=headers, **kwargs)
  File "/usr/lib/python3.5/http/client.py", line 1106, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.5/http/client.py", line 1151, in _send_request
    self.endheaders(body)
  File "/usr/lib/python3.5/http/client.py", line 1102, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python3.5/http/client.py", line 934, in _send_output
    self.send(msg)
  File "/usr/lib/python3.5/http/client.py", line 877, in send
    self.connect()
  File "/usr/lib/python3.5/http/client.py", line 849, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/lib/python3.5/socket.py", line 702, in create_connection
    sock.connect(sa)
KeyboardInterrupt

I've also tried setting the env variable GOOGLE_APPLICATIONS_CREDENTIALS to empty. I'm using the pandas-gbq version at commit 64a19b.
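One workaround, assuming a service account key is available, is to pass it through the existing private_key parameter so that read_gbq never probes for default credentials (the key path below is hypothetical):

import pandas_gbq as gbq

# With private_key set, read_gbq authenticates with the service account
# directly instead of pinging the GCE metadata server for default credentials.
df = gbq.read_gbq('SELECT 1', project_id='my-project',
                  private_key='/path/to/service_account_key.json')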

TST: Potentially add a unit test to verify the functionality of parameter `auth_local_webserver`

It may be possible to add a unit test to verify the functionality of the auth_local_webserver parameter of read_gbq() and to_gbq() in #39 in order to improve testing. One potential solution is to capture the output in the unit test and look for specific text (see the sketch below):

'Go to the following link in your browser:', when auth_local_webserver=False
or
'Your browser has been opened to visit:', when auth_local_webserver=True
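A rough sketch of such a test using pytest's capsys fixture; run_auth_flow is a stand-in for however the pandas-gbq credentials flow ends up being invoked, and in a real test the underlying OAuth flow would be mocked so no browser or network access is needed:

def run_auth_flow(auth_local_webserver):
    # Stand-in for the real pandas-gbq credentials flow; the real test would
    # call into GbqConnector with the OAuth flow mocked out.
    if auth_local_webserver:
        print('Your browser has been opened to visit:')
    else:
        print('Go to the following link in your browser:')


def test_auth_local_webserver_false_prints_console_link(capsys):
    run_auth_flow(auth_local_webserver=False)
    out, err = capsys.readouterr()
    assert 'Go to the following link in your browser:' in out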

Flake8 warnings

We need to resolve these warnings to please flake8:

tony@tonypc:~/pydata-pandas-gbq$ flake8 pandas_gbq/
pandas_gbq/gbq.py:3:1: I100 Import statements are in the wrong order. import json should be before from datetime
pandas_gbq/gbq.py:5:1: I100 Import statements are in the wrong order. import uuid should be before from time
pandas_gbq/gbq.py:6:1: I100 Import statements are in the wrong order. import time should be before import uuid
pandas_gbq/gbq.py:7:1: I100 Import statements are in the wrong order. import sys should be before import time
pandas_gbq/gbq.py:11:1: I100 Import statements are in the wrong order. from distutils.version should be before import numpy
pandas_gbq/gbq.py:12:1: I101 Imported names are in the wrong order. Should be DataFrame, compat, concat
pandas_gbq/gbq.py:12:1: I201 Missing newline before sections or imports.
pandas_gbq/gbq.py:13:1: I101 Imported names are in the wrong order. Should be bytes_to_str, lzip
pandas_gbq/_version.py:183:5: N806 variable in function should be lowercase
pandas_gbq/_version.py:224:5: N806 variable in function should be lowercase
pandas_gbq/_version.py:226:9: N806 variable in function should be lowercase
pandas_gbq/tests/test_gbq.py:3:1: I100 Import statements are in the wrong order. import re should be before import pytest
pandas_gbq/tests/test_gbq.py:5:1: I201 Missing newline before sections or imports.
pandas_gbq/tests/test_gbq.py:6:1: I100 Import statements are in the wrong order. from time should be before import pytz
pandas_gbq/tests/test_gbq.py:6:1: I201 Missing newline before sections or imports.
pandas_gbq/tests/test_gbq.py:7:1: I100 Import statements are in the wrong order. import os should be before from time
pandas_gbq/tests/test_gbq.py:9:1: I100 Import statements are in the wrong order. import logging should be before from random
pandas_gbq/tests/test_gbq.py:15:1: I101 Imported names are in the wrong order. Should be range, u
pandas_gbq/tests/test_gbq.py:16:1: I101 Imported names are in the wrong order. Should be DataFrame, NaT
pandas_gbq/tests/test_gbq.py:16:1: I100 Import statements are in the wrong order. from pandas should be before from pandas.compat
pandas_gbq/tests/test_gbq.py:17:1: I201 Missing newline before sections or imports.
pandas_gbq/tests/test_gbq.py:18:1: I100 Import statements are in the wrong order. import pandas.util.testing should be before from pandas_gbq
pandas_gbq/tests/test_gbq.py:18:1: I201 Missing newline before sections or imports.

to_gbq fails to append to table because of alleged schema mismatch

DF's schema returned by pandas.io.gbq.generate_bq_schema(df):

{'fields': [
{'type': 'TIMESTAMP', 'name': 'field1'}, 
{'type': 'STRING', 'name': 'field2'}, 
{'type': 'FLOAT', 'name': 'field3'}, 
{'type': 'STRING', 'name': 'field4'},
{'type': 'FLOAT', 'name': 'field5'},
{'type': 'STRING', 'name': 'field6'}, 
{'type': 'FLOAT', 'name': 'field7'}, 
{'type': 'STRING', 'name': 'field8'},
{'type': 'STRING', 'name': 'field9'}]}

Schema from BQ CLI:

|- field1: timestamp
|- field2: string
|- field3: float
|- field4: string
|- field5: float
|- field6: string
|- field7: float
|- field8: string
|- field9: string

to_gbq(df,"dataset.table","project",if_exists="append",private_key=service_key_path)
Throws:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lib/python2.7/site-packages/pandas/io/gbq.py", line 827, in to_gbq
    raise InvalidSchema("Please verify that the structure and "
pandas.io.gbq.InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.

Structs lack proper names as dicts and arrays get turned into array of dicts

Version 0.1.4

This query returns an improperly named dict:

q = """
select struct(a,b) col
from
(SELECT 1 a, 2 b)
"""
df = gbq.read_gbq(q, dialect='standard', verbose=False)

[screenshot of the returned DataFrame omitted]

Compare with result from Big Query:
[screenshot omitted]

An array of items also gets turned into an array of dicts sometimes. For example:

q = """
select array_agg(a)
from
(select "1" a UNION ALL select "2" a)
"""
gbq.read_gbq(q, dialect='standard', verbose=False, project_id='project')

outputs:
[screenshot of the returned DataFrame omitted]

Compare to Big Query:
[screenshot omitted]

These issues may or may not be related.

--noauth_local_webserver argument not being passed through on simple Big Query statement

I'm attempting to pass in the --noauth_local_webserver param via the command line. If I dump out sys.argv, I can see that it's there. However, when pandas presents the URL to authenticate, there is still a message about using that param if the browser is on a different host than where the script is running (which is true in my case).

Here's the script I'm using that is only in charge of setting up the token:

import pandas
pandas.read_gbq('select 1', 'my-project-name')

If I SIGINT it while it waits for the response, I see the call to parse the CLI args in the stack:

  File "/usr/lib/python3.5/site-packages/pandas_gbq/gbq.py", line 234, in get_user_account_credentials
    credentials = run_flow(flow, storage, argparser.parse_args([]))

This looks like it completely prevents the arg from ever being picked up. If I manually edit the file to remove the empty array, the script will then authenticate by asking me to enter a verification code instead of listening on port 8080 (which is what I want). I'm not a heavy user of this lib, so I'm not sure if taking out the empty array will cause other issues.

Would someone be able to see if this use-case for auth on Big Query could be supported in pandas? Or is there a better way to only get an oauth token for Big Query using pandas-gbq (where the script is running on a different host than the browser)?

pandas version: 0.20.2
pandas-gbq version: 0.1.6

TST: Add support for TOX

It could be helpful to set up tox for this repo. Once it is set up, simply running tox at the root of the repo will run the tests under both Python versions, as well as flake8; a minimal configuration sketch follows.
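A minimal tox.ini sketch along those lines; the environment names and the test command are assumptions, not an existing configuration in the repo:

# tox.ini (sketch)
[tox]
envlist = py27, py36, flake8

[testenv]
deps = pytest
commands = pytest pandas_gbq

[testenv:flake8]
deps = flake8
commands = flake8 pandas_gbq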

GenericGBQException may be raised when listing dataset

I saw the following error today in a Travis-CI build log when listing tables under a dataset: 'GenericGBQException: Reason: notFound, Message: Not found: Token pandas_gbq_xxx'

@tswast also experienced this in #39 (comment)

Since the failure is intermittent, we may be able to handle the 404 error from BigQuery on the first attempt and retry the request to list tables under a dataset. I think we should still raise GenericGBQException after a second attempt, though, and monitor for unit test failures. A retry sketch follows the log below.

==================================== ERRORS ====================================
 ERROR at teardown of TestToGBQIntegrationWithServiceAccountKeyPath.test_dataset_exists 

self = <pandas_gbq.gbq._Dataset object at 0x7f358b3736d8>

    def datasets(self):
        """ Return a list of datasets in Google BigQuery
    
            Parameters
            ----------
            None
    
            Returns
            -------
            list
                List of datasets under the specific project
            """
    
        dataset_list = []
        next_page_token = None
        first_query = True
    
        while first_query or next_page_token:
            first_query = False
    
            try:
                list_dataset_response = self.service.datasets().list(
                    projectId=self.project_id,
>                   pageToken=next_page_token).execute()

pandas_gbq/gbq.py:1247: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (<googleapiclient.http.HttpRequest object at 0x7f358b352588>,)
kwargs = {}

    @functools.wraps(wrapped)
    def positional_wrapper(*args, **kwargs):
        if len(args) > max_positional_args:
            plural_s = ''
            if max_positional_args != 1:
                plural_s = 's'
            message = ('{function}() takes at most {args_max} positional '
                       'argument{plural} ({args_given} given)'.format(
                           function=wrapped.__name__,
                           args_max=max_positional_args,
                           args_given=len(args),
                           plural=plural_s))
            if positional_parameters_enforcement == POSITIONAL_EXCEPTION:
                raise TypeError(message)
            elif positional_parameters_enforcement == POSITIONAL_WARNING:
                logger.warning(message)
>       return wrapped(*args, **kwargs)

../../../miniconda/envs/test-environment/lib/python3.6/site-packages/oauth2client/_helpers.py:133: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <googleapiclient.http.HttpRequest object at 0x7f358b352588>
http = <google_auth_httplib2.AuthorizedHttp object at 0x7f358b378128>
num_retries = 0

    @util.positional(1)
    def execute(self, http=None, num_retries=0):
      """Execute the request.
    
        Args:
          http: httplib2.Http, an http object to be used in place of the
                one the HttpRequest request object was constructed with.
          num_retries: Integer, number of times to retry with randomized
                exponential backoff. If all retries fail, the raised HttpError
                represents the last request. If zero (default), we attempt the
                request only once.
    
        Returns:
          A deserialized object model of the response body as determined
          by the postproc.
    
        Raises:
          googleapiclient.errors.HttpError if the response was not a 2xx.
          httplib2.HttpLib2Error if a transport error has occured.
        """
      if http is None:
        http = self.http
    
      if self.resumable:
        body = None
        while body is None:
          _, body = self.next_chunk(http=http, num_retries=num_retries)
        return body
    
      # Non-resumable case.
    
      if 'content-length' not in self.headers:
        self.headers['content-length'] = str(self.body_size)
      # If the request URI is too long then turn it into a POST request.
      if len(self.uri) > MAX_URI_LENGTH and self.method == 'GET':
        self.method = 'POST'
        self.headers['x-http-method-override'] = 'GET'
        self.headers['content-type'] = 'application/x-www-form-urlencoded'
        parsed = urlparse(self.uri)
        self.uri = urlunparse(
            (parsed.scheme, parsed.netloc, parsed.path, parsed.params, None,
             None)
            )
        self.body = parsed.query
        self.headers['content-length'] = str(len(self.body))
    
      # Handle retries for server-side errors.
      resp, content = _retry_request(
            http, num_retries, 'request', self._sleep, self._rand, str(self.uri),
            method=str(self.method), body=self.body, headers=self.headers)
    
      for callback in self.response_callbacks:
        callback(resp)
      if resp.status >= 300:
>       raise HttpError(resp, content, uri=self.uri)
E       googleapiclient.errors.HttpError: <HttpError 404 when requesting https://www.googleapis.com/bigquery/v2/projects/[secure]/datasets?pageToken=pandas_gbq_923881&alt=json returned "Not found: Token pandas_gbq_923881">

../../../miniconda/envs/test-environment/lib/python3.6/site-packages/googleapiclient/http.py:840: HttpError

During handling of the above exception, another exception occurred:

self = <pandas_gbq.tests.test_gbq.TestToGBQIntegrationWithServiceAccountKeyPath object at 0x7f358b4293c8>
method = <bound method TestToGBQIntegrationWithServiceAccountKeyPath.test_dataset_exists of <pandas_gbq.tests.test_gbq.TestToGBQIntegrationWithServiceAccountKeyPath object at 0x7f358b4293c8>>

    def teardown_method(self, method):
        # - PER-TEST FIXTURES -
        # put here any instructions you want to be run *AFTER* *EVERY* test is
        # executed.
>       clean_gbq_environment(self.dataset_prefix, _get_private_key_path())

pandas_gbq/tests/test_gbq.py:949: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas_gbq/tests/test_gbq.py:116: in clean_gbq_environment
    all_datasets = dataset.datasets()
pandas_gbq/gbq.py:1263: in datasets
    self.process_http_error(ex)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ex = <HttpError 404 when requesting https://www.googleapis.com/bigquery/v2/projects/[secure]/datasets?pageToken=pandas_gbq_923881&alt=json returned "Not found: Token pandas_gbq_923881">

    @staticmethod
    def process_http_error(ex):
        # See `BigQuery Troubleshooting Errors
        # <https://cloud.google.com/bigquery/troubleshooting-errors>`__
    
        status = json.loads(bytes_to_str(ex.content))['error']
        errors = status.get('errors', None)
    
        if errors:
            for error in errors:
                reason = error['reason']
                message = error['message']
    
                raise GenericGBQException(
>                   "Reason: {0}, Message: {1}".format(reason, message))
E               pandas_gbq.gbq.GenericGBQException: Reason: notFound, Message: Not found: Token pandas_gbq_923881

pandas_gbq/gbq.py:450: GenericGBQException
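A sketch of that retry idea as a standalone helper; the function is hypothetical, and the real change would live inside _Dataset.datasets() with its existing pagination handling:

import time

from pandas_gbq.gbq import GenericGBQException


def list_datasets_with_retry(dataset, retries=1, delay=2):
    # Retry once on a transient 'notFound' page-token error before giving up
    # and letting GenericGBQException propagate as it does today.
    for attempt in range(retries + 1):
        try:
            return dataset.datasets()
        except GenericGBQException:
            if attempt == retries:
                raise
            time.sleep(delay)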

Scope should be configurable

I've been joining a table in BQ to a federated one (it looks like a BQ table, but it's actually a Google Sheet). It throws the error: “Encountered an error while globbing file pattern”

I've found I needed to change:
scope = 'https://www.googleapis.com/auth/bigquery'
to:
scope = ['https://www.googleapis.com/auth/bigquery', 'https://www.googleapis.com/auth/drive']

though I suppose with other federated sources, the scope may need to be configurable.

Improve the user experience when Google credentials are expired or revoked

I think the user experience could be improved when Google credentials are expired or revoked. I see that we have a reauth parameter in read_gbq and to_gbq, but this causes the authentication flow to run every time, even when the credentials are valid, in order to cater for environments with multiple users. It may be helpful to also have the ability to automatically trigger the authentication flow when credentials are expired or revoked.

https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L771

Support replacing partitions of date-partitioned tables

This issue is about adding support for writing to partitions of BigQuery's date-partitioned tables. Currently, reading is supported but inserting fails during execution of GbqConnector.delete_and_recreate: the partition is deleted, but creation then fails because the table already exists.

The PR would include some refactoring of the current to_gbq function (e.g. getting rid of duplicate calls to the BQ API).

Please see mremes/pandas-gbq for the current changes. It is not yet ready for a PR, as I've encountered one issue with the BigQuery API: with subsequent create, delete, and insertAll calls, insertAll's changes are not visible until after a long waiting time.

User-provided schema as to_gbq parameter

The gbq.to_gbq function currently derives the schema from a given DataFrame's dtypes attribute, using the dtype -> BQ data type map. This reflected schema is then passed as the fields value in the BQ API call.

I'd like to propose that it should be possible to include a schema as an argument in the to_gbq call. This would save users from painful, unnecessary dtype conversions, since the BQ API does the same conversion again on the create operation and is not provided a schema on the append loadAll operation.

@jreback what do you think? I have started with the implementation in my fork's feature branch.
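A sketch of what the proposed call could look like; the table_schema parameter name is an assumption for this proposal, not an existing to_gbq argument, and dataframe stands for the DataFrame being uploaded:

import pandas_gbq

# Hypothetical: pass an explicit BigQuery schema instead of relying on
# dtype reflection.
schema = [
    {'name': 'created_at', 'type': 'TIMESTAMP'},
    {'name': 'amount', 'type': 'FLOAT'},
    {'name': 'label', 'type': 'STRING'},
]
pandas_gbq.to_gbq(dataframe, 'dataset.table', project_id='project',
                  if_exists='append', table_schema=schema)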

KeyError in pandas.io.gbq.read_gbq when no DataFrame should be returned

Problem description

I am trying to use the read_gbq function from pandas.io.gbq as a means to execute INSERT or UPDATE queries on BigQuery. These queries do not return anything, so there is nothing for pandas to retrieve and return to my program. My expected output for something like this would be that no DataFrame is returned (i.e. it should ideally return None), instead of a KeyError being thrown even though the INSERT or UPDATE statement executed successfully.

Code Sample

from pandas.io import gbq

insert_or_update_statement = r"""
INSERT INTO `MyBigQueryTable` (ColumnA, ColumnB, ColumnC) (
    SELECT ColumnA, ColumnB, ColumnC
    FROM `AnotherBigQueryTable`
    WHERE ColumnD = 4
        AND ColumnC = ColumnE + 1
)"""
gbq.read_gbq(insert_or_update_statement, project_id='my-project', dialect='standard')

Actual vs Expected Output

Expected that None would be returned for INSERT or UPDATE statements that don't have any results to retrieve. Actual output is as follows:

    453             self._print('Retrieving results...')
    454 
--> 455         total_rows = int(query_reply['totalRows'])
    456         result_pages = list()
    457         seen_page_tokens = list()

KeyError: 'totalRows'

Current Workaround

from pandas.io import gbq

# The statement executes successfully even though a KeyError is raised.
try:
    gbq.read_gbq(insert_or_update_statement, project_id='my-project', dialect='standard')
except KeyError:
    pass

Proposed Solution

This could be fixed in a fairly simple manner by catching the missing key right there, skipping the row-retrieval parts, and returning None; a sketch follows. I would be happy to create a pull request to resolve this.
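A minimal sketch of that check as a standalone helper; the function name is hypothetical, and the real change would sit where totalRows is read in the query-result handling:

def _total_rows_or_none(query_reply):
    # DML statements such as INSERT or UPDATE return no 'totalRows' in the
    # query reply; treat that as "no result set" so read_gbq can return None
    # instead of raising KeyError.
    if 'totalRows' not in query_reply:
        return None
    return int(query_reply['totalRows'])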

StreamingInsertError occurs when uploading to a table with a new schema

As mentioned in #74, around July 11th pandas-gbq builds started failing this test: test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_data_if_table_exists_replace.

I reviewed the test failure and my initial thought is that a change was made in the BigQuery backend recently that triggered this. The issue is related to deleting and recreating a table with a different schema. Currently we force a delay of 2 minutes when a table with a modified schema is recreated. This delay is suggested in this StackOverflow post and this entry in the BigQuery issue tracker. Based on my limited testing, it seems that in addition to waiting 2 minutes, you also need to upload the data twice in order to see the data in BigQuery. During the first upload StreamingInsertError is raised. The second upload is successful.

You can easily confirm this when running the test locally. The test failure no longer appears when I change

        connector.load_data(dataframe, dataset_id, table_id, chunksize)

at
https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L1056
to

    try:
        connector.load_data(dataframe, dataset_id, table_id, chunksize)
    except:
        connector.load_data(dataframe, dataset_id, table_id, chunksize)

Based on this behaviour, I believe that now you need to upload data twice after changing the schema. It seems like this issue could be a regression on the BigQuery side (since re-uploading data wasn't required before).

I was also able to create this issue with the google-cloud-bigquery package with the following code:

from google.cloud import bigquery
from google.cloud.bigquery import SchemaField
import time

client = bigquery.Client(project=<your_project_id>)

dataset = client.dataset('test_dataset')
if not dataset.exists():
    dataset.create()

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]

table = dataset.table('test_table', SCHEMA)

if table.exists():
    try:
        table.delete()
    except:
        pass
    
table.create()
ROWS_TO_INSERT = [
    (u'Phred Phlyntstone', 32),
    (u'Wylma Phlyntstone', 29),
]
table.insert_data(ROWS_TO_INSERT)

# Now change the schema
SCHEMA = [
    SchemaField('name', 'STRING', mode='required'),
    SchemaField('age', 'STRING', mode='required'),
]
table = dataset.table('test_table', SCHEMA)

# Delete the table, wait 2 minutes and re-create the table
table.delete()
time.sleep(120)
table.create()

ROWS_TO_INSERT = [
    (u'Phred Phlyntstone', '32'),
    (u'Wylma Phlyntstone', '29'),
]
for _ in range(5):
    insert_errors = table.insert_data(ROWS_TO_INSERT)
    if len(insert_errors):
        print(insert_errors)
        print('Retrying')
    else:
        break

The output was:

>>[{'index': 0, 'errors': [{u'debugInfo': u'generic::not_found: no such field.', u'reason': u'invalid', u'message': u'no such field.', u'location': u'name'}]}, {'index': 1, 'errors': [{u'debugInfo': u'generic::not_found: no such field.', u'reason': u'invalid', u'message': u'no such field.', u'location': u'name'}]}]
>>Retrying

but prior to July 11th (or so) the retry wasn't required.

One thing that google-cloud-bigquery does is return streaming insert errors rather than raising StreamingInsertError like we do in pandas-gbq. See https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/table.py#L826 .

We could follow a similar behaviour and have to_gbq return the streaming insert errors rather than raising StreamingInsertError. We can leave it up to the user to check for streaming insert errors and retry if needed: https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L1056

internalError from to_gbq when service account doesn't have 'can edit' rights to BQ dataset

Command:

to_gbq(df, destination_table='dataset_table', project_id='foo', verbose=False, if_exists='replace', private_key='path/to/key')

Output:

GenericGBQException: Reason: internalError, Message: An internal error occurred and the request could not be completed.

I guess there is some other error metadata included in the API response but it's not printed with the exception message.

How to replicate?

  • try to call to_gbq on a table with a service account which doesn't have rights to the destination table's dataset

When I added the rights for the dataset, the DataFrame was written into the table smoothly.
