googleapis / python-bigquery-pandas

Google BigQuery connector for pandas

Home Page: https://googleapis.dev/python/pandas-gbq/latest/index.html

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 88.83%, Shell 10.43%, Dockerfile 0.74%
Topics: pandas, bigquery, data

python-bigquery-pandas's Introduction

pandas-gbq


pandas-gbq is a package providing an interface to the Google BigQuery API from pandas.

Installation

Install latest release version via conda

$ conda install pandas-gbq --channel conda-forge

Install latest release version via pip

$ pip install pandas-gbq

Install latest development version

$ pip install git+https://github.com/googleapis/python-bigquery-pandas.git

Usage

Perform a query

import pandas_gbq

result_dataframe = pandas_gbq.read_gbq("SELECT column FROM dataset.table WHERE value = 'something'")

Upload a dataframe

import pandas_gbq

pandas_gbq.to_gbq(dataframe, "dataset.table")

More samples

See the pandas-gbq documentation for more details.

python-bigquery-pandas's People

Contributors

aktech, aribray, blose, bsolomon1124, bwanglzu, chalmerlowe, dependabot[bot], gcf-owl-bot[bot], google-cloud-policy-bot[bot], jasonqng, johnpaton, jreback, kiraksi, linchin, max-sixty, melissachang, meredithslota, mr-mcox, nicoa, parthea, pbudzyns, release-please[bot], renovate-bot, rhoboro, robertlacok, shantanukumar, tswast, tworec, vreyespue, yokomotod


python-bigquery-pandas's Issues

Google auth error - invalid_grant: Token has been expired or revoked

From #76, the unit test TestGBQConnectorIntegrationWithLocalUserAccountAuth.test_get_user_account_credentials_bad_file_returns_credentials was failing for @hagino3000 with invalid_grant: Token has been expired or revoked. I believe I've seen this error as well. I'll try to reproduce it.

========================================================================== FAILURES ===========================================================================
_____________________ TestGBQConnectorIntegrationWithLocalUserAccountAuth.test_get_user_account_credentials_bad_file_returns_credentials ______________________

self = <pandas_gbq.tests.test_gbq.TestGBQConnectorIntegrationWithLocalUserAccountAuth object at 0x11327ba90>

    def test_get_user_account_credentials_bad_file_returns_credentials(self):
        import mock
        from google.auth.credentials import Credentials
>       with mock.patch('__main__.open', side_effect=IOError()):

pandas_gbq/tests/test_gbq.py:231:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../env/lib/python3.6/site-packages/mock.py:1268: in __enter__
    original, local = self.get_original()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <mock._patch object at 0x112f1c978>

    def get_original(self):
        target = self.getter()
        name = self.attribute

        original = DEFAULT
        local = False

        try:
            original = target.__dict__[name]
        except (AttributeError, KeyError):
            original = getattr(target, name, DEFAULT)
        else:
            local = True

        if not self.create and original is DEFAULT:
            raise AttributeError(
>               "%s does not have the attribute %r" % (target, name)
            )
E           AttributeError: <module '__main__' from '/Users/t-nishibayashi/dev/workspace/BigQuery-Python-dev/env/bin/pytest'> does not have the attribute 'open
'

../../env/lib/python3.6/site-packages/mock.py:1242: AttributeError
__________________________ TestGBQConnectorIntegrationWithLocalUserAccountAuth.test_get_user_account_credentials_returns_credentials __________________________

self = <pandas_gbq.tests.test_gbq.TestGBQConnectorIntegrationWithLocalUserAccountAuth object at 0x11301d940>

    def test_get_user_account_credentials_returns_credentials(self):
        from google.auth.credentials import Credentials
>       credentials = self.sut.get_user_account_credentials()

pandas_gbq/tests/test_gbq.py:237:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas_gbq/gbq.py:340: in get_user_account_credentials
    credentials = self.load_user_account_credentials()
pandas_gbq/gbq.py:298: in load_user_account_credentials
    credentials.refresh(request)
../../env/lib/python3.6/site-packages/google/oauth2/credentials.py:126: in refresh
    self._client_secret))
../../env/lib/python3.6/site-packages/google/oauth2/_client.py:189: in refresh_grant
    response_data = _token_endpoint_request(request, token_uri, body)
../../env/lib/python3.6/site-packages/google/oauth2/_client.py:109: in _token_endpoint_request
    _handle_error_response(response_body)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

response_body = '{\n  "error" : "invalid_grant",\n  "error_description" : "Token has been expired or revoked."\n}'

    def _handle_error_response(response_body):
        """"Translates an error response into an exception.

        Args:
            response_body (str): The decoded response data.

        Raises:
            google.auth.exceptions.RefreshError
        """
        try:
            error_data = json.loads(response_body)
            error_details = '{}: {}'.format(
                error_data['error'],
                error_data.get('error_description'))
        # If no details could be extracted, use the response data.
        except (KeyError, ValueError):
            error_details = response_body

        raise exceptions.RefreshError(
>           error_details, response_body)
E       google.auth.exceptions.RefreshError: ('invalid_grant: Token has been expired or revoked.', '{\n  "error" : "invalid_grant",\n  "error_description" : "T
oken has been expired or revoked."\n}')

../../env/lib/python3.6/site-packages/google/oauth2/_client.py:59: RefreshError

TST: Improve unit tests for GoogleCredentials.get_application_default()

The test test_get_application_default_credentials_returns_credentials() (https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/tests/test_gbq.py#L297) is skipped if _check_if_can_get_correct_default_credentials() returns false, which happens when default credentials are not available. To improve test coverage, specifically on Travis, we could temporarily populate the default credentials during testing so that the code paths which use them are exercised, and unset them again after testing; see the sketch below.
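A minimal sketch of such a fixture, assuming a service-account key file is exposed to CI through a hypothetical GBQ_SERVICE_ACCOUNT_KEY_PATH secret (the fixture name and variable are illustrative, not part of pandas-gbq):

import os

import pytest


@pytest.fixture
def temporary_default_credentials(monkeypatch):
    # Hypothetical fixture: point GOOGLE_APPLICATION_CREDENTIALS at a key
    # file for the duration of a test so google.auth.default() succeeds.
    key_path = os.environ.get("GBQ_SERVICE_ACCOUNT_KEY_PATH")  # assumed CI secret
    if key_path is None:
        pytest.skip("no service account key available for default credentials")
    monkeypatch.setenv("GOOGLE_APPLICATION_CREDENTIALS", key_path)
    yield
    # monkeypatch automatically restores the original environment on teardown.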

Missing license headers

Related to #30, the LICENSE file includes a header that is supposed to be reproduced at the top of the code files, but I'm not seeing it there.

array_agg() fails because of TypeError

Version: 0.1.4

When doing array_agg() on a column of INTs, you get a TypeError. The same query works fine in the BigQuery UI.

q = """
SELECT a, array_agg(mean_rank) ranks, count(*) ct
FROM `table`
group by a 
"""
gbq.read_gbq(q, dialect='standard', verbose=False, project_id='project')

Another example query to reproduce:

select array_agg(a)
from
(select 1 a)

Error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-b8c08e788608> in <module>()
      5 
      6 """
----> 7 gbq.read_gbq(q, dialect='standard', verbose=False, project_id='project')
      8

/home/jasonng/anaconda2/lib/python2.7/site-packages/pandas_gbq/gbq.pyc in read_gbq(query, project_id, index_col, col_order, reauth, verbose, private_key, dialect, **kwargs)
    725     while len(pages) > 0:
    726         page = pages.pop()
--> 727         dataframe_list.append(_parse_data(schema, page))
    728 
    729     if len(dataframe_list) > 0:

/home/jasonng/anaconda2/lib/python2.7/site-packages/pandas_gbq/gbq.pyc in _parse_data(schema, rows)
    621         for col_num, field_type in enumerate(col_types):
    622             field_value = _parse_entry(entries[col_num].get('v', ''),
--> 623                                        field_type)
    624             page_array[row_num][col_num] = field_value
    625 

/home/jasonng/anaconda2/lib/python2.7/site-packages/pandas_gbq/gbq.pyc in _parse_entry(field_value, field_type)
    631         return None
    632     if field_type == 'INTEGER':
--> 633         return int(field_value)
    634     elif field_type == 'FLOAT':
    635         return float(field_value)

TypeError: int() argument must be a string or a number, not 'list'
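A hedged sketch of one possible fix in _parse_entry, assuming REPEATED field values arrive from the BigQuery API as lists of {'v': value} wrappers (this assumption would need to be verified against the actual response shape):

def _parse_entry(field_value, field_type):
    # Sketch only: handle repeated fields (e.g. array_agg results) by parsing
    # each wrapped element instead of calling int()/float() on the whole list.
    if field_value is None or field_value == '':
        return None
    if isinstance(field_value, list):
        return [_parse_entry(item.get('v'), field_type) for item in field_value]
    if field_type == 'INTEGER':
        return int(field_value)
    elif field_type == 'FLOAT':
        return float(field_value)
    return field_value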

Replace oauth2client by google-auth

oauth2client is now deprecated. No more features will be added to the libraries and the core team is turning down support. We recommend you use google-auth and oauthlib.
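For reference, a minimal sketch of obtaining Application Default Credentials with google-auth instead of oauth2client; the scope shown is the one pandas-gbq already requests:

import google.auth

# google-auth replacement for
# oauth2client.client.GoogleCredentials.get_application_default().
credentials, project_id = google.auth.default(
    scopes=['https://www.googleapis.com/auth/bigquery'])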

CLN: use wait_for_job rather than sleep

Code here: #24 (comment), to replace the sleep calls in tests.

import logging
import time

logger = logging.getLogger(__name__)


def wait_for_job(job):
    # from https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/cloud-client/snippets.py
    while True:
        job.reload()  # Refreshes the state via a GET request.
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        logger.info("Waiting for {} to complete".format(job))
        time.sleep(1)

Add support for DATETIME BigQuery type

Recently a question was posted on Stack Overflow regarding support for the DATETIME field type in BigQuery. I was wondering if it's possible to update the mapping that the to_gbq method does so that DATETIME values are also mapped.

Thanks!
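A rough sketch of what an extended dtype-to-BigQuery-type mapping could look like; the function name is illustrative rather than the actual pandas-gbq internals, and a real change would also need to distinguish tz-aware datetimes (TIMESTAMP) from naive ones (DATETIME):

def _bq_field_for_column(column_name, dtype):
    # Illustrative mapping from a numpy dtype kind to a BigQuery type.
    type_mapping = {
        'i': 'INTEGER',
        'b': 'BOOLEAN',
        'f': 'FLOAT',
        'O': 'STRING',
        'M': 'DATETIME',  # naive datetime64[ns] -> DATETIME instead of TIMESTAMP
    }
    return {'name': column_name, 'type': type_mapping.get(dtype.kind, 'STRING')}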

Possible performance issue when reading large datasets from BigQuery

An issue was reported on StackOverflow about downloading 1,000,000 rows from BigQuery. See https://stackoverflow.com/questions/44868111/failed-to-import-large-data-as-dataframe-from-google-bigquery-to-google-cloud-d

Secondly I use:

data = pd.read_gbq(query='SELECT {ABOUT_30_COLUMNS...} FROM TABLE_NAME LIMIT 1000000', dialect ='standard', project_id='PROJECT_ID')

It runs well at first, but when it goes to about 450,000 rows (calculate using percentage and total row count), it gets stuck at:

Got page: 32; 45.0% done. Elapsed 293.1 s.

We have an integration test for test_download_dataset_larger_than_200k_rows https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/tests/test_gbq.py#L709 . It may be helpful to also include a performance test (and increase the dataset size).

Request payload size exceeds the limit: 10485760 bytes.

I am writing a DataFrame to BigQuery from pandas and have received this error:

pandas_gbq.gbq.GenericGBQException: Reason: badRequest, Message: Request payload size exceeds the limit: 10485760 bytes

Strangely I see the message that streaming insert was 100% complete:

Streaming Insert is 100% Complete

Traceback (most recent call last):
  File "...lib/python2.7/site-packages/pandas/core/frame.py", line 957, in to_gbq
    if_exists=if_exists, private_key=private_key)
  File "...lib/python2.7/site-packages/pandas/io/gbq.py", line 109, in to_gbq
    if_exists=if_exists, private_key=private_key)
  File "...lib/python2.7/site-packages/pandas_gbq/gbq.py", line 1056, in to_gbq
    connector.load_data(dataframe, dataset_id, table_id, chunksize)
  File "...lib/python2.7/site-packages/pandas_gbq/gbq.py", line 643, in load_data
    self.process_http_error(ex)
  File "...lib/python2.7/site-packages/pandas_gbq/gbq.py", line 454, in process_http_error
    "Reason: {0}, Message: {1}".format(reason, message))
pandas_gbq.gbq.GenericGBQException: Reason: badRequest, Message: Request payload size exceeds the limit: 10485760 bytes.
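As a possible workaround, uploading in smaller chunks keeps each streaming-insert request under the 10485760-byte payload limit. A sketch, where dataframe stands for the DataFrame being uploaded and the chunk size shown is only a guess that depends on row width:

import pandas_gbq

# Smaller chunks mean more requests, but each request body stays well under
# the 10 MB streaming-insert payload limit.
pandas_gbq.to_gbq(dataframe, "dataset.table", project_id="project",
                  chunksize=2000, if_exists="append")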

read_gbq() unnecessarily waiting on getting default credentials from Google

When attempting to grant pandas access to my GBQ project, I am running into an issue where read_gbq is trying to get default credentials, failing / timing out, and then printing out a URL to visit in order to grant the credentials. Since I'm not running this on Google Cloud Platform, I do not expect to be able to get default credentials. In my case, I only want to run the CLI flow (without having OAuth call back to my local server).

Here's the code

>>> import pandas_gbq as gbq
>>> gbq.read_gbq('SELECT 1', project_id=<project_id>, auth_local_webserver=False)

Here's what I see when I trigger a SIGINT once the query is invoked:

  File "/usr/lib/python3.5/site-packages/pandas_gbq/gbq.py", line 214, in get_credentials
    credentials = self.get_application_default_credentials()
  File "/usr/lib/python3.5/site-packages/pandas_gbq/gbq.py", line 243, in get_application_default_credentials
    credentials, _ = google.auth.default(scopes=[self.scope])
  File "/usr/lib/python3.5/site-packages/google/auth/_default.py", line 277, in default
    credentials, project_id = checker()
  File "/usr/lib/python3.5/site-packages/google/auth/_default.py", line 274, in <lambda>
    lambda: _get_gce_credentials(request))
  File "/usr/lib/python3.5/site-packages/google/auth/_default.py", line 176, in _get_gce_credentials
    if _metadata.ping(request=request):
  File "/usr/lib/python3.5/site-packages/google/auth/compute_engine/_metadata.py", line 73, in ping
    timeout=timeout)
  File "/usr/lib/python3.5/site-packages/google/auth/transport/_http_client.py", line 103, in __call__
    method, path, body=body, headers=headers, **kwargs)
  File "/usr/lib/python3.5/http/client.py", line 1106, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.5/http/client.py", line 1151, in _send_request
    self.endheaders(body)
  File "/usr/lib/python3.5/http/client.py", line 1102, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python3.5/http/client.py", line 934, in _send_output
    self.send(msg)
  File "/usr/lib/python3.5/http/client.py", line 877, in send
    self.connect()
  File "/usr/lib/python3.5/http/client.py", line 849, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/lib/python3.5/socket.py", line 702, in create_connection
    sock.connect(sa)
KeyboardInterrupt

I've also tried setting the env variable GOOGLE_APPLICATIONS_CREDENTIALS to empty. I'm using the pandas-gbq version at commit 64a19b.
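One workaround, assuming a service account key is available, is to pass it through the existing private_key parameter so that read_gbq never probes for default credentials (the key path below is hypothetical):

import pandas_gbq as gbq

# With private_key set, read_gbq authenticates with the service account
# directly instead of pinging the GCE metadata server for default credentials.
df = gbq.read_gbq('SELECT 1', project_id='my-project',
                  private_key='/path/to/service_account_key.json')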

TST: Potentially add a unit test to verify the functionality of parameter `auth_local_webserver`

It may be possible to add a unit test to verify the functionality of the auth_local_webserver parameter of read_gbq() and to_gbq() in #39 in order to improve testing. One potential solution is to capture the output in the unit test and look for specific text (see the sketch below):

'Go to the following link in your browser:', when auth_local_webserver=False
or
'Your browser has been opened to visit:', when auth_local_webserver=True
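A rough sketch of such a test using pytest's capsys fixture; run_auth_flow is a stand-in for however the pandas-gbq credentials flow ends up being invoked, and in a real test the underlying OAuth flow would be mocked so no browser or network access is needed:

def run_auth_flow(auth_local_webserver):
    # Stand-in for the real pandas-gbq credentials flow; the real test would
    # call into GbqConnector with the OAuth flow mocked out.
    if auth_local_webserver:
        print('Your browser has been opened to visit:')
    else:
        print('Go to the following link in your browser:')


def test_auth_local_webserver_false_prints_console_link(capsys):
    run_auth_flow(auth_local_webserver=False)
    out, err = capsys.readouterr()
    assert 'Go to the following link in your browser:' in out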

Flake8 warnings

We need to resolve these warnings to please flake8:

tony@tonypc:~/pydata-pandas-gbq$ flake8 pandas_gbq/
pandas_gbq/gbq.py:3:1: I100 Import statements are in the wrong order. import json should be before from datetime
pandas_gbq/gbq.py:5:1: I100 Import statements are in the wrong order. import uuid should be before from time
pandas_gbq/gbq.py:6:1: I100 Import statements are in the wrong order. import time should be before import uuid
pandas_gbq/gbq.py:7:1: I100 Import statements are in the wrong order. import sys should be before import time
pandas_gbq/gbq.py:11:1: I100 Import statements are in the wrong order. from distutils.version should be before import numpy
pandas_gbq/gbq.py:12:1: I101 Imported names are in the wrong order. Should be DataFrame, compat, concat
pandas_gbq/gbq.py:12:1: I201 Missing newline before sections or imports.
pandas_gbq/gbq.py:13:1: I101 Imported names are in the wrong order. Should be bytes_to_str, lzip
pandas_gbq/_version.py:183:5: N806 variable in function should be lowercase
pandas_gbq/_version.py:224:5: N806 variable in function should be lowercase
pandas_gbq/_version.py:226:9: N806 variable in function should be lowercase
pandas_gbq/tests/test_gbq.py:3:1: I100 Import statements are in the wrong order. import re should be before import pytest
pandas_gbq/tests/test_gbq.py:5:1: I201 Missing newline before sections or imports.
pandas_gbq/tests/test_gbq.py:6:1: I100 Import statements are in the wrong order. from time should be before import pytz
pandas_gbq/tests/test_gbq.py:6:1: I201 Missing newline before sections or imports.
pandas_gbq/tests/test_gbq.py:7:1: I100 Import statements are in the wrong order. import os should be before from time
pandas_gbq/tests/test_gbq.py:9:1: I100 Import statements are in the wrong order. import logging should be before from random
pandas_gbq/tests/test_gbq.py:15:1: I101 Imported names are in the wrong order. Should be range, u
pandas_gbq/tests/test_gbq.py:16:1: I101 Imported names are in the wrong order. Should be DataFrame, NaT
pandas_gbq/tests/test_gbq.py:16:1: I100 Import statements are in the wrong order. from pandas should be before from pandas.compat
pandas_gbq/tests/test_gbq.py:17:1: I201 Missing newline before sections or imports.
pandas_gbq/tests/test_gbq.py:18:1: I100 Import statements are in the wrong order. import pandas.util.testing should be before from pandas_gbq
pandas_gbq/tests/test_gbq.py:18:1: I201 Missing newline before sections or imports.

to_gbq fails to append to table because of alleged schema mismatch

DF's schema returned by pandas.io.gbq.generate_bq_schema(df):

{'fields': [
{'type': 'TIMESTAMP', 'name': 'field1'}, 
{'type': 'STRING', 'name': 'field2'}, 
{'type': 'FLOAT', 'name': 'field3'}, 
{'type': 'STRING', 'name': 'field4'},
{'type': 'FLOAT', 'name': 'field5'},
{'type': 'STRING', 'name': 'field6'}, 
{'type': 'FLOAT', 'name': 'field7'}, 
{'type': 'STRING', 'name': 'field8'},
{'type': 'STRING', 'name': 'field9'}]}

Schema from BQ CLI:

|- field1: timestamp
|- field2: string
|- field3: float
|- field4: string
|- field5: float
|- field6: string
|- field7: float
|- field8: string
|- field9: string

to_gbq(df,"dataset.table","project",if_exists="append",private_key=service_key_path)
Throws:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lib/python2.7/site-packages/pandas/io/gbq.py", line 827, in to_gbq
    raise InvalidSchema("Please verify that the structure and "
pandas.io.gbq.InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.

Structs lack proper names as dicts and arrays get turned into array of dicts

Version 0.1.4

This query returns an improperly named dict:

q = """
select struct(a,b) col
from
(SELECT 1 a, 2 b)
"""
df = gbq.read_gbq(q, dialect='standard', verbose=False)

[screenshot of the returned DataFrame omitted]

Compare with result from Big Query:
[screenshot omitted]

An array of items also gets turned into an array of dicts sometimes. For example:

q = """
select array_agg(a)
from
(select "1" a UNION ALL select "2" a)
"""
gbq.read_gbq(q, dialect='standard', verbose=False, project_id='project')

outputs:
[screenshot of the returned DataFrame omitted]

Compare to Big Query:
[screenshot omitted]

These issues may or may not be related.

--noauth_local_webserver argument not being passed through on simple Big Query statement

I'm attempting to pass in the --noauth_local_webserver param via the command line. If I dump out sys.argv, I can see that it's there. However, when pandas presents the URL to authenticate, there is still a message about using that param if the browser is on a different host than where the script is running (which is true in my case).

Here's the script I'm using that is only in charge of setting up the token:

import pandas
pandas.read_gbq('select 1', 'my-project-name')

If I SIGINT it while it waits for the response, I see the call to parse the CLI args in the stack:

  File "/usr/lib/python3.5/site-packages/pandas_gbq/gbq.py", line 234, in get_user_account_credentials
    credentials = run_flow(flow, storage, argparser.parse_args([]))

This looks like it completely prevents the arg from ever being picked up. If I manually edit the file to remove the empty array, the script will then authenticate by asking me to enter a verification code instead of listening on port 8080 (which is what I want). I'm not a heavy user of this lib, so I'm not sure if taking out the empty array will cause other issues.

Would someone be able to see if this use-case for auth on Big Query could be supported in pandas? Or is there a better way to only get an oauth token for Big Query using pandas-gbq (where the script is running on a different host than the browser)?

pandas version: 0.20.2
pandas-gbq version: 0.1.6

TST: Add support for TOX

It could be helpful to set up tox for this repo. Once it is set up, simply running tox at the root of the repo will run the tests under both Python versions, as well as flake8; a minimal configuration sketch follows.
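A minimal tox.ini sketch along those lines; the environment names and the test command are assumptions, not an existing configuration in the repo:

# tox.ini (sketch)
[tox]
envlist = py27, py36, flake8

[testenv]
deps = pytest
commands = pytest pandas_gbq

[testenv:flake8]
deps = flake8
commands = flake8 pandas_gbq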

GenericGBQException may be raised when listing dataset

I saw the following error today in a Travis-CI build log when listing tables under a dataset: 'GenericGBQException: Reason: notFound, Message: Not found: Token pandas_gbq_xxx'

@tswast also experienced this in #39 (comment)

Since the failure is intermittent, we may be able to handle the 404 error from BigQuery on the first attempt and retry the request to list tables under a dataset. I think we should still raise GenericGBQException after a second attempt, though, and monitor for unit test failures. A retry sketch follows the log below.

==================================== ERRORS ====================================
 ERROR at teardown of TestToGBQIntegrationWithServiceAccountKeyPath.test_dataset_exists 

self = <pandas_gbq.gbq._Dataset object at 0x7f358b3736d8>

    def datasets(self):
        """ Return a list of datasets in Google BigQuery
    
            Parameters
            ----------
            None
    
            Returns
            -------
            list
                List of datasets under the specific project
            """
    
        dataset_list = []
        next_page_token = None
        first_query = True
    
        while first_query or next_page_token:
            first_query = False
    
            try:
                list_dataset_response = self.service.datasets().list(
                    projectId=self.project_id,
>                   pageToken=next_page_token).execute()

pandas_gbq/gbq.py:1247: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (<googleapiclient.http.HttpRequest object at 0x7f358b352588>,)
kwargs = {}

    @functools.wraps(wrapped)
    def positional_wrapper(*args, **kwargs):
        if len(args) > max_positional_args:
            plural_s = ''
            if max_positional_args != 1:
                plural_s = 's'
            message = ('{function}() takes at most {args_max} positional '
                       'argument{plural} ({args_given} given)'.format(
                           function=wrapped.__name__,
                           args_max=max_positional_args,
                           args_given=len(args),
                           plural=plural_s))
            if positional_parameters_enforcement == POSITIONAL_EXCEPTION:
                raise TypeError(message)
            elif positional_parameters_enforcement == POSITIONAL_WARNING:
                logger.warning(message)
>       return wrapped(*args, **kwargs)

../../../miniconda/envs/test-environment/lib/python3.6/site-packages/oauth2client/_helpers.py:133: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <googleapiclient.http.HttpRequest object at 0x7f358b352588>
http = <google_auth_httplib2.AuthorizedHttp object at 0x7f358b378128>
num_retries = 0

    @util.positional(1)
    def execute(self, http=None, num_retries=0):
      """Execute the request.
    
        Args:
          http: httplib2.Http, an http object to be used in place of the
                one the HttpRequest request object was constructed with.
          num_retries: Integer, number of times to retry with randomized
                exponential backoff. If all retries fail, the raised HttpError
                represents the last request. If zero (default), we attempt the
                request only once.
    
        Returns:
          A deserialized object model of the response body as determined
          by the postproc.
    
        Raises:
          googleapiclient.errors.HttpError if the response was not a 2xx.
          httplib2.HttpLib2Error if a transport error has occured.
        """
      if http is None:
        http = self.http
    
      if self.resumable:
        body = None
        while body is None:
          _, body = self.next_chunk(http=http, num_retries=num_retries)
        return body
    
      # Non-resumable case.
    
      if 'content-length' not in self.headers:
        self.headers['content-length'] = str(self.body_size)
      # If the request URI is too long then turn it into a POST request.
      if len(self.uri) > MAX_URI_LENGTH and self.method == 'GET':
        self.method = 'POST'
        self.headers['x-http-method-override'] = 'GET'
        self.headers['content-type'] = 'application/x-www-form-urlencoded'
        parsed = urlparse(self.uri)
        self.uri = urlunparse(
            (parsed.scheme, parsed.netloc, parsed.path, parsed.params, None,
             None)
            )
        self.body = parsed.query
        self.headers['content-length'] = str(len(self.body))
    
      # Handle retries for server-side errors.
      resp, content = _retry_request(
            http, num_retries, 'request', self._sleep, self._rand, str(self.uri),
            method=str(self.method), body=self.body, headers=self.headers)
    
      for callback in self.response_callbacks:
        callback(resp)
      if resp.status >= 300:
>       raise HttpError(resp, content, uri=self.uri)
E       googleapiclient.errors.HttpError: <HttpError 404 when requesting https://www.googleapis.com/bigquery/v2/projects/[secure]/datasets?pageToken=pandas_gbq_923881&alt=json returned "Not found: Token pandas_gbq_923881">

../../../miniconda/envs/test-environment/lib/python3.6/site-packages/googleapiclient/http.py:840: HttpError

During handling of the above exception, another exception occurred:

self = <pandas_gbq.tests.test_gbq.TestToGBQIntegrationWithServiceAccountKeyPath object at 0x7f358b4293c8>
method = <bound method TestToGBQIntegrationWithServiceAccountKeyPath.test_dataset_exists of <pandas_gbq.tests.test_gbq.TestToGBQIntegrationWithServiceAccountKeyPath object at 0x7f358b4293c8>>

    def teardown_method(self, method):
        # - PER-TEST FIXTURES -
        # put here any instructions you want to be run *AFTER* *EVERY* test is
        # executed.
>       clean_gbq_environment(self.dataset_prefix, _get_private_key_path())

pandas_gbq/tests/test_gbq.py:949: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas_gbq/tests/test_gbq.py:116: in clean_gbq_environment
    all_datasets = dataset.datasets()
pandas_gbq/gbq.py:1263: in datasets
    self.process_http_error(ex)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ex = <HttpError 404 when requesting https://www.googleapis.com/bigquery/v2/projects/[secure]/datasets?pageToken=pandas_gbq_923881&alt=json returned "Not found: Token pandas_gbq_923881">

    @staticmethod
    def process_http_error(ex):
        # See `BigQuery Troubleshooting Errors
        # <https://cloud.google.com/bigquery/troubleshooting-errors>`__
    
        status = json.loads(bytes_to_str(ex.content))['error']
        errors = status.get('errors', None)
    
        if errors:
            for error in errors:
                reason = error['reason']
                message = error['message']
    
                raise GenericGBQException(
>                   "Reason: {0}, Message: {1}".format(reason, message))
E               pandas_gbq.gbq.GenericGBQException: Reason: notFound, Message: Not found: Token pandas_gbq_923881

pandas_gbq/gbq.py:450: GenericGBQException
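A sketch of that retry idea as a standalone helper; the function is hypothetical, and the real change would live inside _Dataset.datasets() with its existing pagination handling:

import time

from pandas_gbq.gbq import GenericGBQException


def list_datasets_with_retry(dataset, retries=1, delay=2):
    # Retry once on a transient 'notFound' page-token error before giving up
    # and letting GenericGBQException propagate as it does today.
    for attempt in range(retries + 1):
        try:
            return dataset.datasets()
        except GenericGBQException:
            if attempt == retries:
                raise
            time.sleep(delay)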

Scope should be configurable

I've been joining a table in BQ to a federated one (it looks like a BQ table, but it's actually a Google Sheet). It throws the error: “Encountered an error while globbing file pattern”

I've found I needed to change:
scope = 'https://www.googleapis.com/auth/bigquery'
to:
scope = ['https://www.googleapis.com/auth/bigquery', 'https://www.googleapis.com/auth/drive']

though I suppose with other federated sources, the scope may need to be configurable.

Improve the user experience when Google credentials are expired or revoked

I think the user experience could be improved when Google credentials are expired or revoked. I see that we have a reauth parameter in read_gbq and to_gbq, but this causes the authentication flow to run every time, even when the credentials are valid, in order to cater for environments with multiple users. It may be helpful to also have the ability to automatically trigger the authentication flow when credentials are expired or revoked.

https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L771

Support replacing partitions of date-partitioned tables

This issue is about adding support for writing to partitions of BigQuery's date-partitioned tables. Currently, reading is supported but inserting fails during execution of GbqConnector.delete_and_recreate: the partition is deleted, but creation then fails because the table already exists.

The PR would include some refactoring of the current to_gbq function (e.g. getting rid of duplicate calls to the BQ API).

Please see mremes/pandas-gbq for the current changes. It is not yet ready for a PR, as I've encountered one issue with the BigQuery API: with subsequent create, delete, and insertAll calls, insertAll's changes are not visible until after a long waiting time.

User-provided schema as to_gbq parameter

The gbq.to_gbq function currently derives the schema from a given DataFrame's dtypes attribute, using the dtype -> BQ data type map. This reflected schema is then passed as the fields value in the BQ API call.

I'd like to propose that it should be possible to include a schema as an argument in the to_gbq call. This would save users from painful, unnecessary dtype conversions, since the BQ API does the same conversion again on the create operation and is not provided a schema on the append loadAll operation.

@jreback what do you think? I have started with the implementation in my fork's feature branch.
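A sketch of what the proposed call could look like; the table_schema parameter name is an assumption for this proposal, not an existing to_gbq argument, and dataframe stands for the DataFrame being uploaded:

import pandas_gbq

# Hypothetical: pass an explicit BigQuery schema instead of relying on
# dtype reflection.
schema = [
    {'name': 'created_at', 'type': 'TIMESTAMP'},
    {'name': 'amount', 'type': 'FLOAT'},
    {'name': 'label', 'type': 'STRING'},
]
pandas_gbq.to_gbq(dataframe, 'dataset.table', project_id='project',
                  if_exists='append', table_schema=schema)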

KeyError in pandas.io.gbq.read_gbq when no DataFrame should be returned

Problem description

I am trying to use the read_gbq function from pandas.io.gbq as a means to execute INSERT or UPDATE queries on BigQuery. These queries do not return anything, so there is nothing for pandas to retrieve and return to my program. My expected output for something like this would be that no DataFrame is returned (i.e. it should ideally return None), instead of a KeyError being thrown even though the INSERT or UPDATE statement executed successfully.

Code Sample

from pandas.io import gbq

insert_or_update_statement = r"""
INSERT INTO `MyBigQueryTable` (ColumnA, ColumnB, ColumnC) (
    SELECT ColumnA, ColumnB, ColumnC
    FROM `AnotherBigQueryTable`
    WHERE ColumnD = 4
        AND ColumnC = ColumnE + 1
)"""
gbq.read_gbq(insert_or_update_statement, project_id='my-project', dialect='standard')

Actual vs Expected Output

Expected that None would be returned for INSERT or UPDATE statements that don't have any results to retrieve. Actual output is as follows:

    453             self._print('Retrieving results...')
    454 
--> 455         total_rows = int(query_reply['totalRows'])
    456         result_pages = list()
    457         seen_page_tokens = list()

KeyError: 'totalRows'

Current Workaround

from pandas.io import gbq

# The statement executes successfully even though a KeyError is raised.
try:
    gbq.read_gbq(insert_or_update_statement, project_id='my-project', dialect='standard')
except KeyError:
    pass

Proposed Solution

This could be fixed in a fairly simple manner by catching the missing key right there, skipping the row-retrieval parts, and returning None; a sketch follows. I would be happy to create a pull request to resolve this.
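A minimal sketch of that check as a standalone helper; the function name is hypothetical, and the real change would sit where totalRows is read in the query-result handling:

def _total_rows_or_none(query_reply):
    # DML statements such as INSERT or UPDATE return no 'totalRows' in the
    # query reply; treat that as "no result set" so read_gbq can return None
    # instead of raising KeyError.
    if 'totalRows' not in query_reply:
        return None
    return int(query_reply['totalRows'])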

StreamingInsertError occurs when uploading to a table with a new schema

As mentioned in #74, around July 11th pandas-gbq builds started failing this test: test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_data_if_table_exists_replace.

I reviewed the test failure and my initial thought is that a change was made in the BigQuery backend recently that triggered this. The issue is related to deleting and recreating a table with a different schema. Currently we force a delay of 2 minutes when a table with a modified schema is recreated. This delay is suggested in this StackOverflow post and this entry in the BigQuery issue tracker. Based on my limited testing, it seems that in addition to waiting 2 minutes, you also need to upload the data twice in order to see the data in BigQuery. During the first upload StreamingInsertError is raised. The second upload is successful.

You can easily confirm this when running the test locally. The test failure no longer appears when I change

        connector.load_data(dataframe, dataset_id, table_id, chunksize)

at
https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L1056
to

    try:
        connector.load_data(dataframe, dataset_id, table_id, chunksize)
    except:
        connector.load_data(dataframe, dataset_id, table_id, chunksize)

Based on this behaviour, I believe that now you need to upload data twice after changing the schema. It seems like this issue could be a regression on the BigQuery side (since re-uploading data wasn't required before).

I was also able to create this issue with the google-cloud-bigquery package with the following code:

from google.cloud import bigquery
from google.cloud.bigquery import SchemaField
import time

client = bigquery.Client(project=<your_project_id>)

dataset = client.dataset('test_dataset')
if not dataset.exists():
    dataset.create()

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]

table = dataset.table('test_table', SCHEMA)

if table.exists():
    try:
        table.delete()
    except:
        pass
    
table.create()
ROWS_TO_INSERT = [
    (u'Phred Phlyntstone', 32),
    (u'Wylma Phlyntstone', 29),
]
table.insert_data(ROWS_TO_INSERT)

# Now change the schema
SCHEMA = [
    SchemaField('name', 'STRING', mode='required'),
    SchemaField('age', 'STRING', mode='required'),
]
table = dataset.table('test_table', SCHEMA)

# Delete the table, wait 2 minutes and re-create the table
table.delete()
time.sleep(120)
table.create()

ROWS_TO_INSERT = [
    (u'Phred Phlyntstone', '32'),
    (u'Wylma Phlyntstone', '29'),
]
for _ in range(5):
    insert_errors = table.insert_data(ROWS_TO_INSERT)
    if len(insert_errors):
        print(insert_errors)
        print('Retrying')
    else:
        break

The output was:

>>[{'index': 0, 'errors': [{u'debugInfo': u'generic::not_found: no such field.', u'reason': u'invalid', u'message': u'no such field.', u'location': u'name'}]}, {'index': 1, 'errors': [{u'debugInfo': u'generic::not_found: no such field.', u'reason': u'invalid', u'message': u'no such field.', u'location': u'name'}]}]
>>Retrying

but prior to July 11th (or so) the retry wasn't required.

One thing that google-cloud-bigquery does is return streaming insert errors rather than raising StreamingInsertError like we do in pandas-gbq. See https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/table.py#L826 .

We could follow a similar behaviour and have to_gbq return the streaming insert errors rather than raising StreamingInsertError. We can leave it up to the user to check for streaming insert errors and retry if needed: https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L1056

internalError from to_gbq when service account doesn't have 'can edit' rights to BQ dataset

Command:

to_gbq(df, destination_table='dataset_table', project_id='foo', verbose=False, if_exists='replace', private_key='path/to/key')

Output:

GenericGBQException: Reason: internalError, Message: An internal error occurred and the request could not be completed.

I guess there is some other error metadata included in the API response but it's not printed with the exception message.

How to replicate?

  • try to call to_gbq on a table with a service account which doesn't have rights to the destination table's dataset

When I added the rights for the dataset, the DataFrame was written into the table smoothly.
