BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

License: Apache License 2.0

Scala 19.93% Dockerfile 0.02% Java 79.45% Python 0.16% Shell 0.43%

bigquery-storage-api bigquery google-bigquery spark google-cloud google-cloud-dataproc

spark-bigquery-connector's Introduction

Apache Spark SQL connector for Google BigQuery

The connector supports reading Google BigQuery tables into Spark's DataFrames, and writing DataFrames back into BigQuery. This is done by using the Spark SQL Data Source API to communicate with BigQuery.

BigQuery Storage API

The Storage API streams data in parallel directly from BigQuery via gRPC without using Google Cloud Storage as an intermediary.

It has a number of advantages over using the previous export-based read flow that should generally lead to better read performance:

Direct Streaming

It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using the Arrow or Avro wire formats.

Filtering

The new API allows column and predicate filtering to only read the data you are interested in.

Column Filtering

Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.

Predicate Filtering

The Storage API supports arbitrary pushdown of predicate filters. Connector version 0.8.0-beta and above support pushdown of arbitrary filters to Bigquery.

There is a known issue in Spark that does not allow pushdown of filters on nested fields. For example - filters like address.city = "Sunnyvale" will not get pushdown to Bigquery.

Dynamic Sharding

The API rebalances records between readers until they all complete. This means that all Map phases will finish nearly concurrently. See this blog article on how dynamic sharding is similarly used in Google Cloud Dataflow.

See Configuring Partitioning for more details.

Requirements

Enable the BigQuery Storage API

Follow these instructions.

Create a Google Cloud Dataproc cluster (Optional)

If you do not have an Apache Spark environment you can create a Cloud Dataproc cluster with pre-configured auth. The following examples assume you are using Cloud Dataproc, but you can use spark-submit on any cluster.

Any Dataproc cluster using the API needs the 'bigquery' or 'cloud-platform' scopes. Dataproc clusters have the 'bigquery' scope by default, so most clusters in enabled projects should work by default e.g.

MY_CLUSTER=...
gcloud dataproc clusters create "$MY_CLUSTER"

Downloading and Using the Connector

The latest version of the connector is publicly available in the following links:

version	Link
Spark 3.5	`gs://spark-lib/bigquery/spark-3.5-bigquery-0.37.0.jar`(HTTP link)
Spark 3.4	`gs://spark-lib/bigquery/spark-3.4-bigquery-0.37.0.jar`(HTTP link)
Spark 3.3	`gs://spark-lib/bigquery/spark-3.3-bigquery-0.37.0.jar`(HTTP link)
Spark 3.2	`gs://spark-lib/bigquery/spark-3.2-bigquery-0.37.0.jar`(HTTP link)
Spark 3.1	`gs://spark-lib/bigquery/spark-3.1-bigquery-0.37.0.jar`(HTTP link)
Spark 2.4	`gs://spark-lib/bigquery/spark-2.4-bigquery-0.37.0.jar`(HTTP link)
Scala 2.13	`gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.13-0.37.0.jar` (HTTP link)
Scala 2.12	`gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.37.0.jar` (HTTP link)
Scala 2.11	`gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.29.0.jar` (HTTP link)

The first four versions are Java based connectors targeting Spark 2.4/3.1/3.2/3.3 of all Scala versions built on the new Data Source APIs (Data Source API v2) of Spark.

The final two connectors are Scala based connectors, please use the jar relevant to your Spark installation as outlined below.

Connector to Spark Compatibility Matrix

Connector \ Spark	2.3	2.4	3.0	3.1	3.2	3.3	3.4	3.5
spark-3.5-bigquery								✓
spark-3.4-bigquery							✓	✓
spark-3.3-bigquery						✓	✓	✓
spark-3.2-bigquery					✓	✓	✓	✓
spark-3.1-bigquery				✓	✓	✓	✓	✓
spark-2.4-bigquery		✓
spark-bigquery-with-dependencies_2.13					✓	✓	✓	✓
spark-bigquery-with-dependencies_2.12		✓	✓	✓	✓	✓	✓	✓
spark-bigquery-with-dependencies_2.11	✓	✓

Connector to Dataproc Image Compatibility Matrix

Connector \ Dataproc Image	1.3	1.4	1.5	2.0	2.1	2.2	Serverless Image 1.0	Serverless Image 2.0	Serverless Image 2.1	Serverless Image 2.2
spark-3.5-bigquery						✓				✓
spark-3.4-bigquery						✓			✓	✓
spark-3.3-bigquery					✓	✓	✓	✓	✓	✓
spark-3.2-bigquery					✓	✓	✓	✓	✓	✓
spark-3.1-bigquery				✓	✓	✓	✓	✓	✓	✓
spark-2.4-bigquery		✓	✓
spark-bigquery-with-dependencies_2.13								✓	✓	✓
spark-bigquery-with-dependencies_2.12			✓	✓	✓	✓	✓
spark-bigquery-with-dependencies_2.11	✓	✓

Maven / Ivy Package Usage

The connector is also available from the Maven Central repository. It can be used using the --packages option or the spark.jars.packages configuration property. Use the following value

version	Connector Artifact
Spark 3.5	`com.google.cloud.spark:spark-3.5-bigquery:0.37.0`
Spark 3.4	`com.google.cloud.spark:spark-3.4-bigquery:0.37.0`
Spark 3.3	`com.google.cloud.spark:spark-3.3-bigquery:0.37.0`
Spark 3.2	`com.google.cloud.spark:spark-3.2-bigquery:0.37.0`
Spark 3.1	`com.google.cloud.spark:spark-3.1-bigquery:0.37.0`
Spark 2.4	`com.google.cloud.spark:spark-2.4-bigquery:0.37.0`
Scala 2.13	`com.google.cloud.spark:spark-bigquery-with-dependencies_2.13:0.37.0`
Scala 2.12	`com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.37.0`
Scala 2.11	`com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.29.0`

Specifying the Spark BigQuery connector version in a Dataproc cluster

Dataproc clusters created using image 2.1 and above, or batches using the Dataproc serverless service come with built-in Spark BigQuery connector. Using the standard --jars or --packages (or alternatively, the spark.jars/spark.jars.packages configuration) won't help in this case as the built-in connector takes precedence.

To use another version than the built-in one, please do one of the following:

For Dataproc clusters, using image 2.1 and above, add the following flag on cluster creation to upgrade the version --metadata SPARK_BQ_CONNECTOR_VERSION=0.37.0, or --metadata SPARK_BQ_CONNECTOR_URL=gs://spark-lib/bigquery/spark-3.3-bigquery-0.37.0.jar to create the cluster with a different jar. The URL can point to any valid connector JAR for the cluster's Spark version.
For Dataproc serverless batches, add the following property on batch creation to upgrade the version: --properties dataproc.sparkBqConnector.version=0.37.0, or --properties dataproc.sparkBqConnector.uri=gs://spark-lib/bigquery/spark-3.3-bigquery-0.37.0.jar to create the batch with a different jar. The URL can point to any valid connector JAR for the runtime's Spark version.

Hello World Example

You can run a simple PySpark wordcount against the API without compilation by running

Dataproc image 1.5 and above

gcloud dataproc jobs submit pyspark --cluster "$MY_CLUSTER" \
  --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.37.0.jar \
  examples/python/shakespeare.py

Dataproc image 1.4 and below

gcloud dataproc jobs submit pyspark --cluster "$MY_CLUSTER" \
  --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.29.0.jar \
  examples/python/shakespeare.py

Example Codelab

https://codelabs.developers.google.com/codelabs/pyspark-bigquery

Usage

The connector uses the cross language Spark SQL Data Source API:

Reading data from a BigQuery table

df = spark.read \
  .format("bigquery") \
  .load("bigquery-public-data.samples.shakespeare")

or the Scala only implicit API:

import com.google.cloud.spark.bigquery._
val df = spark.read.bigquery("bigquery-public-data.samples.shakespeare")

For more information, see additional code samples in Python, Scala and Java.

Reading data from a BigQuery query

The connector allows you to run any Standard SQL SELECT query on BigQuery and fetch its results directly to a Spark Dataframe. This is easily done as described in the following code sample:

spark.conf.set("viewsEnabled","true")
spark.conf.set("materializationDataset","<dataset>")

sql = """
  SELECT tag, COUNT(*) c
  FROM (
    SELECT SPLIT(tags, '|') tags
    FROM `bigquery-public-data.stackoverflow.posts_questions` a
    WHERE EXTRACT(YEAR FROM creation_date)>=2014
  ), UNNEST(tags) tag
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 10
  """
df = spark.read.format("bigquery").load(sql)
df.show()

Which yields the result

+----------+-------+
|       tag|      c|
+----------+-------+
|javascript|1643617|
|    python|1352904|
|      java|1218220|
|   android| 913638|
|       php| 911806|
|        c#| 905331|
|      html| 769499|
|    jquery| 608071|
|       css| 510343|
|       c++| 458938|
+----------+-------+

A second option is to use the query option like this:

df = spark.read.format("bigquery").option("query", sql).load()

Notice that the execution should be faster as only the result is transmitted over the wire. In a similar fashion the queries can include JOINs more efficiently then running joins on Spark or use other BigQuery features such as subqueries, BigQuery user defined functions, wildcard tables, BigQuery ML and more.

In order to use this feature the following configurations MUST be set:

viewsEnabled must be set to true.
materializationDataset must be set to a dataset where the GCP user has table creation permission. materializationProject is optional.

Note: As mentioned in the BigQuery documentation, the queried tables must be in the same location as the materializationDataset. Also, if the tables in the SQL statement are from projects other than the parentProject then use the fully qualified table name i.e. [project].[dataset].[table].

Important: This feature is implemented by running the query on BigQuery and saving the result into a temporary table, of which Spark will read the results from. This may add additional costs on your BigQuery account.

Reading From Views

The connector has a preliminary support for reading from BigQuery views. Please note there are a few caveats:

BigQuery views are not materialized by default, which means that the connector needs to materialize them before it can read them. This process affects the read performance, even before running any collect() or count() action.
The materialization process can also incur additional costs to your BigQuery bill.
By default, the materialized views are created in the same project and dataset. Those can be configured by the optional materializationProject and materializationDataset options, respectively. These options can also be globally set by calling spark.conf.set(...) before reading the views.
Reading from views is disabled by default. In order to enable it, either set the viewsEnabled option when reading the specific view (.option("viewsEnabled", "true")) or set it globally by calling spark.conf.set("viewsEnabled", "true").
As mentioned in the BigQuery documentation, the materializationDataset should be in same location as the view.

Writing data to BigQuery

Writing DataFrames to BigQuery can be done using two methods: Direct and Indirect.

Direct write using the BigQuery Storage Write API

In this method the data is written directly to BigQuery using the BigQuery Storage Write API. In order to enable this option, please set the writeMethod option to direct, as shown below:

df.write \
  .format("bigquery") \
  .option("writeMethod", "direct") \
  .save("dataset.table")

Writing to existing partitioned tables (date partitioned, ingestion time partitioned and range partitioned) in APPEND save mode and OVERWRITE mode (only date and range partitioned) is fully supported by the connector and the BigQuery Storage Write API. The use of datePartition, partitionField, partitionType, partitionRangeStart, partitionRangeEnd, partitionRangeInterval described below is not supported at this moment by the direct write method.

Important: Please refer to the data ingestion pricing page regarding the BigQuery Storage Write API pricing.

Important: Please use version 0.24.2 and above for direct writes, as previous versions have a bug that may cause a table deletion in certain cases.

Indirect write

In this method the data is written first to GCS, and then it is loaded it to BigQuery. A GCS bucket must be configured to indicate the temporary data location.

df.write \
  .format("bigquery") \
  .option("temporaryGcsBucket","some-bucket") \
  .save("dataset.table")

The data is temporarily stored using the Apache Parquet, Apache ORC or Apache Avro formats.

The GCS bucket and the format can also be set globally using Spark's RuntimeConfig like this:

spark.conf.set("temporaryGcsBucket","some-bucket")
df.write \
  .format("bigquery") \
  .save("dataset.table")

When streaming a DataFrame to BigQuery, each batch is written in the same manner as a non-streaming DataFrame. Note that a HDFS compatible checkpoint location (eg: path/to/HDFS/dir or gs://checkpoint-bucket/checkpointDir) must be specified.

df.writeStream \
  .format("bigquery") \
  .option("temporaryGcsBucket","some-bucket") \
  .option("checkpointLocation", "some-location") \
  .option("table", "dataset.table")

Important: The connector does not configure the GCS connector, in order to avoid conflict with another GCS connector, if exists. In order to use the write capabilities of the connector, please configure the GCS connector on your cluster as explained here.

Properties

The API Supports a number of options to configure the read

Property	Meaning	Usage
`table`	The BigQuery table in the format `[[project:]dataset.]table`. It is recommended to use the `path` parameter of `load()`/`save()` instead. This option has been deprecated and will be removed in a future version. (Deprecated)	Read/Write
`dataset`	The dataset containing the table. This option should be used with standard table and views, but not when loading query results. (Optional unless omitted in `table`)	Read/Write
`project`	The Google Cloud Project ID of the table. This option should be used with standard table and views, but not when loading query results. (Optional. Defaults to the project of the Service Account being used)	Read/Write
`parentProject`	The Google Cloud Project ID of the table to bill for the export. (Optional. Defaults to the project of the Service Account being used)	Read/Write
`maxParallelism`	The maximal number of partitions to split the data into. Actual number may be less if BigQuery deems the data small enough. If there are not enough executors to schedule a reader per partition, some partitions may be empty. Important: The old parameter (`parallelism`) is still supported but in deprecated mode. It will ve removed in version 1.0 of the connector. (Optional. Defaults to the larger of the preferredMinParallelism and 20,000).)	Read
`preferredMinParallelism`	The preferred minimal number of partitions to split the data into. Actual number may be less if BigQuery deems the data small enough. If there are not enough executors to schedule a reader per partition, some partitions may be empty. (Optional. Defaults to the smallest of 3 times the application's default parallelism and maxParallelism.)	Read
`viewsEnabled`	Enables the connector to read from views and not only tables. Please read the relevant section before activating this option. (Optional. Defaults to `false`)	Read
`materializationProject`	The project id where the materialized view is going to be created (Optional. Defaults to view's project id)	Read
`materializationDataset`	The dataset where the materialized view is going to be created. This dataset should be in same location as the view or the queried tables. (Optional. Defaults to view's dataset)	Read
`materializationExpirationTimeInMinutes`	The expiration time of the temporary table holding the materialized data of a view or a query, in minutes. Notice that the connector may re-use the temporary table due to the use of local cache and in order to reduce BigQuery computation, so very low values may cause errors. The value must be a positive integer. (Optional. Defaults to 1440, or 24 hours)	Read
`readDataFormat`	Data Format for reading from BigQuery. Options : `ARROW`, `AVRO` (Optional. Defaults to `ARROW`)	Read
`optimizedEmptyProjection`	The connector uses an optimized empty projection (select without any columns) logic, used for `count()` execution. This logic takes the data directly from the table metadata or performs a much efficient `SELECT COUNT(*) WHERE...` in case there is a filter. You can cancel the use of this logic by setting this option to `false`. (Optional, defaults to `true`)	Read
`pushAllFilters`	If set to `true`, the connector pushes all the filters Spark can delegate to BigQuery Storage API. This reduces amount of data that needs to be sent from BigQuery Storage API servers to Spark clients. This option has been deprecated and will be removed in a future version. (Optional, defaults to `true`) (Deprecated)	Read
`bigQueryJobLabel`	Can be used to add labels to the connector initiated query and load BigQuery jobs. Multiple labels can be set. (Optional)	Read
`bigQueryTableLabel`	Can be used to add labels to the table while writing to a table. Multiple labels can be set. (Optional)	Write
`traceApplicationName`	Application name used to trace BigQuery Storage read and write sessions. Setting the application name is required to set the trace ID on the sessions. (Optional)	Read
`traceJobId`	Job ID used to trace BigQuery Storage read and write sessions. (Optional, defaults to the Dataproc job ID is exists, otherwise uses the Spark application ID)	Read
`createDisposition`	Specifies whether the job is allowed to create new tables. The permitted values are: `CREATE_IF_NEEDED` - Configures the job to create the table if it does not exist. `CREATE_NEVER` - Configures the job to fail if the table does not exist. This option takes place only in case Spark has decided to write data to the table based on the SaveMode. (Optional. Default to CREATE_IF_NEEDED).	Write
`writeMethod`	Controls the method in which the data is written to BigQuery. Available values are `direct` to use the BigQuery Storage Write API and `indirect` which writes the data first to GCS and then triggers a BigQuery load operation. See more here (Optional, defaults to `indirect`)	Write
`writeAtLeastOnce`	Guarantees that data is written to BigQuery at least once. This is a lesser guarantee than exactly once. This is suitable for streaming scenarios in which data is continuously being written in small batches. (Optional. Defaults to `false`) Supported only by the `DIRECT` write method and mode is NOT* `Overwrite`.*	Write
`temporaryGcsBucket`	The GCS bucket that temporarily holds the data before it is loaded to BigQuery. Required unless set in the Spark configuration (`spark.conf.set(...)`). Not supported by the `DIRECT` write method.	Write
`persistentGcsBucket`	The GCS bucket that holds the data before it is loaded to BigQuery. If informed, the data won't be deleted after write data into BigQuery. Not supported by the `DIRECT` write method.	Write
`persistentGcsPath`	The GCS path that holds the data before it is loaded to BigQuery. Used only with `persistentGcsBucket`. Not supported by the `DIRECT` write method.	Write
`intermediateFormat`	The format of the data before it is loaded to BigQuery, values can be either "parquet","orc" or "avro". In order to use the Avro format, the spark-avro package must be added in runtime. (Optional. Defaults to `parquet`). On write only. Supported only for the `INDIRECT` write method.	Write
`useAvroLogicalTypes`	When loading from Avro (`.option("intermediateFormat", "avro")`), BigQuery uses the underlying Avro types instead of the logical types [by default](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#logical_types). Supplying this option converts Avro logical types to their corresponding BigQuery data types. (Optional. Defaults to `false`). On write only.	Write
`datePartition`	The date partition the data is going to be written to. Should be a date string given in the format `YYYYMMDD`. Can be used to overwrite the data of a single partition, like this: `df.write.format("bigquery") .option("datePartition", "20220331") .mode("overwrite") .save("table")` (Optional). On write only. Can also be used with different partition types like: HOUR: `YYYYMMDDHH` MONTH: `YYYYMM` YEAR: `YYYY` Not supported by the `DIRECT` write method.	Write
`partitionField`	If this field is specified, the table is partitioned by this field. For Time partitioning, specify together with the option `partitionType`. For Integer-range partitioning, specify together with the 3 options: `partitionRangeStart`, `partitionRangeEnd, `partitionRangeInterval`. The field must be a top-level TIMESTAMP or DATE field for Time partitioning, or INT64 for Integer-range partitioning. Its mode must be NULLABLE or REQUIRED. If the option is not set for a Time partitioned table, then the table will be partitioned by pseudo column, referenced via either`'_PARTITIONTIME' as TIMESTAMP` type, or `'_PARTITIONDATE' as DATE` type. (Optional). Not supported by the `DIRECT` write method.	Write
`partitionExpirationMs`	Number of milliseconds for which to keep the storage for partitions in the table. The storage in a partition will have an expiration time of its partition time plus this value. (Optional). Not supported by the `DIRECT` write method.	Write
`partitionType`	Used to specify Time partitioning. Supported types are: `HOUR, DAY, MONTH, YEAR` This option is mandatory for a target table to be Time partitioned. (Optional. Defaults to DAY if PartitionField is specified). Not supported by the `DIRECT` write method.	Write
`partitionRangeStart`, `partitionRangeEnd`, `partitionRangeInterval`	Used to specify Integer-range partitioning. These options are mandatory for a target table to be Integer-range partitioned. All 3 options must be specified. Not supported by the `DIRECT` write method.	Write
`clusteredFields`	A string of non-repeated, top level columns seperated by comma. (Optional).	Write
`allowFieldAddition`	Adds the ALLOW_FIELD_ADDITION SchemaUpdateOption to the BigQuery LoadJob. Allowed values are `true` and `false`. (Optional. Default to `false`). Supported only by the `INDIRECT` write method.	Write
`allowFieldRelaxation`	Adds the ALLOW_FIELD_RELAXATION SchemaUpdateOption to the BigQuery LoadJob. Allowed values are `true` and `false`. (Optional. Default to `false`). Supported only by the `INDIRECT` write method.	Write
`proxyAddress`	Address of the proxy server. The proxy must be a HTTP proxy and address should be in the `host:port` format. Can be alternatively set in the Spark configuration (`spark.conf.set(...)`) or in Hadoop Configuration (`fs.gs.proxy.address`). (Optional. Required only if connecting to GCP via proxy.)	Read/Write
`proxyUsername`	The userName used to connect to the proxy. Can be alternatively set in the Spark configuration (`spark.conf.set(...)`) or in Hadoop Configuration (`fs.gs.proxy.username`). (Optional. Required only if connecting to GCP via proxy with authentication.)	Read/Write
`proxyPassword`	The password used to connect to the proxy. Can be alternatively set in the Spark configuration (`spark.conf.set(...)`) or in Hadoop Configuration (`fs.gs.proxy.password`). (Optional. Required only if connecting to GCP via proxy with authentication.)	Read/Write
`httpMaxRetry`	The maximum number of retries for the low-level HTTP requests to BigQuery. Can be alternatively set in the Spark configuration (`spark.conf.set("httpMaxRetry", ...)`) or in Hadoop Configuration (`fs.gs.http.max.retry`). (Optional. Default is 10)	Read/Write
`httpConnectTimeout`	The timeout in milliseconds to establish a connection with BigQuery. Can be alternatively set in the Spark configuration (`spark.conf.set("httpConnectTimeout", ...)`) or in Hadoop Configuration (`fs.gs.http.connect-timeout`). (Optional. Default is 60000 ms. 0 for an infinite timeout, a negative number for 20000)	Read/Write
`httpReadTimeout`	The timeout in milliseconds to read data from an established connection. Can be alternatively set in the Spark configuration (`spark.conf.set("httpReadTimeout", ...)`) or in Hadoop Configuration (`fs.gs.http.read-timeout`). (Optional. Default is 60000 ms. 0 for an infinite timeout, a negative number for 20000)	Read
`arrowCompressionCodec`	Compression codec while reading from a BigQuery table when using Arrow format. Options : `ZSTD (Zstandard compression)`, `LZ4_FRAME (https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md)`, `COMPRESSION_UNSPECIFIED`. The recommended compression codec is `ZSTD` while using Java. (Optional. Defaults to `COMPRESSION_UNSPECIFIED` which means no compression will be used)	Read
`cacheExpirationTimeInMinutes`	The expiration time of the in-memory cache storing query information. To disable caching, set the value to 0. (Optional. Defaults to 15 minutes)	Read
`enableModeCheckForSchemaFields`	Checks the mode of every field in destination schema to be equal to the mode in corresponding source field schema, during DIRECT write. Default value is true i.e., the check is done by default. If set to false the mode check is ignored.	Write
`enableListInference`	Indicates whether to use schema inference specifically when the mode is Parquet (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#parquetoptions). Defaults to false.	Write
`bqChannelPoolSize`	The (fixed) size of the gRPC channel pool created by the BigQueryReadClient. For optimal performance, this should be set to at least the number of cores on the cluster executors.	Read
`createReadSessionTimeoutInSeconds`	The timeout in seconds to create a ReadSession when reading a table. For Extremely large table this value should be increased. (Optional. Defaults to 600 seconds)	Read
`queryJobPriority`	Priority levels set for the job while reading data from BigQuery query. The permitted values are: `BATCH` - Query is queued and started as soon as idle resources are available, usually within a few minutes. If the query hasn't started within 3 hours, its priority is changed to `INTERACTIVE`. `INTERACTIVE` - Query is executed as soon as possible and count towards the concurrent rate limit and the daily rate limit. For WRITE, this option will be effective when DIRECT write is used with OVERWRITE mode, where the connector overwrites the destination table using MERGE statement. (Optional. Defaults to `INTERACTIVE`)	Read/Write
`destinationTableKmsKeyName`	Describes the Cloud KMS encryption key that will be used to protect destination BigQuery table. The BigQuery Service Account associated with your project requires access to this encryption key. for further Information about using CMEK with BigQuery see [here](https://cloud.google.com/bigquery/docs/customer-managed-encryption#key_resource_id). Notice: The table will be encrypted by the key only if it created by the connector. A pre-existing unencrypted table won't be encrypted just by setting this option. (Optional)	Write
`allowMapTypeConversion`	Boolean config to disable conversion from BigQuery records to Spark MapType when the record has two subfields with field names as `key` and `value`. Default value is `true` which allows the conversion. (Optional)	Read
`spark.sql.sources.partitionOverwriteMode`	Config to specify the overwrite mode on write when the table is range/time partitioned. Currently supportd two modes : `STATIC` and `DYNAMIC`. In `STATIC` mode, the entire table is overwritten. In `DYNAMIC` mode, the data is overwritten by partitions of the existing table. The default value is `STATIC`. (Optional)	Write
`enableReadSessionCaching`	Boolean config to disable read session caching. Caches BigQuery read sessions to allow for faster Spark query planning. Default value is `true`. (Optional)	Read
`readSessionCacheDurationMins`	Config to set the read session caching duration in minutes. Only works if `enableReadSessionCaching` is `true` (default). Allows specifying the duration to cache read sessions for. Maximum allowed value is `300`. Default value is `5`. (Optional)	Read
`bigQueryJobTimeoutInMinutes`	Config to set the BigQuery job timeout in minutes. Default value is `360` minutes. (Optional)	Read/Write

Options can also be set outside of the code, using the --conf parameter of spark-submit or --properties parameter of the gcloud dataproc submit spark. In order to use this, prepend the prefix spark.datasource.bigquery. to any of the options, for example spark.conf.set("temporaryGcsBucket", "some-bucket") can also be set as --conf spark.datasource.bigquery.temporaryGcsBucket=some-bucket.

Data types

With the exception of DATETIME and TIME all BigQuery data types directed map into the corresponding Spark SQL data type. Here are all of the mappings:

BigQuery Standard SQL Data Type	Spark SQL Data Type	Notes
`BOOL`	`BooleanType`
`INT64`	`LongType`
`FLOAT64`	`DoubleType`
`NUMERIC`	`DecimalType`	Please refer to Numeric and BigNumeric support
`BIGNUMERIC`	`DecimalType`	Please refer to Numeric and BigNumeric support
`STRING`	`StringType`
`BYTES`	`BinaryType`
`STRUCT`	`StructType`
`ARRAY`	`ArrayType`
`TIMESTAMP`	`TimestampType`
`DATE`	`DateType`
`DATETIME`	`StringType`, `TimestampNTZType`*	Spark has no DATETIME type. Spark string can be written to an existing BQ DATETIME column provided it is in the format for BQ DATETIME literals. * For Spark 3.4+, BQ DATETIME is read as Spark's TimestampNTZ type i.e. java LocalDateTime
`TIME`	`LongType`, `StringType`*	Spark has no TIME type. The generated longs, which indicate microseconds since midnight can be safely cast to TimestampType, but this causes the date to be inferred as the current day. Thus times are left as longs and user can cast if they like. When casting to Timestamp TIME have the same TimeZone issues as DATETIME * Spark string can be written to an existing BQ TIME column provided it is in the format for BQ TIME literals.
`JSON`	`StringType`	Spark has no JSON type. The values are read as String. In order to write JSON back to BigQuery, the following conditions are REQUIRED: Use the `INDIRECT` write method Use the `AVRO` intermediate format The DataFrame field MUST be of type `String` and has an entry of sqlType=JSON in its metadata
`ARRAY<STRUCT<key,value>>`	`MapType`	BigQuery has no MAP type, therefore similar to other conversions like Apache Avro and BigQuery Load jobs, the connector converts a Spark Map to a REPEATED STRUCT<key,value>. This means that while writing and reading of maps is available, running a SQL on BigQuery that uses map semantics is not supported. To refer to the map's values using BigQuery SQL, please check the BigQuery documentation. Due to these incompatibilities, a few restrictions apply: Keys can be Strings only Values can be simple types (not structs) For INDIRECT write, use the `AVRO` intermediate format. DIRECT write is supported as well

Spark ML Data Types Support

The Spark ML Vector and Matrix are supported, including their dense and sparse versions. The data is saved as a BigQuery RECORD. Notice that a suffix is added to the field's description which includes the spark type of the field.

In order to write those types to BigQuery, use the ORC or Avro intermediate format, and have them as column of the Row (i.e. not a field in a struct).

Numeric and BigNumeric support

BigQuery's BigNumeric has a precision of 76.76 (the 77th digit is partial) and scale of 38. Since this precision and scale is beyond spark's DecimalType (38 scale and 38 precision) support, it means that BigNumeric fields with precision larger than 38 cannot be used. Once this Spark limitation will be updated the connector will be updated accordingly.

The Spark Decimal/BigQuery Numeric conversion tries to preserve the parameterization of the type, i.e NUMERIC(10,2) will be converted to Decimal(10,2) and vice versa. Notice however that there are cases where the parameters are lost. This means that the parameters will be reverted to the defaults - NUMERIC (38,9) and BIGNUMERIC(76,38). This means that at the moment, BigNumeric read is supported only from a standard table, but not from BigQuery view or when reading data from a BigQuery query.

Filtering

The connector automatically computes column and pushdown filters the DataFrame's SELECT statement e.g.

spark.read.bigquery("bigquery-public-data:samples.shakespeare")
  .select("word")
  .where("word = 'Hamlet' or word = 'Claudius'")
  .collect()

filters to the column word and pushed down the predicate filter word = 'hamlet' or word = 'Claudius'.

If you do not wish to make multiple read requests to BigQuery, you can cache the DataFrame before filtering e.g.:

val cachedDF = spark.read.bigquery("bigquery-public-data:samples.shakespeare").cache()
val rows = cachedDF.select("word")
  .where("word = 'Hamlet'")
  .collect()
// All of the table was cached and this doesn't require an API call
val otherRows = cachedDF.select("word_count")
  .where("word = 'Romeo'")
  .collect()

You can also manually specify the filter option, which will override automatic pushdown and Spark will do the rest of the filtering in the client.

Partitioned Tables

The pseudo columns _PARTITIONDATE and _PARTITIONTIME are not part of the table schema. Therefore in order to query by the partitions of partitioned tables do not use the where() method shown above. Instead, add a filter option in the following manner:

val df = spark.read.format("bigquery")
  .option("filter", "_PARTITIONDATE > '2019-01-01'")
  ...
  .load(TABLE)

Configuring Partitioning

By default the connector creates one partition per 400MB in the table being read (before filtering). This should roughly correspond to the maximum number of readers supported by the BigQuery Storage API. This can be configured explicitly with the maxParallelism property. BigQuery may limit the number of partitions based on server constraints.

Tagging BigQuery Resources

In order to support tracking the usage of BigQuery resources the connectors offers the following options to tag BigQuery resources:

Adding BigQuery Jobs Labels

The connector can launch BigQuery load and query jobs. Adding labels to the jobs is done in the following manner:

spark.conf.set("bigQueryJobLabel.cost_center", "analytics")
spark.conf.set("bigQueryJobLabel.usage", "nightly_etl")

This will create labels cost_center=analytics and usage=nightly_etl.

Adding BigQuery Storage Trace ID

Used to annotate the read and write sessions. The trace ID is of the format Spark:ApplicationName:JobID. This is an opt-in option, and to use it the user need to set the traceApplicationName property. JobID is auto generated by the Dataproc job ID, with a fallback to the Spark application ID (such as application_1648082975639_0001). The Job ID can be overridden by setting the traceJobId option. Notice that the total length of the trace ID cannot be over 256 characters.

Using in Jupyter Notebooks

The connector can be used in Jupyter notebooks even if it is not installed on the Spark cluster. It can be added as an external jar in using the following code:

Python:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
  .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.37.0") \
  .getOrCreate()
df = spark.read.format("bigquery") \
  .load("dataset.table")

Scala:

val spark = SparkSession.builder
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.37.0")
.getOrCreate()
val df = spark.read.format("bigquery")
.load("dataset.table")

In case Spark cluster is using Scala 2.12 (it's optional for Spark 2.4.x, mandatory in 3.0.x), then the relevant package is com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.37.0. In order to know which Scala version is used, please run the following code:

Python:

spark.sparkContext._jvm.scala.util.Properties.versionString()

Scala:

scala.util.Properties.versionString

Compiling against the connector

Unless you wish to use the implicit Scala API spark.read.bigquery("TABLE_ID"), there is no need to compile against the connector.

To include the connector in your project:

Maven

<dependency>
  <groupId>com.google.cloud.spark</groupId>
  <artifactId>spark-bigquery-with-dependencies_${scala.version}</artifactId>
  <version>0.37.0</version>
</dependency>

SBT

libraryDependencies += "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "0.37.0"

Connector metrics and how to view them

Spark populates a lot of metrics which can be found by the end user in the spark history page. But all these metrics are spark related which are implicitly collected without any change from the connector. But there are few metrics which are populated from the BigQuery and currently are visible in the application logs which can be read in the driver/executor logs.

From Spark 3.2 onwards, spark has provided the API to expose custom metrics in the spark UI page https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/connector/metric/CustomMetric.html

Currently, using this API, connector exposes the following bigquery metrics during read

Metric Name	Description
`bytes read`	number of BigQuery bytes read
`rows read`	number of BigQuery rows read
`scan time`	the amount of time spent between read rows response requested to obtained across all the executors, in milliseconds.
`parse time`	the amount of time spent for parsing the rows read across all the executors, in milliseconds.
`spark time`	the amount of time spent in spark to process the queries (i.e., apart from scanning and parsing), across all the executors, in milliseconds.

Note: To use the metrics in the Spark UI page, you need to make sure the spark-bigquery-metrics-0.37.0.jar is the class path before starting the history-server and the connector version is spark-3.2 or above.

FAQ

What is the Pricing for the Storage API?

See the BigQuery pricing documentation.

I have very few partitions

You can manually set the number of partitions with the maxParallelism property. BigQuery may provide fewer partitions than you ask for. See Configuring Partitioning.

You can also always repartition after reading in Spark.

I get quota exceeded errors while writing

If there are too many partitions the CreateWriteStream or Throughput quotas may be exceeded. This occurs because while the data within each partition is processed serially, independent partitions may be processed in parallel on different nodes within the spark cluster. Generally, to ensure maximum sustained throughput you should file a quota increase request. However, you can also manually reduce the number of partitions being written by calling coalesce on the DataFrame to mitigate this problem.

desiredPartitionCount = 5
dfNew = df.coalesce(desiredPartitionCount)
dfNew.write

A rule of thumb is to have a single partition handle at least 1GB of data.

Also note that a job running with the writeAtLeastOnce property turned on will not encounter CreateWriteStream quota errors.

How do I authenticate outside GCE / Dataproc?

The connector needs an instance of a GoogleCredentials in order to connect to the BigQuery APIs. There are multiple options to provide it:

The default is to load the JSON key from the GOOGLE_APPLICATION_CREDENTIALS environment variable, as described here.
In case the environment variable cannot be changed, the credentials file can be configured as as a spark option. The file should reside on the same path on all the nodes of the cluster.

// Globally
spark.conf.set("credentialsFile", "</path/to/key/file>")
// Per read/Write
spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>")

Credentials can also be provided explicitly, either as a parameter or from Spark runtime configuration. They should be passed in as a base64-encoded string directly.

// Globally
spark.conf.set("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")
// Per read/Write
spark.read.format("bigquery").option("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")

In cases where the user has an internal service providing the Google AccessToken, a custom implementation can be done, creating only the AccessToken and providing its TTL. Token refresh will re-generate a new token. In order to use this, implement the com.google.cloud.bigquery.connector.common.AccessTokenProvider interface. The fully qualified class name of the implementation should be provided in the gcpAccessTokenProvider option. AccessTokenProvider must be implemented in Java or other JVM language such as Scala or Kotlin. It must either have a no-arg constructor or a constructor accepting a single java.util.String argument. This configuration parameter can be supplied using the gcpAccessTokenProviderConfig option. If this is not provided then the no-arg constructor wil be called. The jar containing the implementation should be on the cluster's classpath.

// Globally
spark.conf.set("gcpAccessTokenProvider", "com.example.ExampleAccessTokenProvider")
// Per read/Write
spark.read.format("bigquery").option("gcpAccessTokenProvider", "com.example.ExampleAccessTokenProvider")

Service account impersonation can be configured for a specific username and a group name, or for all users by default using below properties:
- gcpImpersonationServiceAccountForUser_<USER_NAME> (not set by default)
  
  The service account impersonation for a specific user.
- gcpImpersonationServiceAccountForGroup_<GROUP_NAME> (not set by default)
  
  The service account impersonation for a specific group.
- gcpImpersonationServiceAccount (not set by default)
  
  Default service account impersonation for all users.
If any of the above properties are set then the service account specified will be impersonated by generating a short-lived credentials when accessing BigQuery.

If more than one property is set then the service account associated with the username will take precedence over the service account associated with the group name for a matching user and group, which in turn will take precedence over default service account impersonation.
For a simpler application, where access token refresh is not required, another alternative is to pass the access token as the gcpAccessToken configuration option. You can get the access token by running gcloud auth application-default print-access-token.

// Globally
spark.conf.set("gcpAccessToken", "<access-token>")
// Per read/Write
spark.read.format("bigquery").option("gcpAccessToken", "<acccess-token>")

Important: The CredentialsProvider and AccessTokenProvider need to be implemented in Java or other JVM language such as Scala or Kotlin. The jar containing the implementation should be on the cluster's classpath.

Notice: Only one of the above options should be provided.

How do I connect to GCP/BigQuery via Proxy?

To connect to a forward proxy and to authenticate the user credentials, configure the following options.

proxyAddress: Address of the proxy server. The proxy must be an HTTP proxy and address should be in the host:port format.

proxyUsername: The userName used to connect to the proxy.

proxyPassword: The password used to connect to the proxy.

val df = spark.read.format("bigquery")
  .option("proxyAddress", "http://my-proxy:1234")
  .option("proxyUsername", "my-username")
  .option("proxyPassword", "my-password")
  .load("some-table")

The same proxy parameters can also be set globally using Spark's RuntimeConfig like this:

spark.conf.set("proxyAddress", "http://my-proxy:1234")
spark.conf.set("proxyUsername", "my-username")
spark.conf.set("proxyPassword", "my-password")

val df = spark.read.format("bigquery")
  .load("some-table")

You can set the following in the hadoop configuration as well.

fs.gs.proxy.address(similar to "proxyAddress"), fs.gs.proxy.username(similar to "proxyUsername") and fs.gs.proxy.password(similar to "proxyPassword").

If the same parameter is set at multiple places the order of priority is as follows:

option("key", "value") > spark.conf > hadoop configuration

spark-bigquery-connector's People

Contributors

Stargazers

Watchers

Forkers

cyxxy pmkc kmjung darshanrd kjmrknsn toldervoll xjrk58 alfonsorr esobolievv aryann achelimed aniket486 functicons gauravn7 obviouslyai mrcsparker guitcastro azemabaptiste eliasah asm0dey siddharth1001 minsu-daniel-kim davidrabinowitz vprus mavencode mayanksingh09 tanishgupta1 ohaionm datadrivers subnader chriskuchar giomerlin skuehn ajitgogul tfayyaz raangs mngeow-msa mayurdb bkvarda vamosraghava stephaniewang526 diggerk gaurangi94 isflwrs j450h1 leonardo-jas betoleal gopinath678 yuvalmedina leoneuwald emkornfield varundhussa afghifariii changpingc mingjialiu sevakavet aminesagaama balvant01 mohamed-a-abdelaziz cpcgoogle jbergbrede jiangmichaellll raymond-lu6 isabella232 wangso bskim45 esert-g rubik-ai tensorady kenegozi rehman04 ranu010101 aag-peace deepaknettem himanshukohli09 maasg chie8842 sandeep-katta0102 jiening123 surgachsurgach dee-pac youngwookim nicholas-fwang glthu shrinivas-io vpipkt prabowst codeforcontribute whirldata trickster vvaks0 muthugit carlosmarin hsuyuming vinaylondhe14 jake8591789 sobelek madhax hakchealkim khileshchauhan

spark-bigquery-connector's Issues

Reading BigQuery table in PySpark outside GCP

My server runs on AWS.
I follow the instructions here and this tutorial script, but I get Py4JJavaError caused by missing project ID

Caused by: java.lang.IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment.  Please set a project ID using the builder.
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.ServiceOptions.<init>(ServiceOptions.java:266)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.<init>(BigQueryOptions.java:81)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.<init>(BigQueryOptions.java:30)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions$Builder.build(BigQueryOptions.java:76)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.getDefaultInstance(BigQueryOptions.java:136)
	at com.google.cloud.spark.bigquery.BigQueryRelationProvider$.$lessinit$greater$default$2(BigQueryRelationProvider.scala:30)
	at com.google.cloud.spark.bigquery.BigQueryRelationProvider.<init>(BigQueryRelationProvider.scala:40)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.Class.newInstance(Class.java:442)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
	... 24 more

My python script looks like this:

from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName('bq')
    .master('local[4]')
    .config('spark.jars', '/path/to/bigquery_spark-bigquery-latest.jar')
    .getOrCreate()
)
spark.conf.set("credentialsFile", "/path/to/credentials.json")

df = (
    spark.read
    .format('bigquery')
    .option('project', 'myProject')
    .option('table', 'myTable')
    .load()
)

Any idea how could I fix the missing project ID error?

SparkContext.defaultParallelism is not great with dynamic allocation

Google Cloud Dataproc uses Spark's dynamic allocation by default which causes very few partitions to be requested relative to the size of the cluster.

In contrast the MapReduce BigQuery connector and Reading from GCS create one partition per ~128MB.

I did not write the Spark connector to do this, because it did not behave well with the initial sharding strategy. With the balanced Sharding strategy we should switch to that model

Publish shaded profile to maven central

IMO end users should never use the unshaded dependency and it's not available in maven central:
https://search.maven.org/artifact/com.google.cloud.spark/spark-bigquery_2.11/0.8.0-beta/jar.

Not sure how to do that with SBT, but we really should.

.option("filter", filter) will overwrite the pushdown filters

Currently, if we run

spark.read.format("bigquery").option("fitler", "_PARTITIONDATE > '2019-01-01'")....createOrReplaceTempView("t")
spark.sql("select * from t where a =1").show()

Then only the partition column filter is pushed down.
Have you considered combining both predicates with "AND"?

Add write support

This is a major feature request.

There is not currently Write support in the BigQuery Storage API. However the old insert rows request or loading via GCS could be used.

Can not run outside GCP (for example: AWS EMR or on Local)

Hi,

I got this error when running outside GCP.

Versions used: 0.6.0-beta and 0.7.0-beta

2019-07-04 09:55:32 INFO  ComputeEngineCredentials:202 - Failed to detect whether we are running on Google Compute Engine.
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
        at com.google.api.gax.retrying.BasicRetryingFuture.<init>(BasicRetryingFuture.java:84)
        at com.google.api.gax.retrying.DirectRetryingExecutor.createFuture(DirectRetryingExecutor.java:88)
        at com.google.api.gax.retrying.DirectRetryingExecutor.createFuture(DirectRetryingExecutor.java:74)
        at com.google.cloud.RetryHelper.run(RetryHelper.java:75)
        at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
        at com.google.cloud.bigquery.BigQueryImpl.getTable(BigQueryImpl.java:471)
        at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelationInternal(BigQueryRelationProvider.scala:55)
        at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:37)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
        at com.company.package.MainClass$$anonfun$main$1.apply(Visits.scala:90)
        at com.company.package.MainClass$$anonfun$main$1.apply(Visits.scala:69)
        at scala.Option.foreach(Option.scala:257)
        at com.company.package.MainClass$.main(Visits.scala:69)
        at com.company.package.MainClass.main(Visits.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

My main class looks like:

spark.conf.set("credentials", base64String)

val data = spark.read
          .format("bigquery")
          .option("credentials", cred) // set again the cred to be sure
          .option("parentProject", "myProjectId")
          .option("project", myProjectId)
          .option("dataset", myDatasetId)
          .option("table", "myProjectId:myDatasetId.ga_sessions_YYYMMdd")
          .load()

my SBT file looks like:

ThisBuild / scalaVersion := "2.11.11"

lazy val hephaestus = (project in file("."))
  .settings(
    name := appName,
    version := appVersion,
    libraryDependencies ++= Seq(
      "org.apache.spark"       %% "spark-sql"      % "2.3.0" % "provided",
      "com.amazonaws"          % "aws-java-sdk"    % "1.7.4" % "provided",
      "org.apache.hadoop"      % "hadoop-aws"      % "2.7.1" % "provided",
      "com.google.cloud.spark" %% "spark-bigquery" % "0.7.0-beta"
    )
  )

assemblyJarName in assembly := s"$appName-$appVersion.jar"

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.guava.**"        -> "repackaged.com.google.guava.@1").inAll,
  ShadeRule.rename("com.google.common.guava.**" -> "repackaged.com.google.common.guava.@1").inAll,
  ShadeRule.rename("com.google.protobuf.**"     -> "repackaged.com.google.protobuf.@1").inAll
)

Deployment on AWS EMR v5.13:
export GOOGLE_CLOUD_PROJECT=myProjectId

spark-submit \
    --class package.MainClass \
    --master "yarn" \
    --deploy-mode "client" \
    --conf "spark.hadoop.fs.s3a.access.key=********************" \
    --conf "spark.hadoop.fs.s3a.secret.key=*********************" \
    --conf "spark.hadoop.fs.s3a.endpoint=s3.eu-west-1.amazonaws.com" \
    --conf "spark.yarn.appMasterEnv.GOOGLE_CLOUD_PROJECT=myProjectId" \
    application-version.jar

Possibility to change grpc message size for a client

Sometimes we got following error in this connector:

io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: Sent message larger than max (10485777 vs. 10485760)

It would be good if library client will able to change grpc message size.
For now message size is standard - 10 * 1024 * 1024

Provide doc on making the connector available in Jupyter

This was asked on SO.

Loading a table with regions other than US will not load.

Although loading a table from the US server will work fine, loading a table thats from outside US will return an error (Mine is currently located on asia-east1) that shows

io.grpc.StatusRuntimeException: UNAVAILABLE: Policy checks are unavailable.
com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: Policy checks are unavailable.
	at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:69)
	at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
	at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
	at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
	at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)

Can't build a jar file

[INFO] --- maven-assembly-plugin:2.5.5:single (make-assembly) @ lego-cloud ---
[INFO] Failure detected.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.634 s
[INFO] Finished at: 2019-09-14T15:17:44+03:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.5.5:single (make-assembly) on project lego-cloud: Failed to create assembly: Unable to resolve dependencies for assembly 'jar-with-dependencies': Failed to resolve dependencies for assembly: The artifact has no valid ranges
[ERROR] io.grpc:grpc-core:jar:1.21.0
[ERROR]
[ERROR] Path to dependency:
[ERROR] 1) com.openx.lego:lego-cloud:jar:1.3-SNAPSHOT
[ERROR] 2) com.google.cloud.spark:spark-bigquery_2.11:jar:0.8.0-beta
[ERROR] 3) com.google.cloud:google-cloud-bigquerystorage:jar:0.98.0-beta
[ERROR] 4) com.google.api:gax-grpc:jar:1.46.1
[ERROR] 5) io.grpc:grpc-alts:jar:1.21.0
[ERROR] -> [Help 1]

Support specifying auth through Spark Configuration

DynamicAllocation is ignored when I use the connector.

If I don't use the bigquery connector, then my project will DynamicAllocate to make use of the cluster automatically.
Once I add the call to the bq connector, the job is no longer dynamic and uses the specified spark.executor.cores, etc. I verified that the configs in the environment are identical and have dynamicAllocation enabled.

can not read from BQ table

I am trying to read data from BigQuery in my dataproc spark job
using below code

val df_bpp: DataFrame = spark.read.bigquery("mintreporting.TEST_DS.spr_sample") df_bpp.printSchema(); df_bpp.show(10);

The schema for my table is getting printed properly but the code fails while executing the last line (show()). Below is the stack trace.

Any kind of help is highly appreciated.

9/07/02 08:02:47 ERROR io.grpc.internal.ManagedChannelImpl: [Channel<1>: (bigquerystorage.googleapis.com:443)] Uncaught exception in the SynchronizationContext. Panic!
java.lang.NoSuchMethodError: io.grpc.internal.ClientTransportFactory$ClientTransportOptions.getProxyParameters()Lio/grpc/internal/ProxyParameters;
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder$NettyTransportFactory.newClientTransport(NettyChannelBuilder.java:542)
at io.grpc.internal.CallCredentialsApplyingTransportFactory.newClientTransport(CallCredentialsApplyingTransportFactory.java:48)
at io.grpc.internal.InternalSubchannel.startNewTransport(InternalSubchannel.java:263)
at io.grpc.internal.InternalSubchannel.obtainActiveTransport(InternalSubchannel.java:216)
at io.grpc.internal.ManagedChannelImpl$SubchannelImpl.requestConnection(ManagedChannelImpl.java:1452)
at io.grpc.internal.PickFirstLoadBalancer.handleResolvedAddressGroups(PickFirstLoadBalancer.java:59)
at io.grpc.internal.AutoConfiguredLoadBalancerFactory$AutoConfiguredLoadBalancer.handleResolvedAddressGroups(AutoConfiguredLoadBalancerFactory.java:148)
at io.grpc.internal.ManagedChannelImpl$NameResolverListenerImpl$1NamesResolved.run(ManagedChannelImpl.java:1326)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:101)
at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:130)
at io.grpc.internal.ManagedChannelImpl$NameResolverListenerImpl.onAddresses(ManagedChannelImpl.java:1331)
at io.grpc.internal.DnsNameResolver$Resolve.resolveInternal(DnsNameResolver.java:318)
at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:220)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" com.google.api.gax.rpc.InternalException: io.grpc.StatusRuntimeException: INTERNAL: Panic! This is a bug!
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:67)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
at repackaged.com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1070)
at repackaged.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at repackaged.com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1139)
at repackaged.com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958)
at repackaged.com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:699)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
at com.google.cloud.bigquery.storage.v1beta1.BigQueryStorageClient.createReadSession(BigQueryStorageClient.java:237)
at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation.buildScan(DirectBigQueryRelation.scala:84)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:326)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:325)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:403)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:321)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:289)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3359)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:745)
at org.apache.spark.sql.Dataset.show(Dataset.scala:704)
at transformations.SPRBarcPPJob$.flattenSPRBARC(SPRBarcPPJob.scala:42)
at LoadData$.main(LoadData.scala:34)
at LoadData.main(LoadData.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: io.grpc.StatusRuntimeException: INTERNAL: Panic! This is a bug!
at io.grpc.Status.asRuntimeException(Status.java:532)
... 23 more
Caused by: java.lang.NoSuchMethodError: io.grpc.internal.ClientTransportFactory$ClientTransportOptions.getProxyParameters()Lio/grpc/internal/ProxyParameters;
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder$NettyTransportFactory.newClientTransport(NettyChannelBuilder.java:542)
at io.grpc.internal.CallCredentialsApplyingTransportFactory.newClientTransport(CallCredentialsApplyingTransportFactory.java:48)
at io.grpc.internal.InternalSubchannel.startNewTransport(InternalSubchannel.java:263)
at io.grpc.internal.InternalSubchannel.obtainActiveTransport(InternalSubchannel.java:216)
at io.grpc.internal.ManagedChannelImpl$SubchannelImpl.requestConnection(ManagedChannelImpl.java:1452)
at io.grpc.internal.PickFirstLoadBalancer.handleResolvedAddressGroups(PickFirstLoadBalancer.java:59)
at io.grpc.internal.AutoConfiguredLoadBalancerFactory$AutoConfiguredLoadBalancer.handleResolvedAddressGroups(AutoConfiguredLoadBalancerFactory.java:148)
at io.grpc.internal.ManagedChannelImpl$NameResolverListenerImpl$1NamesResolved.run(ManagedChannelImpl.java:1326)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:101)
at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:130)
at io.grpc.internal.ManagedChannelImpl$NameResolverListenerImpl.onAddresses(ManagedChannelImpl.java:1331)
at io.grpc.internal.DnsNameResolver$Resolve.resolveInternal(DnsNameResolver.java:318)
at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:220)
... 3 more
19/07/02 08:02:47 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@6fca5907{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [920e05ddb57643fbbc6bf980e7b351bb] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://uat-mint-dataproc/google-cloud-dataproc-metainfo/9d999556-1033-457c-a954-05c6c4b3dcaf/jobs/920e05ddb57643fbbc6bf980e7b351bb/driveroutput'.

Failed to find data source: bigquery.

Hi, I am using this connector to read data from GCP BigQuery. And I have a problem. When I run my code locally from my IntelliJ it works good. But when I submit my code to google cloud using gcloud dataproc jobs submit spark I got an exception:

19/09/01 09:42:34 ERROR org.apache.spark.deploy.yarn.Client: Application diagnostics message: Application application_1565982347302_17757 failed 2 times due to AM Container for appattempt_1565982347302_17757_000002 exited with  exitCode: 13
Failing this attempt.Diagnostics: [2019-09-01 09:42:33.971]Exception from container-launch.
Container id: container_e01_1565982347302_17757_02_000001
Exit code: 13

[2019-09-01 09:42:33.972]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
rce
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
        at scala.util.Try.orElse(Try.scala:84)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
        ... 9 more
19/09/01 09:42:32 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Exception thrown in awaitResult: 
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
        at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
        at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
        at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
        at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
        at BigQueryTest.GCPSparkBigQueryConnector(BigQueryTest.java:19)
        at BigQueryTest.main(BigQueryTest.java:15)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
Caused by: java.lang.ClassNotFoundException: bigquery.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
        at scala.util.Try.orElse(Try.scala:84)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
        ... 9 more
19/09/01 09:42:33 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@14abe743{HTTP/1.1,[http/1.1]}{0.0.0.0:0}


[2019-09-01 09:42:33.974]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
rce
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
        at scala.util.Try.orElse(Try.scala:84)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
        ... 9 more
19/09/01 09:42:32 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Exception thrown in awaitResult: 
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
        at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
        at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
        at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
        at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
        at BigQueryTest.GCPSparkBigQueryConnector(BigQueryTest.java:19)
        at BigQueryTest.main(BigQueryTest.java:15)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
Caused by: java.lang.ClassNotFoundException: bigquery.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
        at scala.util.Try.orElse(Try.scala:84)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
        ... 9 more
19/09/01 09:42:33 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@14abe743{HTTP/1.1,[http/1.1]}{0.0.0.0:0}


For more detailed output, check the application tracking page: http://ox-data-compute-grid-us-central1-m-0:8188/applicationhistory/app/application_1565982347302_17757 Then click on links to logs of each attempt.
. Failing the application.
Exception in thread "main" org.apache.spark.SparkException: Application application_1565982347302_17757 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1149)
        at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [7679a153384e47a39a74ad6eac71ee28] failed with error:
Job failed with message [Exception in thread "main" org.apache.spark.SparkException: Application application_1565982347302_17757 finished with failed status]. Additional details can be found in 'gs://ox-data-qa-us-central1-dataproc-staging/google-cloud-dataproc-metainfo/9c47905a-9f26-4aed-b653-29bd36d079c6/jobs/7679a153384e47a39a74ad6eac71ee28/driveroutput'.

I got this error for both cases:

Dataset<Row> df = sparkSession.read().format("bigquery").option("table", fullyQualifiedInputTableId).load();
df.show();

and

Dataset<Row> dfSelect = package$.MODULE$.BigQueryDataFrameReader(sparkSession.read()).bigquery(fullyQualifiedInputTableId).select("*").limit(10).cache();
dfSelect.show();

Here is my pom.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.3</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud.spark</groupId>
            <artifactId>spark-bigquery_2.11</artifactId>
            <version>0.8.0-beta</version>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>28.1-jre</version>
        </dependency>
            <dependency>
                <groupId>com.spotify</groupId>
                <artifactId>spark-bigquery_2.11</artifactId>
                <version>0.2.2</version>
            </dependency>
        <dependency>
            <groupId>com.google.cloud.bigdataoss</groupId>
            <artifactId>bigquery-connector</artifactId>
            <version>hadoop3-1.0.0</version>
        </dependency>
    </dependencies>

    <groupId>big.query.test</groupId>
    <artifactId>big.query.test</artifactId>
    <version>1.0-SNAPSHOT</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
            <plugin>
                <!-- Build an executable JAR -->
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <classpathPrefix>lib/</classpathPrefix>
                            <mainClass>BigQueryTest</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>BigQueryTest</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id> <!-- this is used for inheritance merges -->
                        <phase>package</phase> <!-- bind to the packaging phase -->
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Filter pushdown fails for dates specified as strings

Feedback from a customer on the connector. This may be an issue with the API or the connector, unsure yet.

The connector doesn't behave as we'd expect it to when working with conditions on date types, see below. Seems that's easy to fix, please see if that can be included into next versions.The problem with date types is the following.
If we use the following code to get necessary partitions from a partitioned table we actually get a full scan over all partitions because the condition is not pushed down:
val df = spark.read.bigquery(TABLE).load()where(col("date_col") >= "2019-01-01")
If we transform the literal to the date type explicitly it works fine:
val df = spark.read.bigquery(TABLE).load().where(col("date_col") >= to_date(lit("2019-01-01")))
Also it works fine if the condition is provided as an option to the reader
val df = spark.read.bigquery(TABLE).option("filter", "data_col >= '2019-01-01'").load()

GoogleJsonResponseException: 401 Unauthorized

I am trying to save a dataframe from a spark job running on an AWS EMR cluster to BigQuery, but am recieving a 401. What is wrong with my spark code? Could it be my service does not have overwrite permissions? This service has the role BigQuery Admin

Also, it would be nicer if we could supply an object directly inside the option configurations instead of base64 or a json file. I am having trouble using the JSON file while my cluster is running because files are not the same across driver and worker nodes. base64 is also not working for me due to some unknown encoding issue.... Being able to directly set these variables would make creating an EMR/Spark + BigQuery workload alot easier. Thank you for any and all help.

credentials.json format

{
  "type": "service_account",
  "project_id": "<MY_PROJECT_NAME>",
  "private_key_id": "<PRIVATE_KEY_ID>",
  "private_key": "-----BEGIN PRIVATE KEY-----<LONG LIST OF CHARS>-----END PRIVATE KEY-----\n",
  "client_email": "[email protected]",
  "client_id": "<CLIENT_ID>",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/<service>%40<project>.iam.gserviceaccount.com"
}

Spark Script

SPARK_CONTEXT.addFile("/home/hadoop/credentials.json")
SPARK_CONTEXT._jsc \
    .hadoopConfiguration() \
    .set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
SPARK_CONTEXT._jsc \
    .hadoopConfiguration() \
    .set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
SPARK_CONTEXT._jsc \
    .hadoopConfiguration() \
    .set("fs.gs.project.id", "<MY_GCLOUD_PROJECT_ID>")
SPARK_CONTEXT._jsc \
    .hadoopConfiguration() \
    .set("google.cloud.auth.service.account.json.keyfile", SparkFiles.get("credentials.json"))


def main():
    #  ... init dataframe ...
    df \
        .write \
        .format("bigquery") \
        .option("temporaryGcsBucket", "emr_spark") \
        .option("project", "<MY_GCLOUD_PROJECT_ID>") \
        .option("parentProject", "<MY_GCLOUD_PROJECT_ID>") \
        .option("table", "<MY_GCLOUD_PROJECT_ID>:dataset.table") \
        .mode("overwrite") \
        .save()
    return

Error:

19/12/10 18:25:33 INFO CodecConfig: Compression: SNAPPY
19/12/10 18:25:33 INFO CodecConfig: Compression: SNAPPY
19/12/10 18:25:33 INFO ParquetOutputFormat: Parquet block size to 134217728
19/12/10 18:25:33 INFO ParquetOutputFormat: Parquet page size to 1048576
19/12/10 18:25:33 INFO ParquetOutputFormat: Parquet dictionary page size to 1048576
19/12/10 18:25:33 INFO ParquetOutputFormat: Dictionary is on
19/12/10 18:25:33 INFO ParquetOutputFormat: Validation is off
19/12/10 18:25:33 INFO ParquetOutputFormat: Writer version is: PARQUET_1_0
19/12/10 18:25:33 INFO ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
19/12/10 18:25:33 INFO ParquetOutputFormat: Page size checking is: estimated
19/12/10 18:25:33 INFO ParquetOutputFormat: Min row count for page size check is: 100
19/12/10 18:25:33 INFO ParquetOutputFormat: Max row count for page size check is: 10000
19/12/10 18:25:34 INFO ParquetWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "id",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  } ...(other fields)...]
}
and corresponding Parquet message type:
message spark_schema {
  optional binary id (UTF8);
  ...(other fields)...
}


19/12/10 18:25:34 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_20191210182527_0004_m_000048_108
19/12/10 18:25:34 INFO Executor: Finished task 48.0 in stage 4.0 (TID 108). 4769 bytes result sent to driver
19/12/10 18:25:34 INFO TaskSetManager: Finished task 48.0 in stage 4.0 (TID 108) in 255 ms on localhost (executor driver) (47/50)
19/12/10 18:25:34 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_20191210182527_0004_m_000049_109
19/12/10 18:25:34 INFO Executor: Finished task 49.0 in stage 4.0 (TID 109). 4769 bytes result sent to driver
19/12/10 18:25:34 INFO TaskSetManager: Finished task 49.0 in stage 4.0 (TID 109) in 243 ms on localhost (executor driver) (48/50)
19/12/10 18:25:34 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_20191210182527_0004_m_000027_110
19/12/10 18:25:34 INFO Executor: Finished task 27.0 in stage 4.0 (TID 110). 4769 bytes result sent to driver
19/12/10 18:25:34 INFO TaskSetManager: Finished task 27.0 in stage 4.0 (TID 110) in 247 ms on localhost (executor driver) (49/50)
19/12/10 18:25:34 INFO InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 9511
19/12/10 18:25:36 INFO FileOutputCommitter: Saved output of task 'attempt_20191210182527_0004_m_000039_111' to gs://emr_spark/.spark-bigquery-local-1576002318252-36378c2a-2355-4ecf-97ce-943287f00af4/_temporary/0/task_20191210182527_0004_m_000039
19/12/10 18:25:36 INFO SparkHadoopMapRedUtil: attempt_20191210182527_0004_m_000039_111: Committed
19/12/10 18:25:36 INFO Executor: Finished task 39.0 in stage 4.0 (TID 111). 4941 bytes result sent to driver
19/12/10 18:25:36 INFO TaskSetManager: Finished task 39.0 in stage 4.0 (TID 111) in 2836 ms on localhost (executor driver) (50/50)
19/12/10 18:25:36 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
19/12/10 18:25:36 INFO DAGScheduler: ResultStage 4 (save at BigQueryWriteHelper.scala:65) finished in 7.592 s
19/12/10 18:25:36 INFO DAGScheduler: Job 0 finished: save at BigQueryWriteHelper.scala:65, took 9.741856 s
19/12/10 18:25:41 INFO FileFormatWriter: Write Job 52a4641c-d06d-4adc-9a51-ce52fd17ef4c committed.
19/12/10 18:25:41 INFO FileFormatWriter: Finished processing stats for write job 52a4641c-d06d-4adc-9a51-ce52fd17ef4c.
Traceback (most recent call last):
  File "/home/hadoop/test.py", line 179, in <module>
    main()
  File "/home/hadoop/test.py", line 71, in main
    .mode("append") \
  File "/home/hadoop/conda/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 736, in save
  File "/home/hadoop/conda/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/hadoop/conda/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/home/hadoop/conda/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o138.save.
: java.lang.RuntimeException: Failed to write to BigQuery
	at com.google.cloud.spark.bigquery.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.scala:60)
	at com.google.cloud.spark.bigquery.BigQueryInsertableRelation.insert(BigQueryInsertableRelation.scala:44)
	at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:83)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: 401 Unauthorized
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:106)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.create(HttpBigQueryRpc.java:206)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$5.call(BigQueryImpl.java:319)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$5.call(BigQueryImpl.java:316)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl.create(BigQueryImpl.java:315)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl.create(BigQueryImpl.java:290)
	at com.google.cloud.spark.bigquery.BigQueryWriteHelper.loadDataToBigQuery(BigQueryWriteHelper.scala:90)
	at com.google.cloud.spark.bigquery.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.scala:66)
	... 33 more
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException: 401 Unauthorized
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:443)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1092)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:541)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:474)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:591)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.create(HttpBigQueryRpc.java:204)
	... 42 more

Clean up documentation

It's currently all in the top level README and a little rough

Support for catalog

Is there an API to read list of tables in a bigquery dataset (or in all datasets in a bigquery project) using the spark bigquery connector?

grpc getting a "Received unexpected EOS on DATA frame from server." error

I'm trying to load a table with around 3 billion rows (311 GB) in pyspark like this:

df = spark.read.format('bigquery').option('table', table_name).load()

But I'm getting this "unexpected EOS" error when I try to load it, e.g.

df.count()

fwiw, I've been able to intermittently operate on a subset of the data (maybe 10-100 million rows, or so).
Also, I'm happy to move this somewhere else if this isn't the right place for it. :)

Here's the full stacktrace I'm getting:

Py4JJavaError: An error occurred while calling o1078.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 371 in stage 162.0 failed 4 times, most recent failure: Lost task 371.3 in stage 162.0 (TID 37179, cluster-name-w-7.c.gcloud-project.internal, executor 15): com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.InternalException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INTERNAL: Received unexpected EOS on DATA frame from server.
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:67)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.ExceptionResponseObserver.onErrorImpl(ExceptionResponseObserver.java:82)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.StateCheckingResponseObserver.onError(StateCheckingResponseObserver.java:86)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcDirectStreamController$ResponseObserverAdapter.onClose(GrpcDirectStreamController.java:149)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:399)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:510)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:66)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:630)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:518)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:692)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:681)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: java.lang.RuntimeException: Asynchronous task failed
		at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ServerStreamIterator.hasNext(ServerStreamIterator.java:105)
		at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
		at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
		at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
		at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
		at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
		at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
		at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
		at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
		at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
		at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
		at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
		at org.apache.spark.scheduler.Task.run(Task.scala:121)
		at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
		at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
		at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
		... 3 more
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INTERNAL: Received unexpected EOS on DATA frame from server.
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:533)
	... 24 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2111)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2049)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2830)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2829)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
	at org.apache.spark.sql.Dataset.count(Dataset.scala:2829)
	at sun.reflect.GeneratedMethodAccessor139.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Version 0.9.1-beta is messing from maven-central

Version 0.9.1-beta is messing from the central maven repository. This is due to Sonatype's maintenance, see https://status.maven.org/incidents/byvqpl9v3g1c?u=c7j791s2w6lk

The jars are available at

gs://spark-lib/bigquery/spark-bigquery-latest.jar
gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar

RESOURCE_EXHAUSTED when loading data from BigQuery

I'm trying to load a table with 2.6 billion rows (403 GB) into a Spark Dataset, but I get a StatusRuntimeException with message "RESOURCE_EXHAUSTED: Terminate due to throttling factor over 100" (full stacktrace below). The same code succeeds when running on a sample table with 1 million rows. I'm running on Dataproc version="preview" with n1-highmem-16 master and 14 n1-highmem-8 workers (non-preemtible). Loading code:

    val df = spark.read
      .format("bigquery")
      .option("table", trainingOpts.viewsTableName)
      .load()
    df.createOrReplaceTempView("views")
    val sqlDf = spark.sql("SELECT * FROM views WHERE  length(contentId) > 0 AND date >= '20170101' AND isInUniverse")

contentId and date are strings, isInUniverse is boolean.

Full stacktrace:

2019-03-14 19:22:58 WARN  TaskSetManager:66 - Lost task 55.0 in stage 11.0 (TID 6431, training-tv-thotest-w-10.c.nrk-recommendations.internal, executor 9): com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ResourceExhaustedException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: Terminate due to throttling factor over 100.
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:57)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.ExceptionResponseObserver.onErrorImpl(ExceptionResponseObserver.java:82)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.StateCheckingResponseObserver.onError(StateCheckingResponseObserver.java:86)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcDirectStreamController$ResponseObserverAdapter.onClose(GrpcDirectStreamController.java:149)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: java.lang.RuntimeException: Asynchronous task failed
		at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ServerStreamIterator.hasNext(ServerStreamIterator.java:105)
		at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
		at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
		at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:834)
		at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
		at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
		at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
		at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
		at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
		at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
		at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
		at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
		at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
		at org.apache.spark.scheduler.Task.run(Task.scala:121)
		at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
		at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
		at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
		... 3 more
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: Terminate due to throttling factor over 100.
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:530)
	... 23 more

Update: The job succeeded in the end. Loading the data took 19 minutes. For comparison, loading the same data from CSV files exported from BigQuery (which is what we do in production) takes 16 minutes. Using spark-bigquery-connector will save us the time-consuming export-to-csv step, so this is promising stuff. But the RESOURCE_EXHAUSTED message is still a bit disconcerting.

Not able to write to bigquery using connector

I have created a process to read from cloud mysql and write to big query which is supposed to run on dataproc. I am testing it using Jupyter notebook on dataproc.

I have provided bigquery editor and owner access to service account being used.

on running the below command:
result_df.write.format("bigquery").mode("append").option("table", "XXX.xxx").save()

i am getting error:
Py4JJavaError: An error occurred while calling o178.save.
: java.lang.RuntimeException: com.google.cloud.spark.bigquery.BigQueryRelationProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:554)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:278)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Help!

Best way to authnenticate from AWS EMR?

I am attempting to run a spark script that saves a dataframe to BigQuery from an AWS EMR cluster. I am having a hard time using credentials and credentialsFile to authenticate however.

Issue
When I authenticate using credentialsFile I get a FileNotFound error. Currently I have my credentials.json file for my service stored in s3. When my cluster starts, I download the file from s3, and use SPARK_CONTEXT.addFile("/home/hadoop/credentials.json") to add the file (ideally) to the master and worker nodes. Afterwards I set google.cloud.auth.service.account.json.keyfile using SparkFiles.get("credentials.json") to get the file's path.

This is not working however. I am able to set the path, but when I run my job I see the following error. What is the best/easiest way to authenticate from an AWS EMR cluster. Nothing I've tried seems to work...the encoding feature results in an encoding error, and there is not way to directly supply the information from my credentials file manually into the options while calling df.write

Job aborted due to stage failure: Task 2 in stage 4.0 failed 4 times, most recent failure: Lost task 2.3 in stage 4.0 (TID 92, ip-172-31-9-151.us-west-2.compute.internal, executor 1): java.io.FileNotFoundException: /mnt/tmp/spark-c1e9683c-0c78-43ba-a0b6-0ba787e5b8e2/userFiles-bf45542e-d159-4969-9fec-149d3f59cf03/credentials.json (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at java.io.FileInputStream.<init>(FileInputStream.java:93)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromJsonKeyFile(CredentialFactory.java:280)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:126)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getCredential(GoogleHadoopFileSystemBase.java:1508)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createGcsFs(GoogleHadoopFileSystemBase.java:1593)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1554)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:553)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:506)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2859)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:123)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:103)
	at org.apache.parquet.hadoop.ParquetOutputCommitter.<init>(ParquetOutputCommitter.java:43)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getOutputCommitter(ParquetOutputFormat.java:442)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupCommitter(HadoopMapReduceCommitProtocol.scala:100)
	at org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol.setupCommitter(SQLHadoopMapReduceCommitProtocol.scala:40)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupTask(HadoopMapReduceCommitProtocol.scala:217)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:229)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:

Missing maven dependencies when using --packages and ClassNotFound when using --jars

Hi,

I want to play a little bit with the BigQuery connector (on AWS EMR version 5.24.1 with Spark 2.4.2) and run this command: pyspark --packages com.google.cloud.spark:spark-bigquery_2.11:0.9.1-beta. But the following three dependencies seem to be missing in maven central:

javax.jms#jms;1.1!jms.jar
com.sun.jdmk#jmxtools;1.2.1!jmxtools.jar
com.sun.jmx#jmxri;1.2.1!jmxri.jar

As a workaround, I tried to download the JAR from here: https://console.cloud.google.com/storage/browser/spark-lib/bigquery and add it to the classpath with this command: pyspark --jars spark-bigquery-latest.jar. But when I tried to read a table from BigQuery, I get this error: ClassNotFoundException: Failed to find data source: com.google.cloud.spark.bigquery.

I also tried to use com.google.cloud.spark.bigquery instead of just "bigquery" in format(), without success.

FAILED_PRECONDITION: there was an error creating the session: the table has a storage format that is not supported

If I modify the SQL statement in the example (https://github.com/GoogleCloudPlatform/spark-bigquery-connector/blob/master/examples/python/query_results.py) to have a WHERE clause

SELECT * FROM bigquery-public-data.san_francisco.bikeshare_stationss JOINbigquery-public-data.san_francisco.bikeshare_tripst ON s.station_id = t.start_station_id WHERE name = 'Mezes Park'

I am getting error

Waiting for job output...
Querying BigQuery
Reading query results into Spark
Traceback (most recent call last):
File "/tmp/d34ed08caafd490d9d8ec73c0de1c7f0/demo.py", line 30, in
df.show()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o57.showString.
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.FailedPreconditionException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: FAILED_PRECONDITION: there was an error creating the session: the table has a storage format that is not supported
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:59)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:982)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:957)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:515)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:490)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:399)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:507)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:66)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:627)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:515)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:686)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:675)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.storage.v1beta1.BigQueryStorageClient.createReadSession(BigQueryStorageClient.java:237)
at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation.buildScan(DirectBigQueryRelation.scala:83)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:338)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:337)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:415)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:333)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:289)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3254)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
... 1 more
... 1 more
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: FAILED_PRECONDITION: there was an error creating the session: the table has a storage format that is not supported
at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:533)
... 24 more

ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [d34ed08caafd490d9d8ec73c0de1c7f0] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-b888cd53-d066-4ee7-a69f-a8dab6da7af6-us-west2/google-cloud-dataproc-metainfo/e5be50c0-9871-438c-bf8d-dd0f6a42bd04/jobs/d34ed08caafd490d9d8ec73c0de1c7f0/driveroutput'.

Getting error "INVALID_ARGUMENT: Identifier 'order' is a reserved keyword"

This error crops up while loading the dataframe. Coincidentally, a column in the table being loaded has the name "order". Below is the full stack trace:


`com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.InvalidArgumentException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Identifier 'order' is a reserved keyword
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1070)
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1139)
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958)
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:699)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
		at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
		at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
		at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.storage.v1beta1.BigQueryStorageClient.createReadSession(BigQueryStorageClient.java:237)
		at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation.buildScan(DirectBigQueryRelation.scala:84)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:326)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:325)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:381)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:321)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:289)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
		at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
		at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
		at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
		at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
		at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
		at scala.collection.Iterator$class.foreach(Iterator.scala:891)
		at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
		at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
		at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
		at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
		at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
		at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
		at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
		at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
		at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
		at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:100)
		at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:67)
		at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:91)
		at org.apache.spark.sql.Dataset.persist(Dataset.scala:2963)
		at org.apache.spark.sql.Dataset.cache(Dataset.scala:2973)
		at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
		at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.lang.reflect.Method.invoke(Method.java:498)
		at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
		at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
		at py4j.Gateway.invoke(Gateway.java:282)
		at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
		at py4j.commands.CallCommand.execute(CallCommand.java:79)
		at py4j.GatewayConnection.run(GatewayConnection.java:238)
		... 1 more
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Identifier 'order' is a reserved keyword
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:532)
	... 23 more`

Is this connector capable of writing to BigQuery?

The doc seemly implies that the connector can only be used for reading from BigQuery, cannot be used to write to BigQuery, hope we clarify it in the doc. It will help users make correct choice between this connector and the other one.

com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding$DecodingException: Invalid input length 25

When I am running locally I can connect to bigquery but running on dataproc is throwing below error
Exception in thread "main" java.lang.IllegalArgumentException: com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding$DecodingException: Invalid input length 25
at com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding.decode(BaseEncoding.java:219)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.util.Base64.decodeBase64(Base64.java:104)
at com.google.cloud.spark.bigquery.SparkBigQueryOptions.createCredentials(SparkBigQueryOptions.scala:47)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider$.createBigQuery(BigQueryRelationProvider.scala:125)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider$$anonfun$getOrCreateBigQuery$1.apply(BigQueryRelationProvider.scala:107)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider$$anonfun$getOrCreateBigQuery$1.apply(BigQueryRelationProvider.scala:107)
at scala.Option.getOrElse(Option.scala:121)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.getOrCreateBigQuery(BigQueryRelationProvider.scala:107)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelationInternal(BigQueryRelationProvider.scala:53)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:37)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at com.paloaltonetworks.GenarateCorruptIdForCleaning$.main(GenarateCorruptIdForCleaning.scala:41)

RESOURCE_EXHAUSTED: a single table row is larger than the maximum message size

We're hitting the following error:

com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ResourceExhaustedException: 
com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: 
RESOURCE_EXHAUSTED: there was an error operating on {REDACTED}: a single table row is larger than the maximum message size

This might be a duplicate of #22

Specifying a custom service account for dataproc cluster causes error

When specifying a custom service account for the dataproc cluster:

data_proc_image: 1.4.0-debian9
spark-bigquery-connector: 0.7.0-beta

A problem has occured while running task
io.grpc.ManagedChannelProvider$ProviderNotFoundException: No functional channel service provider found. Try adding a dependency on the grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact
	at io.grpc.ManagedChannelProvider.provider(ManagedChannelProvider.java:60)
	at io.grpc.ManagedChannelBuilder.forAddress(ManagedChannelBuilder.java:37)
	at com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:194)
	at com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel(InstantiatingGrpcChannelProvider.java:169)
	at com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel(InstantiatingGrpcChannelProvider.java:156)
	at com.google.api.gax.rpc.ClientContext.create(ClientContext.java:157)
	at com.google.cloud.bigquery.storage.v1beta1.stub.EnhancedBigQueryStorageStub.create(EnhancedBigQueryStorageStub.java:90)
	at com.google.cloud.bigquery.storage.v1beta1.BigQueryStorageClient.<init>(BigQueryStorageClient.java:144)
	at com.google.cloud.bigquery.storage.v1beta1.BigQueryStorageClient.create(BigQueryStorageClient.java:125)
	at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation$.createReadClient(DirectBigQueryRelation.scala:170)
	at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation$$anonfun$$lessinit$greater$default$3$1.apply(DirectBigQueryRelation.scala:42)
	at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation$$anonfun$$lessinit$greater$default$3$1.apply(DirectBigQueryRelation.scala:42)
	at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation.buildScan(DirectBigQueryRelation.scala:81)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:326)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:325)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:381)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:321)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:289)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3038)
	at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3036)

Without a custom service account, seems everything is fine. However, i need to use a custom service account for cross account access to BQ:

com.google.cloud.bigquery.BigQueryException: Access Denied: Table [[ omitted ]]: The user [email protected] does not have bigquery.tables.get permission for table

Expand predicate filters

This is blocked on the API limits at the moment.

Dynamically generate tables in integration tests

Add test case covering 'credentials' and 'credentialsFile' options

Zero column projection not handled correctly

When a simple count(*) is done on a table spark pushes down a zero column projection. This connector passes this empty list down to TableReadOptions here:

https://github.com/GoogleCloudPlatform/spark-bigquery-connector/blob/master/src/main/scala/com/google/cloud/spark/bigquery/direct/DirectBigQueryRelation.scala#L79

However the API for this states "If empty, all fields will be read" see https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1beta1#google.cloud.bigquery.storage.v1beta1.TableReadOptions

The result is all columns are read for a simple table count.

Support partitioning pseudo columns

This is tricky, because AFAIK Spark has no support for pseudo columns

val df = spark.read.format("bigquery").option("table", TABLE) \
  .where($"_PARTITIONDATE" > "2019-01-01")

would require materializing _PARTITIONDATE (at least in the schema), which the BigQuery Storage API does not support.

A work around today is to use the filter option

val df = spark.read.format("bigquery").option("table", TABLE) \
  .option("filter", "_PARTITIONDATE > '2019-01-01'")

should work today

Support reading from BQ Views

Currently reading from BigQuery Views results in the below exception. Views are used for abstracting the base tables and provide access controls on BigQuery.

To replicate the issue:

Create a view on BigQuery
Use the view name as Table Name for spark.read operation

Exception:

Exception in thread "main" java.lang.RuntimeException: Table type VIEW is not supported.
at scala.sys.package$.error(package.scala:27)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelationInternal(BigQueryRelationProvider.scala:55)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:40)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at com.google.cloud.spark.bigquery.package$BigQueryDataFrameReader.bigquery(package.scala:24)
at com.it.bi.etl.ATR_Repurchase$.main(ATR_Repurchase.scala:18)
at com.it.bi.etl.ATR_Repurchase.main(ATR_Repurchase.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Add timezone property to support TIME & DATETIME parsing

Change the shaded jar name from classifier to a new artifact name

At the moment the shaded jar is published as a classifier on the original jar (such as com.google.cloud.spark.bigquery:spark-bigquery_2.11:shaded) but unfortunately spark does not permit the usage of a classifier.

It means that the shaded version should be deployed under another artifact name.

How to query bigquery table?

I want to perform a select inner join query on my BigQuery table and get the result. How can I do this using the connector? I know it is possible to apply filters but my sql statement is pretty long and I don't think it'd be possible for me to do this since I also have udfs.

Is there an option I can pass, something like query, to do this?

Uneven distribution of data in partitions

We were trying to run a count query on a public bigquery dataset (with 65GB data / 185,666,648 rows) with this connector. And noticed that initial tasks were taking around 1.5 min and later ones are taking only 0.3 sec.

After some digging what we found out that there was an uneven distribution of rows across that partitions. It divided the query into 140 tasks and out of 140 only initial 62 tasks got assigned some data, others were left empty.

Is this expected? If so, I think the expected behavior should be to equally distribute the tasks or create fewer partitions. I am attaching a snippet of my code.

import com.google.cloud.spark.bigquery._

val df = spark.read
              .option("parallelism", "1024")
              .bigquery("bigquery-public-data:chicago_taxi_trips.taxi_trips")

val sizes = df.rdd.mapPartitions(iter => Array(iter.size).iterator, true).collect

>>> Output
sizes: Array[Int] = Array(2905482, 3286156, 2916555, 3249520, 3038311, 3177630, 3198400, 3472010, 3205490, 3225875, 3192130, 3188205, 3279867, 3200256, 3233956, 2865080, 3272523, 3201758, 3207450, 3165187, 3212764, 2837254, 3209174, 2881729, 3263254, 3295285, 3245724, 3209642, 3261901, 3470650, 3174603, 3228765, 2919202, 3269729, 3237928, 3259147, 3224187, 3309382, 3088440, 3215571, 3212322, 3200407, 3216581, 3192071, 3229153, 3227973, 3260958, 2882677, 2660676, 2872312, 3252960, 3219197, 2826124, 2564174, 2556998, 2593512, 1964467, 2166998, 2152987, 1284772, 1310224, 1522933, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

Support Numeric type

unexpected EOS on DATA frame from server

Hello guys,

Will i'm trying to fetch data from BigQuery, unfortunately i'm getting this issue

For this example the table contains 34GB of data

scala> var df = spark.read.option("parallelism", "100").bigquery("project:dataset.table")
scala> var size = df.rdd.mapPartitions(iter => Array(iter.size).iterator, true).collect
[Stage 0:===> (7 + 6) / 100

Some stages failed with exception caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.InternalException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INTERNAL: Received unexpected EOS on DATA frame from server.

Also the job takes over 10 min to compute the result, which isn't normal

Does anybody face the same issue ?

Thank you guys

Cannot read from dataset in regional location

Hi Guys, first thanks a lot for the spark connector!

I'm not sure if this is issue in the connector itself, or the bigquery java sdk client that is used behind the scene, or the service itself, or lastly my bad somewhere I can't see at the moment (or lack of knowledge). There's also chance I missed something, but it's such a simple hello world example that I doubt, if so prove me wrong, I'll be more than happy.

Namely, I can't read the data from the dataset that is created in regional location (us-east4) whereas multi-regional works just fine. I even tried simplest possible spark application that just reads data from existing table in such dataset and prints few rows, without any sophisticated transformation. What's more interesting write operation works fine in regional location, only read operation does not work, which seems odd.

It looks like there's some connection issue to the service itself and finally it exits with 503 Service Unavailable error.

The exact same simple application works just fine when the dataset is multi-regional (US).

When I debugged the sources, it hangs in DirectBigQueryRelation.scala when it tries to create read session, line 115:

      val session = client.createReadSession(
        CreateReadSessionRequest.newBuilder()
            .setParent(s"projects/${options.parentProject}")
            .setFormat(DataFormat.AVRO)
            .setRequestedStreams(numPartitionsRequested)
            .setReadOptions(readOptions)
            .setTableReference(actualTableReference)
            // The BALANCED sharding strategy causes the server to assign roughly the same
            // number of rows to each stream.
            .setShardingStrategy(ShardingStrategy.BALANCED)
            .build())

I'm getting following exception:

19/11/21 10:38:20 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.UnavailableException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: UNAVAILABLE: 503:Service Unavailable
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:69)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1015)
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1137)
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:957)
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:515)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:490)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:399)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:510)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:66)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:630)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:518)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:692)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:681)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
	at java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
		at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
		at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
		at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.storage.v1beta1.BigQueryStorageClient.createReadSession(BigQueryStorageClient.java:237)
		at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation.buildScan(DirectBigQueryRelation.scala:115)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:326)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:325)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:403)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:321)
		at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:289)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
		at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
		at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
		at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
		at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
		at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
		at scala.collection.Iterator$class.foreach(Iterator.scala:891)
		at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
		at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
		at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
		at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
		at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
		at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)

I tried to query that dataset using gcloud bq command line tool, and it returned me the data, using exact same credentials I used for my simple spark app, which has Owner role by the way.

I also tried to test that app on Dataproc cluster (with the newest image) that was created in the exact same regional location, and write worked fine whereas read failed with same error.

Finally, I tried all the versions starting from 0.9.x , including recent 0.10.x release with same result, spark was 2.4.4 and scala 2.11.

Credentials configuration inconstent with GSC spark connector

Read https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md#configure-hadoop and then https://github.com/GoogleCloudPlatform/spark-bigquery-connector#how-do-i-authenticate-outside-gce--dataproc
Observed: GCS connector uses keys under google.cloud.auth prefix to configure credentials, for example google.cloud.auth.service.account.json.keyfile. When setting from Spark code, there's additional spark.hadoop prefix. BigQuery connector documentation say to do:
spark.conf.set("credentialsFile", "</path/to/key/file>")
Expected: since both projects aim to interop with Google services closely related to each other, I would expect that I can configure my spark session with a single set of keys. Furthermore, I would not expect to ever use unqualified credentialsFile - if everybody do that, sooner or later two packages would clash.

No data in dataframe when queried from bigquery streaming buffer

Hi,

I stream some data into bigquery and the data are still in bigquery streaming buffer. Then I use this spark-bigquery-connector to query the streaming data from bigquery, however, I always got 0 data in the dataframe. And the data are only available to be queried after they are inserted into bigquery from the streaming buffer. Is this issue designed by default? Or this is a problem you can help to solve?

Thank you!

Reading only requested number of partitions from BQ table doesn't work

I am trying to query table which is partitioned by date field. Like this,

val prog_logs = spark.read.format("bigquery")
        .option("table", "project1:dataset.table")
        .option("filter", " date between '2019-09-10' and  '2019-09-11' ")
        .load()
        .cache()

This is reading entire table instead of only '2019-09-10' and '2019-09-11' partitions.

How to filter Partitioned by field

Hi, MyTable is partitioned by dt (DATE type field) and required partition filter. So, I need to add filter about dt but _PARTITIONDATE option is only for partitioned by ingestion time.
Is it available to use raw query?

Ex.

Table schema(MyTable)

[
    {
        "name": "user_id",
        "type": "INTEGER",
        "mode": "REQUIRED"
    },
    {
        "name": "dt",
        "type": "DATE",
        "mode": "REQUIRED"
    },
]

Code

    val df = spark
      .read
      .format("com.google.cloud.spark.bigquery")
      .option("credentialsFile", keyFilePath)
      .option("parentProject", projectId)
      .option("project", projectId)
      .option("table", tableId)
      .load()
      .cache()

Error

Cannot query over table 'TABLE_ID' without a filter over column(s) 'dt' that can be used for partition elimination

Better documentation about what gets pushed down to BigQuery

Hello,

I'm working through a project using this connector, and I can't find what does and does not get pushed down to BigQuery.

E.g. Where clauses? GroupBy? Distinct? Joins?

Get error GoogleHadoopFileSystem not found when upgrade to the latest lib

Hi,

I got the following error after I upgraded to the latest lib:
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found at java.util.ServiceLoader.fail(ServiceLoader.java:239) at java.util.ServiceLoader.access$300(ServiceLoader.java:185) at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372) at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) at java.util.ServiceLoader$1.next(ServiceLoader.java:480)

I didn't get any error when I used the old version of lib (0.8.1-beta).

Could you help to solve this problem?
Thank you!

googleclouddataproc / spark-bigquery-connector Goto Github PK

spark-bigquery-connector's Introduction

Apache Spark SQL connector for Google BigQuery

BigQuery Storage API

Direct Streaming

Filtering

Column Filtering

Predicate Filtering

Dynamic Sharding

Requirements

Enable the BigQuery Storage API

Create a Google Cloud Dataproc cluster (Optional)

Downloading and Using the Connector

Connector to Spark Compatibility Matrix

Connector to Dataproc Image Compatibility Matrix

Maven / Ivy Package Usage

Specifying the Spark BigQuery connector version in a Dataproc cluster

Hello World Example

Example Codelab

Usage

Reading data from a BigQuery table

Reading data from a BigQuery query

Reading From Views

Writing data to BigQuery

Direct write using the BigQuery Storage Write API

Indirect write

Properties

Data types

Spark ML Data Types Support

Numeric and BigNumeric support

Filtering

Partitioned Tables

Configuring Partitioning

Tagging BigQuery Resources

Adding BigQuery Jobs Labels

Adding BigQuery Storage Trace ID

Using in Jupyter Notebooks

Compiling against the connector

Maven

SBT

Connector metrics and how to view them

FAQ

What is the Pricing for the Storage API?

I have very few partitions

I get quota exceeded errors while writing

How do I authenticate outside GCE / Dataproc?

How do I connect to GCP/BigQuery via Proxy?

spark-bigquery-connector's People

Contributors

Stargazers

Watchers

Forkers

spark-bigquery-connector's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs