Comments (5)
I was the one who asked the SO question, and the notebook example helped me fix the issue.
It's probably enough to add something like:
To run this from within a Dataproc Jupyter instance, be sure to start a Python notebook (not PySpark) and run:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName('EDA') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar') \
    .getOrCreate()
You can then load data as a Spark DataFrame with:
s = spark.read \
    .format('bigquery') \
    .option('table', '{_project_}.{_database_}.{_table_name_}') \
    .load()
from spark-bigquery-connector.
That said, I'm seeing terrible performance when I try to run anything.
Even something like
s = spark.read.format('bigquery').option('table', 'my_project.all_data.stores').load()
s.take(1)
is taking minutes... and then crashing.
Also,
s.columns
results in a crash.
That has nothing to do with this issue, though. I'll do some googling and then ask again.
A correction on the above: I ran
%%time
s.columns
and got an answer, but it took ~5 minutes. The %%time command did not register this; it clocked a wall time of 57.5 µs.
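When %%time and the observed wait disagree this much, one way to cross-check is to time the call by hand. The following is only a minimal sketch: the `timed` helper is a hypothetical name, not part of PySpark or the connector, and it does not explain why %%time reported such a small value.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds).

    Hypothetical helper for illustration; uses a monotonic clock so the
    measurement is not affected by system clock adjustments.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# In the notebook, assuming `s` is the DataFrame loaded earlier:
# columns, seconds = timed(lambda: s.columns)
# print(f'{seconds:.1f} s')
```

Wrapping the `spark.read ... .load()` call in the same helper would show whether the delay comes from loading the table rather than from `s.columns` itself.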
Just found that changing things to
s = spark.read \
    .format('bigquery') \
    .option('project', '{_project_name_}') \
    .option('table', '{_database_}.{_table_name_}') \
    .load()
resolved many of my issues, though
s.take(1)
results in
Server Connection Error
Invalid response: 504
Py4JJavaError: An error occurred while calling o119.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in
stage 1.0 [...]: java.lang.ClassNotFoundException:
com.google.cloud.spark.bigquery.direct.BigQueryPartition
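The change that helped above was passing the project separately from the dataset and table name. A small sketch of how a fully qualified name could be split into those two options follows. The helper names `split_table_id` and `bigquery_read_options` are hypothetical, and the code assumes a project ID without dots; it does nothing about the ClassNotFoundException shown above.

```python
def split_table_id(table_id):
    """Split 'project.dataset.table' into (project, 'dataset.table').

    Hypothetical helper; assumes exactly three dot-separated parts.
    """
    parts = table_id.split('.')
    if len(parts) != 3:
        raise ValueError(
            f"expected 'project.dataset.table', got {table_id!r}")
    project, dataset, table = parts
    return project, f'{dataset}.{table}'


def bigquery_read_options(table_id):
    """Build the 'project' and 'table' reader options used above."""
    project, table = split_table_id(table_id)
    return {'project': project, 'table': table}


# In the notebook, assuming `spark` is the session created earlier:
# opts = bigquery_read_options('my_project.all_data.stores')
# s = spark.read.format('bigquery').options(**opts).load()
```

`DataFrameReader.options` accepts keyword arguments, so the dictionary can be unpacked directly instead of chaining two `.option` calls.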
https://github.com/GoogleCloudDataproc/spark-bigquery-connector#using-in-jupyter-notebooks