Comments (1)
So, I have resolved this issue. I am writing down my approach so that anyone using Vertex AI Pipelines with Dataproc Serverless who wants to use DataprocPySparkBatchOp() can easily do so.
History:
DataprocPySparkBatchOp() is a powerful operator that lets you spark-submit your PySpark jobs to Dataproc Serverless. However, as of Jan 10, 2024, Dataproc Serverless doesn't come with spark-bigquery-connector by default, so a PySpark job that depends on the connector will fail because it doesn't have access to it.
The parameter examples in the Dataproc Serverless documentation weren't clear, so I had to do some brute-force testing with different versions of the connector and Spark.
Solution:
# Import the Dataproc PySpark batch op component from google_cloud_pipeline_components:
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp

PROJECT_ID = 'your GCP project id'
LOCATION = 'us-central1'  # just an example; change it to wherever you want to run your Spark job
PYSPARK_FILE_URI = 'gs://<your gcs link to pyspark script>'
# You can put your custom service account here. The default would be the Compute Engine SA.
# Ensure that the SA has Dataproc Editor access.
SERVICE_ACCOUNT = 'your service account'

# Placeholders for the paths passed to the script (these were not defined in my original snippet):
GCS_DATA_INPUT = 'gs://<your input path>'
GCS_DATA_OUTPUT = 'gs://<your output path>'

# Sample args (arguments you want to pass to the script):
ARGS = [
    "--input",
    GCS_DATA_INPUT,
    "--output",
    GCS_DATA_OUTPUT,
]

# For more runtime config: https://cloud.google.com/dataproc-serverless/docs/concepts/versions/dataproc-serverless-versions
RUNTIME_CONFIG_VERSION = '2.2'
RUNTIME_CONFIG_PROPERTIES = {
    'dataproc.sparkBqConnector.version': '0.35.1',
    'dataproc.sparkBqConnector.uri': 'gs://spark-lib/bigquery/spark-3.3-bigquery-0.35.1.jar',
}

# List of jar file URIs:
JAR_FILE_URIS = ['gs://spark-lib/bigquery/spark-3.3-bigquery-0.35.1.jar']

# Create the batch task (this call goes inside your pipeline definition):
dataproc_op = DataprocPySparkBatchOp(
    project=PROJECT_ID,
    location=LOCATION,
    main_python_file_uri=PYSPARK_FILE_URI,
    service_account=SERVICE_ACCOUNT,
    args=ARGS,
    runtime_config_version=RUNTIME_CONFIG_VERSION,
    runtime_config_properties=RUNTIME_CONFIG_PROPERTIES,
    jar_file_uris=JAR_FILE_URIS,
    # batch_id=batch_id,  # `batch_id` is optional
)
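Since the connector version shows up in three places in the solution above (the version property, the URI property, and the jar list), a small helper can keep them in sync while you test different combinations. This is only a minimal sketch: the function name bigquery_connector_config is hypothetical, and it assumes the jar naming pattern gs://spark-lib/bigquery/spark-<spark>-bigquery-<version>.jar seen above also holds for other versions, which I have not verified.

```python
def bigquery_connector_config(spark_version: str, connector_version: str) -> dict:
    """Build the connector-related settings for one Spark/connector pair.

    Assumption: the jar lives at the same path pattern as the one used in
    the solution above. Check that the jar actually exists before relying on it.
    """
    jar_uri = (
        "gs://spark-lib/bigquery/"
        f"spark-{spark_version}-bigquery-{connector_version}.jar"
    )
    return {
        # Same two property keys as in the solution above.
        "runtime_config_properties": {
            "dataproc.sparkBqConnector.version": connector_version,
            "dataproc.sparkBqConnector.uri": jar_uri,
        },
        # The same jar is also passed explicitly as a jar file URI.
        "jar_file_uris": [jar_uri],
    }


# Reproduces the values used in the solution above.
config = bigquery_connector_config("3.3", "0.35.1")
```

You would then pass runtime_config_properties=config["runtime_config_properties"] and jar_file_uris=config["jar_file_uris"] to DataprocPySparkBatchOp() instead of the hard-coded constants.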