How to configure `spark-bigquery` connector in KFP (Kubeflow) GCP DataProc operator which uses DataProc serverless under the hood? (spark-bigquery-connector, CLOSED)

gomrinal commented on June 13, 2024

How do I configure the `spark-bigquery` connector in the KFP (Kubeflow) GCP Dataproc operator, which uses Dataproc Serverless under the hood?

Comments (1)

gomrinal commented on June 13, 2024

So, I have resolved this issue. I am writing down my approach so that anyone using Vertex AI Pipelines with Dataproc Serverless who wants to use DataprocPySparkBatchOp() can easily do so!

History:
DataprocPySparkBatchOp() is a powerful operator that lets you spark-submit your PySpark jobs to Dataproc Serverless. However, as of Jan 10, 2024, Dataproc Serverless doesn't come with spark-bigquery-connector by default, so a PySpark job that depends on the connector will fail because it doesn't have access to it.

The parameter examples in the Dataproc Serverless documentation weren't clear, so I had to do some brute-force testing with different versions of the connector and Spark.
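When trying different connector and Spark versions, it is easy to let the version property and the jar URI drift apart. Below is a minimal sketch of a helper that derives both from one pair of version strings. The helper name `connector_settings` is hypothetical; the property keys and the jar path are copied from the working configuration in this post, and the assumption that other version pairs follow the same file-naming pattern under `gs://spark-lib/bigquery/` is mine, so check that the jar actually exists before using a different pair.

```python
def connector_settings(spark_version, connector_version):
    """Build runtime properties and jar URIs for one Spark/connector pair.

    Assumption: jars follow the naming pattern of the one known-good jar,
    gs://spark-lib/bigquery/spark-3.3-bigquery-0.35.1.jar
    """
    jar_uri = (
        "gs://spark-lib/bigquery/"
        f"spark-{spark_version}-bigquery-{connector_version}.jar"
    )
    properties = {
        "dataproc.sparkBqConnector.version": connector_version,
        "dataproc.sparkBqConnector.uri": jar_uri,
    }
    # The same jar is listed in both places, as in the working configuration.
    return properties, [jar_uri]


# The pair that worked for me:
properties, jar_file_uris = connector_settings("3.3", "0.35.1")
print(properties)
print(jar_file_uris)
```

The two return values can be passed as `runtime_config_properties` and `jar_file_uris` respectively.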

Solution:

# Imports Dataproc pyspark batch op component from google_cloud_pipeline_components:
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp

PROJECT_ID = 'your GCP project id'
LOCATION = 'us-central1'  # just an example; change it to the region where you want to run your Spark job
PYSPARK_FILE_URI = 'gs://<your GCS path to the PySpark script>'
# Custom service account (optional). The default is the Compute Engine SA.
# Ensure that the SA has Dataproc Editor access.
SERVICE_ACCOUNT = '<your service account email>'
# Placeholders for the script's input and output locations (not defined in my original snippet):
GCS_DATA_INPUT = 'gs://<your input path>'
GCS_DATA_OUTPUT = 'gs://<your output path>'
# Sample args (arguments you want to pass to the script):
ARGS = [
    "--input",
    GCS_DATA_INPUT,
    "--output",
    GCS_DATA_OUTPUT,
]
# For more runtime config: https://cloud.google.com/dataproc-serverless/docs/concepts/versions/dataproc-serverless-versions

RUNTIME_CONFIG_VERSION = '2.2'
RUNTIME_CONFIG_PROPERTIES = {
    'dataproc.sparkBqConnector.version': '0.35.1',
    'dataproc.sparkBqConnector.uri': 'gs://spark-lib/bigquery/spark-3.3-bigquery-0.35.1.jar',
}

# List of jar file uris:
JAR_FILE_URIS = ['gs://spark-lib/bigquery/spark-3.3-bigquery-0.35.1.jar']

# Note: in a KFP pipeline, call this inside a function decorated with @dsl.pipeline.
dataproc_op = DataprocPySparkBatchOp(
    project=PROJECT_ID,
    location=LOCATION,
    main_python_file_uri=PYSPARK_FILE_URI,
    service_account=SERVICE_ACCOUNT,
    args=ARGS,
    runtime_config_version=RUNTIME_CONFIG_VERSION,
    runtime_config_properties=RUNTIME_CONFIG_PROPERTIES,
    jar_file_uris=JAR_FILE_URIS,
    # batch_id=batch_id,  # `batch_id` is optional
)
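For completeness, here is a rough sketch of how the operator call might sit inside a KFP pipeline and be compiled for Vertex AI Pipelines. It assumes the KFP v2 SDK (`kfp`) and `google_cloud_pipeline_components` are installed. The pipeline name, the output file name, the placeholder project and bucket values, and the `compile_pipeline` helper are all hypothetical; only the connector settings come from the configuration above.

```python
# Placeholder settings; replace the project and script URI with your own.
BATCH_KWARGS = {
    "project": "your-gcp-project-id",  # hypothetical placeholder
    "location": "us-central1",
    "main_python_file_uri": "gs://your-bucket/path/to/script.py",  # hypothetical
    "runtime_config_version": "2.2",
    "runtime_config_properties": {
        "dataproc.sparkBqConnector.version": "0.35.1",
        "dataproc.sparkBqConnector.uri": "gs://spark-lib/bigquery/spark-3.3-bigquery-0.35.1.jar",
    },
    "jar_file_uris": ["gs://spark-lib/bigquery/spark-3.3-bigquery-0.35.1.jar"],
}


def compile_pipeline(package_path="dataproc_bq_pipeline.yaml"):
    """Wrap the batch op in a pipeline and compile it to a spec file."""
    # Imported here so the settings above can be inspected without the SDKs installed.
    from kfp import compiler, dsl
    from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp

    @dsl.pipeline(name="dataproc-serverless-bq-example")
    def pipeline():
        # The operator has to be called inside the pipeline function.
        DataprocPySparkBatchOp(**BATCH_KWARGS)

    compiler.Compiler().compile(pipeline_func=pipeline, package_path=package_path)
    return package_path
```

Calling `compile_pipeline()` should write the pipeline spec to the given path, which can then be submitted as a Vertex AI pipeline job.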
