How to configure `spark-bigquery` connector in KFP (Kubeflow) GCP DataProc operator which uses DataProc serverless under the hood? (spark-bigquery-connector, closed)

gomrinal commented on June 13, 2024
How to configure `spark-bigquery` connector in KFP (Kubeflow) GCP DataProc operator which uses DataProc serverless under the hood?

Comments (1)

gomrinal commented on June 13, 2024

So, I have resolved this issue. I am writing down my approach so that anyone using Vertex AI Pipelines with Dataproc Serverless who wants to use DataprocPySparkBatchOp() can easily do so!

History:
DataprocPySparkBatchOp() is a powerful operator that lets you spark-submit your PySpark jobs to Dataproc Serverless. However, as of Jan 10, 2024, Dataproc Serverless doesn't come with spark-bigquery-connector by default, so a PySpark job that depends on the connector will fail because it can't access it.

The parameter examples in the Dataproc Serverless documentation weren't clear, so I had to do some brute-force testing with different versions of the connector and Spark.

Solution:

# Imports Dataproc pyspark batch op component from google_cloud_pipeline_components:
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp

PROJECT_ID = 'your GCP project id'
LOCATION = 'us-central1'  # just an example; change it to the region where you want to run your Spark job
PYSPARK_FILE_URI = 'gs://<your gcs link to pyspark script>'
# Custom service account (optional). The default is the Compute Engine SA.
# Ensure that the SA has Dataproc Editor access.
SERVICE_ACCOUNT = '<your service account email>'
# Placeholder GCS paths used by the sample args below; replace with your own:
GCS_DATA_INPUT = 'gs://<your input path>'
GCS_DATA_OUTPUT = 'gs://<your output path>'
# Sample args (arguments you want to pass to the script):
ARGS = [
    "--input",
    GCS_DATA_INPUT,
    "--output",
    GCS_DATA_OUTPUT,
]
# For more runtime config: https://cloud.google.com/dataproc-serverless/docs/concepts/versions/dataproc-serverless-versions

RUNTIME_CONFIG_VERSION = '2.2'
# The connector version and the jar URI below must refer to the same release.
RUNTIME_CONFIG_PROPERTIES = {
    'dataproc.sparkBqConnector.version': '0.35.1',
    'dataproc.sparkBqConnector.uri': 'gs://spark-lib/bigquery/spark-3.3-bigquery-0.35.1.jar',
}

# List of jar file uris:
JAR_FILE_URIS = ['gs://spark-lib/bigquery/spark-3.3-bigquery-0.35.1.jar']

dataproc_op = DataprocPySparkBatchOp(
    project=PROJECT_ID,
    location=LOCATION,
    main_python_file_uri=PYSPARK_FILE_URI,
    service_account=SERVICE_ACCOUNT,
    args=ARGS,
    runtime_config_version=RUNTIME_CONFIG_VERSION,
    runtime_config_properties=RUNTIME_CONFIG_PROPERTIES,
    jar_file_uris=JAR_FILE_URIS,
    # batch_id=batch_id,  # `batch_id` is optional
)
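
Since the connector version appears in three places above (the version property, the URI property, and the jar list), it is easy to let them drift apart when trying other versions. Here is a minimal sketch of a helper that derives all three from one version string. The function name `bigquery_connector_kwargs` is hypothetical, and the jar URI pattern is only an assumption inferred from the single URI shown above, so check that the jar actually exists for the versions you pick.

```python
def bigquery_connector_kwargs(connector_version, spark_version="3.3"):
    """Build the connector-related keyword arguments from one version string.

    Hypothetical helper. The URI pattern is assumed from the one example
    jar in this thread and is not guaranteed for every version pair.
    """
    jar_uri = (
        f"gs://spark-lib/bigquery/"
        f"spark-{spark_version}-bigquery-{connector_version}.jar"
    )
    return {
        # Same property keys as in the solution above.
        "runtime_config_properties": {
            "dataproc.sparkBqConnector.version": connector_version,
            "dataproc.sparkBqConnector.uri": jar_uri,
        },
        "jar_file_uris": [jar_uri],
    }


connector_kwargs = bigquery_connector_kwargs("0.35.1")
print(connector_kwargs["jar_file_uris"][0])
```

The result can then be unpacked into the operator call in place of the two hand-written arguments, e.g. `DataprocPySparkBatchOp(..., **connector_kwargs)`.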
