
Comments (9)

vishalkarve15 avatar vishalkarve15 commented on June 13, 2024

@katerina-kogan Can you please share the exact steps and a code sample to reproduce the issue?

from spark-bigquery-connector.

katerina-kogan avatar katerina-kogan commented on June 13, 2024

@vishalkarve15, please see the steps to reproduce below:

  1. dbt Cloud Python model in fct_model.py:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

TWO_YEARS_DAYS = 365*2


def model(dbt, session):
    session.conf.set("viewsEnabled","true")

    table1_df = dbt.source("dataset1", "table1")
    table2_df = dbt.source("dataset1", "table2")
    table3_df = dbt.source("dataset2", "table3")
    table4_df = dbt.source("dataset2", "table4")
    table5_df = dbt.ref("ref_table")


    return table1_df \
        .join(table2_df, table1_df.o_id == table2_df.oo_id, "inner") \
        .join(table4_df, table4_df.o_id == table2_df.id, "inner") \
        .join(table3_df, table4_df.g_id == table3_df.id, "inner") \
        .join(table5_df, table4_df.status == table5_df.status_code, "left") \
        .filter(table1_df.created >= F.date_sub(F.current_date(), TWO_YEARS_DAYS)) \
        .filter(table4_df.created >= F.date_sub(F.current_date(), TWO_YEARS_DAYS)) \
        .withColumn("attempt_number", F.row_number().over(Window.partitionBy(table1_df.id).orderBy(table2_df.created_at))) \
        .withColumn("payment_type", F.coalesce(F.get_json_object(table4_df["extra"], "$.result.data.payment_type"), table3_df["payment_type"])) \
        .select(
            table2_df["customer_id"],
            F.to_date(table1_df["created"]).alias("created"),
            F.col("attempt_number"),
            F.col("status_desc").alias("status"),
            F.col("payment_type")
        ).repartition(10)
  2. Description of the model in model.yml:
version: 2

models:
  - name: fct_model
    description: fct_model
    config:
      submission_method: cluster
    columns:
      - name: customer_id
        description: customer_id
      - name: created
        description: created
      - name: attempt_number
        description: attempt_number
      - name: status_desc
        description: status_desc
      - name: payment_type
        description: payment_type
  3. The model runs on Dataproc Serverless via the "dbt build" command in dbt Cloud.
  4. The first run succeeds: it creates the fct_model table in BigQuery and loads the data.
  5. Any subsequent run fails with the error: "Destination schema is not compatible".

P.S. As stated in the issue description, if the columns section is absent from the model.yml file above, the model runs fine. Once the section is added, subsequent runs start failing. I can't say for certain that the columns section is the cause; it may be an unfortunate coincidence with something else going wrong behind the scenes.
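One thing worth double-checking (a hypothetical cause inferred only from the snippets above, not confirmed by the maintainers): the model's .select() aliases status_desc to status, while model.yml documents a status_desc column. A quick diff of the two column lists, hand-copied from the snippets, shows the mismatch:

```python
# Column names hand-copied from the snippets above (an illustrative check,
# not part of the actual dbt/connector code).
model_columns = ["customer_id", "created", "attempt_number", "status", "payment_type"]  # from .select()
yml_columns = ["customer_id", "created", "attempt_number", "status_desc", "payment_type"]  # from model.yml

# Symmetric difference: columns present in one list but not the other.
mismatch = set(model_columns) ^ set(yml_columns)
print(mismatch)  # {'status', 'status_desc'}
```

If the destination table's schema is validated against the documented columns on subsequent runs, a mismatch like this could plausibly surface as a "Destination schema is not compatible" error, though only the maintainers can confirm whether that check is involved.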

Thank you!


katerina-kogan avatar katerina-kogan commented on June 13, 2024

Updated issue details: according to the docs, Dataproc Serverless uses the built-in connector.
I also tested on a cluster and specified the connector version directly, as the docs suggest:
SPARK_BQ_CONNECTOR_VERSION=0.34.0
and it works fine.
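For anyone following along, pinning the connector version on a cluster is done through cluster metadata at creation time. A sketch, assuming the SPARK_BQ_CONNECTOR_VERSION metadata key described in the Dataproc docs; the cluster name, region, and image version below are placeholders:

```shell
# Hypothetical cluster creation pinning the Spark BigQuery connector version.
# CLUSTER_NAME, region, and image version are placeholders for your own values.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=us-central1 \
    --image-version=2.1-debian11 \
    --metadata=SPARK_BQ_CONNECTOR_VERSION=0.34.0
```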


vishalkarve15 avatar vishalkarve15 commented on June 13, 2024

"It works fine" meaning this issue does not occur with the 0.34.0 connector, even with the columns section included?


katerina-kogan avatar katerina-kogan commented on June 13, 2024

I believe the issue is with the built-in connector, but the docs don't say which exact version it is.


vishalkarve15 avatar vishalkarve15 commented on June 13, 2024

If you're using Serverless 2.1, it comes with the built-in 0.28.1. See https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-versions
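If the built-in 0.28.1 is the problem, one possible workaround (a sketch, not verified against this setup, and user-supplied jars can conflict with the built-in connector on the classpath) is to pass a newer connector jar explicitly when submitting the Serverless batch. Google publishes the connector jars under gs://spark-lib/bigquery/:

```shell
# Hypothetical batch submission supplying a newer connector jar.
# fct_model.py and the region are placeholders for your own values.
gcloud dataproc batches submit pyspark fct_model.py \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.34.0.jar
```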


katerina-kogan avatar katerina-kogan commented on June 13, 2024

Hi @vishalkarve15, we developed on a Dataproc cluster with Spark 3.3.2:
https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.1


vishalkarve15 avatar vishalkarve15 commented on June 13, 2024

That one comes with 0.27.1. We updated the documentation yesterday to reflect this.
I'm closing this since it has been fixed in the latest release, 0.34.0.
Feel free to reopen if you still face issues.


katerina-kogan avatar katerina-kogan commented on June 13, 2024

Sorry folks, I have to open another issue: #1158
Same error, but now happening with version 0.35.

