absaoss / abris

Avro SerDe for Apache Spark structured APIs.

License: Apache License 2.0

avro-schema schema-registry spark kafka avro

abris's Introduction

ABRiS - Avro Bridge for Spark

  • Pain free Spark/Avro integration.

  • Seamlessly integrate with Confluent platform, including Schema Registry with all available naming strategies and schema evolution.

  • Seamlessly convert your Avro records from anywhere (e.g. Kafka, Parquet, HDFS, etc) into Spark Rows.

  • Convert your Dataframes into Avro records without even specifying a schema.

  • Convert back and forth between Spark and Avro (since Spark 2.4).

Coordinates for Maven POM dependency

ABRiS artifacts for Scala 2.11, 2.12 and 2.13 are published to Maven Central.

Supported versions

Abris           Spark           Scala
6.2.0 - 6.x.x   3.2.1 - 3.5.x   2.12 / 2.13
6.0.0 - 6.1.1   3.2.0           2.12 / 2.13
5.0.0 - 5.x.x   3.0.x / 3.1.x   2.12
5.0.0 - 5.x.x   2.4.x           2.11 / 2.12

From version 6.0.0, ABRiS supports only Spark 3.2 and newer.

ABRiS 5.0.x is still supported for older versions of Spark (see branch-5).

Older Versions

This is documentation for Abris version 6. Documentation for older versions is located in corresponding branches: branch-5, branch-4, branch-3.2.

Confluent Schema Registry Version

Abris by default uses Confluent client version 6.2.0.

Installation

Abris needs spark-avro to run, so make sure you include the spark-avro dependency when using Abris. The spark-avro version and the Spark version should be identical.

Example: submitting a Spark job:

./bin/spark-submit \
    --packages org.apache.spark:spark-avro_2.12:3.5.0,za.co.absa:abris_2.12:6.4.0 \
    ...rest of submit params...

Example: using Abris in a Maven project:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.5.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.12</artifactId>
    <version>3.5.0</version> <!-- version must be the same as Spark -->
</dependency>
<dependency>
    <groupId>za.co.absa</groupId>
    <artifactId>abris_2.12</artifactId>
    <version>6.4.0</version>
</dependency>

Example: using Abris in an SBT project:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.0" % Provided,
  "org.apache.spark" %% "spark-avro" % "3.5.0",
  "za.co.absa" %% "abris" % "6.4.0"
)

Usage

In its most basic form, the ABRiS API is almost identical to Spark's built-in Avro support, but it provides additional functionality: mainly Schema Registry support and seamless integration with the Confluent Avro data format.

The API consists of two Spark SQL expressions (to_avro and from_avro) and a fluent configurator (AbrisConfig).

Using the configurator you can choose from four basic config types:

  • toSimpleAvro, toConfluentAvro, fromSimpleAvro and fromConfluentAvro

and then configure what you want to do, mainly how to obtain the Avro schema.

Example of usage:

val abrisConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("topic123")
  .usingSchemaRegistry("http://localhost:8081")

import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.functions.from_avro

val deserialized = dataFrame.select(from_avro(col("value"), abrisConfig) as 'data)
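For the opposite direction (serialising dataframe columns back into Confluent Avro), a minimal sketch is shown below. It assumes the writer-side configurator mirrors the reader-side chain above, so treat the exact method names as an assumption and check the documentation for your ABRiS version.

import org.apache.spark.sql.functions.{col, struct}
import za.co.absa.abris.avro.functions.to_avro

// Assumed writer-side configuration, mirroring the reader-side example above
val toAvroConfig = AbrisConfig
  .toConfluentAvro
  .downloadSchemaByLatestVersion
  .andTopicNameStrategy("topic123")
  .usingSchemaRegistry("http://localhost:8081")

// Wrap all columns in a struct and serialise them into a single binary "value" column
val serialized = dataFrame.select(to_avro(struct(dataFrame.columns.map(col): _*), toAvroConfig) as "value")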

Detailed instructions for many use cases are provided in separate documents in the repository.

Full runnable examples can be found in the za.co.absa.abris.examples package. You can also take a look at unit tests in package za.co.absa.abris.avro.sql.

IMPORTANT: Spark dependencies have provided scope in the pom.xml, so when running the examples, please make sure that you either instruct your IDE to include dependencies with provided scope, or change the scope directly.

Confluent Avro format

The format of Avro binary data is defined in the Avro specification. The Confluent format extends it by prepending the schema id to the actual record. The Confluent expressions in this library expect this format: they add the id after the Avro data is generated and remove it before the data is parsed.
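For illustration, here is a minimal sketch (plain Scala, outside of Spark) of how the schema id can be read from a Confluent-framed payload:

import java.nio.ByteBuffer

// Confluent wire format:
//   byte 0      -> magic byte (0x00)
//   bytes 1..4  -> 4-byte big-endian schema id
//   bytes 5..n  -> Avro-encoded record
def readSchemaId(payload: Array[Byte]): Int = {
  require(payload.length > 5 && payload(0) == 0, "not a Confluent-framed Avro payload")
  ByteBuffer.wrap(payload, 1, 4).getInt
}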

You can find more about Confluent and Schema Registry in Confluent documentation.

Schema Registry security and other additional settings

The only mandatory Schema Registry client setting is the URL, but if you need to provide more, the configurator allows you to pass a whole map of settings.

For example, you may want to provide basic.auth.user.info and basic.auth.credentials.source required for user authentication. You can do it this way:

val registryConfig = Map(
  AbrisConfig.SCHEMA_REGISTRY_URL -> "http://localhost:8081",
  "basic.auth.credentials.source" -> "USER_INFO",
  "basic.auth.user.info" -> "srkey:srvalue"
)

val abrisConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("topic123")
  .usingSchemaRegistry(registryConfig) // use the map instead of just url

Other Features

Generating Avro schema from Spark data frame column

There is a helper method that allows you to generate an Avro schema automatically from a Spark column. Assuming you have a data frame containing a column "input", you can generate the schema for the data in that column like this:

val schema = AvroSchemaUtils.toAvroSchema(dataFrame, "input")
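The returned value is a regular org.apache.avro.Schema, so you can inspect it directly, for example:

// Pretty-print the generated Avro schema as JSON to check what was inferred from the column
println(schema.toString(true))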

Using schema manager to directly download or register schema

You can use SchemaManager directly to perform operations against the Schema Registry. The configuration is identical to that of the Schema Registry Client; SchemaManager is just a wrapper around the client that provides helpful methods and abstractions.

val schemaRegistryClientConfig = Map( ...configuration... )
val schemaManager = SchemaManagerFactory.create(schemaRegistryClientConfig)

// Downloading schema:
val schema = schemaManager.getSchemaById(42)

// Registering schema:
val schemaString = "{...avro schema json...}"
val subject = SchemaSubject.usingTopicNameStrategy("fooTopic")
val schemaId = schemaManager.register(subject, schemaString)

// and more, check SchemaManager's methods

De-serialisation Error Handling

There are three ways ABRiS can handle de-serialisation errors:

FailFast (Default)

If no de-serialisation handler is provided, a failure will result in a Spark exception being thrown and the error being reported. This is the default behaviour.

SpecificRecordHandler

The second option requires providing a default record that will be output in the event of a failure. This default record acts as a marker, meant to be filtered out downstream of ABRiS, so that the Spark job does not stop. Beware, however, that a null or empty record will also result in an error, so choose a default record with distinguishable values.

This can be provided as such:

val abrisConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("topic123")
  .usingSchemaRegistry(registryConfig)
  .withSchemaConverter("custom")
  .withExceptionHandler(new SpecificRecordExceptionHandler(providedDefaultRecord))

This option is only available for Confluent-based configurations, not for plain Avro.

PermissiveRecordExceptionHandler

The third option is to use the PermissiveRecordExceptionHandler. In case of a deserialization failure, this handler replaces the problematic record with a fully null record, instead of throwing an exception. This allows the data processing pipeline to continue without interruption.

The main use case for this option is when you want to prioritize continuity of processing over individual record integrity. It's especially useful when dealing with large datasets where occasional malformed records could be tolerated.

Here's how to use it:

val abrisConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("topic123")
  .usingSchemaRegistry(registryConfig)
  .withSchemaConverter("custom")
  .withExceptionHandler(new PermissiveRecordExceptionHandler())

With this configuration, in the event of a deserialization error, the PermissiveRecordExceptionHandler will log a warning, substitute the malformed record with a fully null one, and allow the data processing pipeline to continue.

Data Conversions

This library also provides convenient methods to convert between Avro and Spark schemas.

If you have an Avro schema which you want to convert into a Spark SQL one - to generate your Dataframes, for instance - you can do as follows:

val avroSchema: Schema = AvroSchemaUtils.load("path_to_avro_schema")
val sqlSchema: StructType = SparkAvroConversions.toSqlType(avroSchema) 

You can also do the inverse operation by running:

val sqlSchema = new StructType(new StructField ....
val avroSchema = SparkAvroConversions.toAvroSchema(sqlSchema, avro_schema_name, avro_schema_namespace)
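As a complete, minimal sketch of this inverse conversion (the field names, record name and namespace below are hypothetical):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import za.co.absa.abris.avro.format.SparkAvroConversions

// Hypothetical Spark schema, used only for illustration
val sqlSchema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true)
))

// The record name and namespace become part of the generated Avro schema
val avroSchema = SparkAvroConversions.toAvroSchema(sqlSchema, "person", "com.example")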

Custom data conversions

If you would like to use custom logic to convert from Avro to Spark, you can implement the SchemaConverter trait. The custom class is loaded in ABRiS using the service provider interface (SPI), so you need to register your class in your META-INF/services resource directory. You can then configure the custom class with its short name or the fully qualified name.

Example

Custom schema converter implementation

package za.co.absa.abris.avro.sql
import org.apache.avro.Schema
import org.apache.spark.sql.types.DataType

class CustomSchemaConverter extends SchemaConverter {
  override val shortName: String = "custom"
  override def toSqlType(avroSchema: Schema): DataType = ???
}

Provider configuration file META-INF/services/za.co.absa.abris.avro.sql.SchemaConverter:

za.co.absa.abris.avro.sql.CustomSchemaConverter

Abris configuration

val abrisConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("topic123")
  .usingSchemaRegistry(registryConfig)
  .withSchemaConverter("custom")

Multiple schemas in one topic

The naming strategies RecordName and TopicRecordName allow a single topic to receive different payloads, i.e. payloads containing different schemas that do not have to be compatible, as explained here.

When you read such data from Kafka, it is stored as a binary column in a dataframe, but once you convert it to Spark types it cannot live in a single dataframe, because all rows in a dataframe must share the same schema.

So if you have multiple incompatible types of Avro data in one dataframe, you must first split it into several dataframes, one per schema. Then you can use Abris to convert the Avro data, as sketched below.
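One way to do the splitting (a sketch only; kafkaDf is assumed to be a dataframe read from Kafka with a binary "value" column) is to extract the Confluent schema id from the payload header and filter on it:

import org.apache.spark.sql.functions.{col, conv, hex, substring}

// The Confluent wire format is: magic byte (0x00) + 4-byte big-endian schema id + Avro payload,
// so the schema id can be extracted from the binary "value" column.
val withId = kafkaDf.withColumn(
  "schema_id",
  conv(hex(substring(col("value"), 2, 4)), 16, 10).cast("int")
)

// One dataframe per schema; each can then be deserialized with its own AbrisConfig.
val fooDf = withId.filter(col("schema_id") === 42) // 42 and 43 are hypothetical schema ids
val barDf = withId.filter(col("schema_id") === 43)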

How to measure code coverage

./mvn clean verify -Pcode-coverage,scala-2.12
or
./mvn clean verify -Pcode-coverage,scala-2.13

Code coverage reports will be generated on paths:

{local-path}\ABRiS\target\jacoco

Copyright 2018 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

abris's People

Contributors

algorri94, cerveada, danwertheimer, dinedal, felipemmelo, froesef, georgichochov, jhsb25, kevinwallimann, kylelmiller, miroslavpojer, nastasiasaby, ni-mi, scaddingj, sheelc, strabox, timvw, willianmrs, yokhan-dukhin, zejnilovic


abris's Issues

please publish for scala version

I would expect the dependency to be:
organization za.co.absa
package abris_2.11
version 2.0.0

and artifact pom path for example:
za/co/absa/abris_2.11/2.0.0/abris_2.11-2.0.0.pom

Soon Spark will be published for both Scala 2.11 and Scala 2.12, and then the current scheme of not appending the Scala version will become an issue.

thanks!
koert

Consuming from Kafka topic

Hello,

I have a setup in EMR where I'm consuming from a kafka topic using spark structured streaming in batches. I noticed in the logs there is a repeated message between the beginning and the end of the stage of each cycle.
The message is:

Creating adapter for strategy: topic.name

This message gets repeated and then there is a gap in the log time just when the cycle is done and the following message is displayed:

Executor: Finished task #.# in stage ##.#. #### bytes result sent to driver

My goal is to tune the performance of extracting from a Kafka topic, and the time gap between the beginning and the end of that message is what takes the longest.
Can you please explain to me what is happening when that message is repeated?
Also, what are some techniques to tune up the performance of the consumer when there is a schema registry running with avro format?

Thank you

What could ABRiS do that Databricks couldn't at that moment?

Hello, I saw your ABRiS conference on the Databricks channel.

https://databricks.com/session/abris-avro-bridge-for-apache-spark

At the conference you explained what ABRiS could do that Databricks couldn't do at that moment.

It's been more than a year since then.

Could you please describe which gaps ABRiS still fills today (mainly as far as Confluent is concerned)?

It would be very useful, as I believe that both ABRiS and Databricks have evolved since then.

Thank you

Previous record information pulled to next record for null columns

Has anyone faced an issue where information from the previous record gets pulled into the current record for null columns when using the from_confluent_avro method? The source team states that the Avro they pushed doesn't contain that information, yet when the method is used, the previous record's information is populated into the current record.

[WARNING] Avro: Invalid default for field metadata: "null" not a ["null"

Hi,
We have a use case where we need to consume Avro data from a Kafka topic and ingest it into HDFS using Spark 2.2. I have followed all the instructions mentioned in the README page. I am getting the warning below, and then the job started failing.
[WARNING] Avro: Invalid default for field metadata: "null" not a ["null"

java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$$anonfun$1$$anonfun$apply$12.apply(TypeCoercion.scala:107)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$$anonfun$1$$anonfun$apply$12.apply(TypeCoercion.scala:102)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$$anonfun$1.apply(TypeCoercion.scala:102)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$$anonfun$1.apply(TypeCoercion.scala:82)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findWiderTypeForTwo(TypeCoercion.scala:150)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$IfCoercion$$anonfun$apply$9.applyOrElse(TypeCoercion.scala:649)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$IfCoercion$$anonfun$apply$9.applyOrElse(TypeCoercion.scala:645)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:258)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:258)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:293)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveExpressions$1.applyOrElse(LogicalPlan.scala:80)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveExpressions$1.applyOrElse(LogicalPlan.scala:79)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveExpressions(LogicalPlan.scala:79)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$IfCoercion$.apply(TypeCoercion.scala:645)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$IfCoercion$.apply(TypeCoercion.scala:644)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:164)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:170)
at org.apache.spark.sql.Dataset.mapPartitions(Dataset.scala:2253)
at za.co.absa.abris.avro.serde.AvroDecoder.fromAvroToRow(AvroDecoder.scala:313)
at za.co.absa.abris.avro.serde.AvroDecoder.fromAvroToRow(AvroDecoder.scala:326)
at za.co.absa.abris.avro.AvroSerDe$StreamDeserializer.fromAvro(AvroSerDe.scala:127)
... 52 elided

Code:

export JAR_FILES=/home/madhuhad04/config-1.3.1.jar,/home/madhuhad04/spark-sql-kafka-0-10_2.11-2.2.0.jar,/home/madhuhad04/spark-streaming-kafka-0-10-assembly_2.11-2.2.1.jar,/home/madhuhad04/abris_2.11-2.2.2.jar,/home/madhuhad04/spark-avro_2.11-4.0.0.jar

export SPARK_MAJOR_VERSION=2

spark-shell --verbose --conf spark.ui.port=4096 --jars ${JAR_FILES}

import za.co.absa.abris.avro.AvroSerDe._
import za.co.absa.abris.avro.read.confluent.SchemaManager
import za.co.absa.abris.avro.schemas.policy.SchemaRetentionPolicies._

import org.apache.spark.sql.avro._

val stream = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "test-topic").fromAvro("value", "/db/madhuhad04/test.avsc")(RETAIN_SELECTED_COLUMN_ONLY)

Can you please help me to resolve this issue.

Dependency conflict for Spark 2.4.*

Hi,

I have the following NoSuchMethodError when using Spark 2.4.3, Scala 2.11, Avro 1.8.2 and abris_2_11 3.0.0 and attempting to call from_confluent_avro()

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.boxedType(Lorg/apache/spark/sql/types/DataType;)Ljava/lang/String;
	at za.co.absa.abris.avro.sql.AvroDataToCatalyst$$anonfun$doGenCode$1.apply(AvroDataToCatalyst.scala:75)
	at za.co.absa.abris.avro.sql.AvroDataToCatalyst$$anonfun$doGenCode$1.apply(AvroDataToCatalyst.scala:73)
	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.nullSafeCodeGen(Expression.scala:437)
	at za.co.absa.abris.avro.sql.AvroDataToCatalyst.doGenCode(AvroDataToCatalyst.scala:73)
	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
	at org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:155)
	at org.apache.spark.sql.execution.ProjectExec$$anonfun$6.apply(basicPhysicalOperators.scala:60)
	at org.apache.spark.sql.execution.ProjectExec$$anonfun$6.apply(basicPhysicalOperators.scala:60)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:60)
	at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:189)
	at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:374)
	at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:403)
	at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:90)
	at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:85)
	at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:374)
	at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:45)
	at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:90)
	at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:85)
	at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:35)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:544)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:598)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.buildBuffers(InMemoryRelation.scala:83)
	at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.cachedColumnBuffers(InMemoryRelation.scala:59)
	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.filteredCachedBatches(InMemoryTableScanExec.scala:276)
	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.inputRDD$lzycompute(InMemoryTableScanExec.scala:105)
	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.inputRDD(InMemoryTableScanExec.scala:104)
	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.doExecute(InMemoryTableScanExec.scala:310)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:747)
	at io.mbition.dc.Job.main(Job.java:114)

Dependency conflict when using with spark 2.3.x

This occurs when using version 3.0.0 of ABRiS in Scala.

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
	at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:66)
	at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
	at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
	at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81)
	at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
	at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
	at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81)
	at org.apache.spark.sql.avro.SchemaConverters$.toSqlType(SchemaConverters.scala:46)
	at za.co.absa.abris.avro.format.SparkAvroConversions$.toSqlType(SparkAvroConversions.scala:106)
	at za.co.absa.abris.avro.sql.AvroDataToCatalyst.dataType$lzycompute(AvroDataToCatalyst.scala:43)
	at za.co.absa.abris.avro.sql.AvroDataToCatalyst.dataType(AvroDataToCatalyst.scala:43)
	at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:163)
	at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:51)
	at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:51)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:51)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:157)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
	at 

scalac: error while loading <root>

Hi,
I'm using Scala 2.11 and Spark 2.4.3.
While running the project I'm getting this error:

Error: scalac: error while loading <root>, Error accessing /home/dev/.m2/repository/io/confluent/kafka-avro-serializer/5.1.0/kafka-avro-serializer-5.1.0.jar

Please help me resolve this.

How to debug org.apache.kafka.common.errors.SerializationException

I have been trying to read a Confluent Avro message from Kafka using Spark Streaming and I keep getting "org.apache.kafka.common.errors.SerializationException". I'm not sure how to debug it further to understand which field is causing the issue. Or is there a way I can bypass schema validation, just deserialize the byte array, and print it?

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import za.co.absa.abris.avro.read.confluent.SchemaManager

object ConfluentKafkaAvroReaderWithKey extends App {


  private val topic = "test-topic"
  //
  private val kafkaUrl = "localhost:9092"
  private val schemaRegistryUrl = "http://localhost:8081"


  val spark = SparkSession.builder.master("local[*]").appName("TestKafkaRead")
    .getOrCreate()
  spark.sparkContext.setLogLevel("WARN")

 
  val kafkaDataFrame = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", kafkaUrl)
    .option("subscribe", topic)
    .option("startingOffsets", "latest")
    .load()


  val deserialized = configureExample(kafkaDataFrame)

  // YOUR OPERATIONS CAN GO HERE

  deserialized.printSchema()

  deserialized
    .writeStream
    .format("console")
    .option("truncate", "false")
    .start()
    .awaitTermination()


  private def configureExample(dataFrame: DataFrame): Dataset[Row] = {

    val commonRegistryConfig = Map(
      SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> "test-qa",
      SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> schemaRegistryUrl
    //  SchemaManager.PARAM_SCHEMA_NAME_FOR_RECORD_STRATEGY -> "Envelope",
     // SchemaManager.PARAM_SCHEMA_NAMESPACE_FOR_RECORD_STRATEGY -> "challenges2_slave.challenges2_qa.challenges"
    )

    val valueRegistryConfig = commonRegistryConfig ++ Map(
      SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> "topic.name",
      SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest"
    )

    val keyRegistryConfig = commonRegistryConfig ++ Map(
      SchemaManager.PARAM_KEY_SCHEMA_NAMING_STRATEGY -> "topic.name",
      SchemaManager.PARAM_KEY_SCHEMA_ID -> "latest"
    )

    import za.co.absa.abris.avro.functions.from_confluent_avro


    dataFrame.select(
    //  from_confluent_avro(col("key"), keyRegistryConfig) as 'key,
      from_confluent_avro(col("value"), valueRegistryConfig) as 'value)
      .select("value.after.*")

  }

}

maven

I'm interested in trying your library, but I can't find a Maven entry for abris. I also don't see any directions on using the library other than through Maven.

java.util.NoSuchElementException: key not found: schema.name with TOPIC_NAME strategy

Using the following config, which I thought should not require schema.name to be set, but I'm getting the error below anyway.

val commonRegistryConfig: Map[String, String] = Map(
        SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> SchemaRegistryUrl,
        SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> OutputTopic
    )
    
  val keyRegistryConfig: Map[String, String] = commonRegistryConfig +
    (
        SchemaManager.PARAM_KEY_SCHEMA_NAMING_STRATEGY -> SchemaManager.SchemaStorageNamingStrategies.TOPIC_NAME,
        SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest")

  val valueRegistryConfig: Map[String, String] = commonRegistryConfig +
    (
        SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> SchemaManager.SchemaStorageNamingStrategies.TOPIC_NAME,
        SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest")
java.util.NoSuchElementException: key not found: schema.name
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at za.co.absa.abris.avro.functions$.to_confluent_avro(functions.scala:140)

to_confluent_avro low performance + warnings: Schema Registry client is already configured.

I'm consuming messages from a Kafka topic in Confluent Cloud using Structured Streaming and ABRiS 3.1.0.
While the messages come through fine, the log is full of the following warnings:

WARN 2019-11-28 13:32:48,137 14137 za.co.absa.abris.avro.read.confluent.SchemaManager [Executor task launch worker for task 1]

Also the performance of this pipeline is quite low: ~10 records/s on a single worker of 4 cores with 14G RAM

Am I doing it the wrong way? Is there any way to avoid this warning, and how does it affect the performance?

Here's the code reading from in topic, decoding/encoding, and writing to out topic:

  val schemaRegistryConfigIn = Map(
    SchemaManager.PARAM_SCHEMA_REGISTRY_URL          -> "...",
    SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC        -> "in",
    SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> SchemaManager.SchemaStorageNamingStrategies.TOPIC_NAME,
    SchemaManager.PARAM_VALUE_SCHEMA_ID              -> "latest", // set to "latest" if you want the latest schema version to used
    "basic.auth.credentials.source"                  -> "USER_INFO",
    "schema.registry.basic.auth.user.info"           -> "..."
  )
  val schemaRegistryConfigOut = schemaRegistryConfigIn + (SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> "out")

  val inputDF = spark.readStream
    .format("kafka")
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.request.timeout.ms", "20000")
    .option("kafka.bootstrap.servers", broker)
    .option("kafka.retry.backoff.ms", "500")
    .option(
      "kafka.sasl.jaas.config",
      "..."
    )
    .option("kafka.security.protocol", "SASL_SSL")
    .option("subscribe", inputTopic)
    .option("startingOffsets", startingOffsetsValue)
    .load()

  val outputDF = inputDF
    .select(from_confluent_avro(col("value"), schemaRegistryConfigIn) as "parsed_message")
    .select("parsed_message.*")
    .select(
      to_confluent_avro(struct("firstname", "lastname", "country"), schemaRegistryConfigOut) as "value"
    )

  val query = outputDF.writeStream
    .format("kafka")
    .option("checkpointLocation", pathCheckpoint)
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.request.timeout.ms", "20000")
    .option("kafka.bootstrap.servers", broker)
    .option("kafka.retry.backoff.ms", "500")
    .option(
      "kafka.sasl.jaas.config",
      "..."
    )
    .option("kafka.security.protocol", "SASL_SSL")
    .option("topic", outputTopic)
    .start()

Similar to 'from_json(Column)' SparkSQL UDF, implement 'from_confluent_avro(Column)'

From the earlier issues, I saw that both keys and values should be able to be deserialized as Avro (I think this was the initial design, or only the values are extracted? Didn't dig too deep there).

Then there was discussion about keys not being Avro while the values are Avro, which led to funky classes like ConfluentKafkaAvroWriterWithPlainKey.

As per the discussion in #6, I mentioned: what if I have a "plain value" and an Avro key? What if my keys or values are some other datatype, not just strings, and the other field is Avro?

Rather than create one-off methods for each combination of supported Avro data-types, I feel like a better way to implement encoders/decoders would be to create individual UDF-type functions such as the existing from_json(Column, Map[String, String]) to to_json(Column, Map[String, String]) functions in Spark, where for Avro support, those option maps would include at least the schema registry url.

From a usability perspective, I would expect this to work if I was doing "word count" on a topic with an Avro key.

df.select(from_confluent_avro(col("key")), col("value").cast("int"))

pyspark issue during deserialization

My code works fine in my local environment, but the same code fails with the error below in my cluster environment.
Let me know what I am missing here.

Environment :- Pyspark 2.4.0
Spark session command :-
spark = SparkSession \
    .builder \
    .appName(app_name) \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0,za.co.absa:abris_2.11:3.0.2,mysql:mysql-connector-java:8.0.16,org.apache.avro:avro:1.9.1') \
    .config('spark.jars.repositories', 'http://packages.confluent.io/maven') \
    .enableHiveSupport() \
    .getOrCreate()

Error :-
Caused by: org.apache.spark.SparkException: Malformed records are detected in record parsing.
at za.co.absa.abris.avro.sql.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:70)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at za.co.absa.abris.avro.sql.AvroDataToCatalyst.getWriterSchema(AvroDataToCatalyst.scala:107)
at za.co.absa.abris.avro.sql.AvroDataToCatalyst.decodeConfluentAvro(AvroDataToCatalyst.scala:101)
at za.co.absa.abris.avro.sql.AvroDataToCatalyst.decode(AvroDataToCatalyst.scala:83)
at za.co.absa.abris.avro.sql.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:61)
... 16 more

Python binding results in RuntimeException from org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType

We are calling your library from Python as described in this StackOverflow post.

I have changed the function to use the schema_path and it looks as follows:

def expand_avro_local(spark_context, sql_context, data_frame, avsc_file_path ):
    j = spark_context._gateway.jvm
    dataframe_deserializer = j.za.co.absa.abris.avro.AvroSerDe.DataframeDeserializer(data_frame._jdf)
    schemaPath = j.scala.Some.apply(avsc_file_path)
    conf = j.scala.Option.apply(None)
    policy = getattr(j.za.co.absa.abris.avro.schemas.policy.SchemaRetentionPolicies, "RETAIN_SELECTED_COLUMN_ONLY$")()
    data_frame = dataframe_deserializer.fromConfluentAvro("value", schemaPath, conf, policy)
    data_frame = DataFrame(data_frame, sql_context)
    return data_frame
    
raw_df = apply_kafka_settings(spark_session.read).load().limit(100).cache()

raw_df.schema # StructType(List(StructField(value,BinaryType,true)))

extract_df = expand_avro_local(sc, spark_session, raw_df, transaction_schema_path)

extract_df.schema # StructType(List(StructField(operations,ArrayType(BinaryType,false),false)))

extract_df.count() # 100

extract_df.show()

Everything up to the last line works, but then the following exception is thrown:

19/06/26 13:43:24 ERROR Executor: Exception in task 0.0 in stage 31.0 (TID 16)
java.lang.RuntimeException: java.nio.HeapByteBuffer is not a valid external type for schema of binary
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.serializefromobject_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
19/06/26 13:43:24 WARN TaskSetManager: Lost task 0.0 in stage 31.0 (TID 16, localhost, executor driver): java.lang.RuntimeException: java.nio.HeapByteBuffer is not a valid external type for schema of binary
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.serializefromobject_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

This error seems super weird to me, because a ByteBuffer should be castable to a binary field.
I found that the check happens in org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType but somehow fails.

Does anyone have an idea what might be causing this? Am I doing the python binding wrong somehow, or is it a deeper issue? Thanks!

Too much logging from the app

Hello Abris team, We're using your API in our streaming app. It's really helpful, thanks for it. We're using "za.co.absa:abris_2.11:3.0.3".

However, we're seeing huge logs being stored in the /tmp location. We don't want to change anything at the base Spark config level, since this is the only application causing the issue and the others are fine.
Can you help us with how to suppress these logs at the application level?

99% of the data is this:

20/02/12 18:53:52 INFO serializers.KafkaAvroDeserializerConfig: KafkaAvroDeserializerConfig values:
schema.registry.url = [http://caution-pc-sr.sitescout.com:8091]
basic.auth.user.info = [hidden]
auto.register.schemas = true
max.schemas.per.subject = 1000
basic.auth.credentials.source = URL
schema.registry.basic.auth.user.info = [hidden]
specific.avro.reader = false
value.subject.name.strategy = class io.confluent.kafka.serializers.subject.TopicNameStrategy
key.subject.name.strategy = class io.confluent.kafka.serializers.subject.TopicNameStrategy

482 [GB] /tmp/logs/x/logs/application_1580249944747_7589
483 [GB] /tmp/logs/x/logs/application_1580249944747_7778
486 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4377
488 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_7204
559 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4172
563 [GB] /tmp/logs/x/logs/application_1580249944747_6584
568 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_6157
573 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_7767
577 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4299
577 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4520
577 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_7960
579 [GB] /tmp/logs/x/logs/application_1580249944747_6489
579 [GB] /tmp/logs/x/logs/application_1580249944747_7133
582 [GB] /tmp/logs/x/logs/application_1580249944747_7567
595 [GB] /tmp/logs/x/logs/application_1580249944747_4432
597 [GB] /tmp/logs/x/logs/application_1580249944747_5642
597 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_6914
598 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_6856
599 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4348
600 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4919
600 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_6500
601 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_6796
602 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_7229
602 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_7297
602 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_7442
603 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_3780
603 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_3810
603 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_7509
604 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_7366
607 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4642
607 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5042
607 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_7631
608 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4648
608 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4657
608 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4698
608 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4711
608 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4719
608 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4763
608 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5684
608 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5703
608 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5708
608 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5749
608 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_6691
609 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4224
609 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5767
609 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5776
609 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5817
609 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5828
609 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5842
609 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5889
609 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5896
609 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5916
609 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_5955
609 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_6555
610 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4233
610 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4268
610 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4289
610 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4357
611 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4106
611 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4441
611 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4827
611 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_4881
611 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_7833
611 [GB] /tmp/logs/bdbi_preprod_dev/logs/application_1580249944747_7871
612 [GB] /tmp/logs/x/logs/application_1580249944747_4818
612 [GB] /tmp/logs/x/logs/application_1580249944747_4846
612 [GB] /tmp/logs/x/logs/application_1580249944747_6173
612 [GB] /tmp/logs/x/logs/application_1580249944747_6509

library dependencies error

Hey guys, there's trouble with the following library dependency error.

Could you check and/or update ABRiS to fix it?

sparkVersion = "2.4.4"
scalaVersion := "2.12.10"
sbt.version=1.3.3

[error] Caused by: com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.9.9-3

Question about `AvroDataToCatalyst.decodeConfluentAvro` behavior

I'd like to ask whether providing a schema id / "latest" is really necessary when reading Confluent Avro.

In https://github.com/AbsaOSS/ABRiS/blob/master/src/main/scala/za/co/absa/abris/avro/sql/AvroDataToCatalyst.scala#L104 it is required to provide a reader schema.

But when you inspect the Confluent Avro deserializer you stumble upon https://github.com/confluentinc/schema-registry/blob/master/avro-serializer/src/main/java/io/confluent/kafka/serializers/AbstractKafkaAvroDeserializer.java#L190-L195, which seems to use the writer schema as the reader schema whenever a reader schema is not provided.

The reason I'm asking is that I do not want to tell my Spark Streaming jobs which reader schema to use (by providing a schema id or "latest"); I'd like to use the schema id that is embedded in the Confluent Avro payload itself.
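
For context, a minimal sketch of how the writer schema id is carried inside the payload (this is the standard Confluent wire format, not ABRiS code; the function name is made up for illustration):

import java.nio.ByteBuffer

// Confluent wire format: byte 0 is the magic byte (0x0), bytes 1-4 hold the writer
// schema id as a big-endian int, and the Avro binary body follows
def extractWriterSchemaId(confluentPayload: Array[Byte]): Int = {
  val buffer = ByteBuffer.wrap(confluentPayload)
  require(buffer.get() == 0, "unknown magic byte")
  buffer.getInt() // the schema id the producer used when writing the record
}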

Memory leak due to usage of IdentityHashMap in CachedSchemaRegistryClient

ABRiS/pom.xml

Line 59 in ec704f5

<confluent.version>5.1.0</confluent.version>

Hi,
I have been using this library and it's quite handy & useful. However, we observed a critical bug caused by the usage of IdentityHashMap in CachedSchemaRegistryClient in this schema-registry version. Confluent has already fixed this issue by using a HashMap, so you will have to upgrade the confluent.version to overcome it.
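
For downstream users who cannot wait for a new ABRiS release, a hypothetical sbt-level workaround (the target version is only an example; any client release in which CachedSchemaRegistryClient no longer uses an IdentityHashMap should do):

// hypothetical build.sbt snippet: force a newer schema registry client than the
// one pulled in transitively
resolvers += "Confluent" at "https://packages.confluent.io/maven/"
dependencyOverrides += "io.confluent" % "kafka-schema-registry-client" % "5.3.4"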

Reference
Issue ticket
source code

Error Trace
java.lang.IllegalStateException: Too many schema objects created for topic-value!
	at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.register(CachedSchemaRegistryClient.java:153)
	at za.co.absa.abris.avro.read.confluent.SchemaManager$.register(SchemaManager.scala:199)
	at za.co.absa.abris.avro.parsing.utils.AvroSchemaUtils$.registerIfCompatibleSchema(AvroSchemaUtils.scala:95)
	at za.co.absa.abris.avro.parsing.utils.AvroSchemaUtils$.registerIfCompatibleValueSchema(AvroSchemaUtils.scala:79)
	at za.co.absa.abris.avro.parsing.utils.AvroSchemaUtils$.registerSchema(AvroSchemaUtils.scala:132)
	at za.co.absa.abris.avro.sql.CatalystDataToAvro.za$co$absa$abris$avro$sql$CatalystDataToAvro$$registerSchema(CatalystDataToAvro.scala:76)
	at za.co.absa.abris.avro.sql.CatalystDataToAvro$$anonfun$1.apply(CatalystDataToAvro.scala:44)
	at za.co.absa.abris.avro.sql.CatalystDataToAvro$$anonfun$1.apply(CatalystDataToAvro.scala:43)
	at scala.Option.flatMap(Option.scala:171)
	at za.co.absa.abris.avro.sql.CatalystDataToAvro.nullSafeEval(CatalystDataToAvro.scala:43)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Read from multiple topics with multiple schemas

Hi,

I'm a beginner in Scala development & big data streaming. I see that most of your examples show how to read & deserialize from one topic; does the library support multiple topics, or will I have to find a way to do that myself?
Thank you.

Invalid basic auth property names

The example basic auth configuration properties for the schema registry client are invalid and have no effect when set:

val securityRegistryConfig = valueRegistryConfig + 
  ("client.basic.auth.credentials.source" -> "USER_INFO",
   "client.schema.registry.basic.auth.user.info" -> "srkey:srvalue")

Those should be:

val securityRegistryConfig = valueRegistryConfig + 
  ("basic.auth.credentials.source" -> "USER_INFO",
   "basic.auth.user.info" -> "srkey:srvalue")

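For completeness, a minimal sketch of where the corrected keys sit (registry URL, topic and schema id values are placeholders, not taken from the original report):

val valueRegistryConfig = Map(
  SchemaManager.PARAM_SCHEMA_REGISTRY_URL   -> "http://localhost:8081",
  SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> "topic123",
  SchemaManager.PARAM_VALUE_SCHEMA_ID       -> "latest"
)

// the auth properties go into the same map, with the un-prefixed names
val securityRegistryConfig = valueRegistryConfig +
  ("basic.auth.credentials.source" -> "USER_INFO",
   "basic.auth.user.info" -> "srkey:srvalue")
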
Cannot read key dataframe column using confluent schema registry

The dataframe extension method fromConfluentAvro has it hard-coded that the schema is for the value, which means we can't easily read the key field if it is also in Avro format.

Please expose a mechanism to read key Avro columns too

Thanks for the library by the way 👍

Cheers
Andy
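
For reference, the current configurator API can target the key subject directly; a minimal sketch (topic name and registry URL are placeholders, and the isKey flag of andTopicNameStrategy is assumed to select the "<topic>-key" subject):

import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.functions.from_avro
import za.co.absa.abris.config.AbrisConfig

// reader config for the key column: isKey = true resolves "<topic>-key"
val keyAbrisConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("topic123", isKey = true)
  .usingSchemaRegistry("http://localhost:8081")

// deserialize the Kafka key while keeping the raw value column untouched
val withKey = dataFrame.select(
  from_avro(col("key"), keyAbrisConfig) as 'parsedKey,
  col("value"))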

`internalCreateDataFrame` error

Hi, I'm trying to use version 2.2.2 with Spark 2.2.1 and am running into this error:

java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.internalCreateDataFrame(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/sql/types/StructType;Z)Lorg/apache/spark/sql/Dataset

I can't share all of the code but I think this is the relevant part:

 val SchemaRegistryConf = Map(
    SchemaManager.PARAM_SCHEMA_REGISTRY_URL ->
      "redacted",
    SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> Topic,
    SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY ->
      SchemaManager.SchemaStorageNamingStrategies.TOPIC_NAME,
    SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest"
  )

  val logger = Logger(LoggerFactory.getLogger("ProgramTrackFilter"))

  def getStream(spark: SparkSession, bootstrap_servers: String): DataFrame = {
    spark.readStream
      .format(KafkaSource)
      .option("kafka.bootstrap.servers", BootstrapServers)
      .option("subscribe", Topic)
      .fromConfluentAvro("value", None, Some(SchemaRegistryConf))(
        RETAIN_ORIGINAL_SCHEMA
      )
  }

  /* Spark Launcher
   * @param args
   */
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()

    val df = getStream(spark, BootstrapServers)

    df.printSchema()

    val q =
      df.select("*").writeStream.outputMode("append").format("console").start()

    q.awaitTermination()
  }

The output of printSchema looks correct - I believe I'm getting the right schema. The kafka settings are all correct as well.

Is there any other information I can provide short of actually providing the schema and data (which I can't do, unfortunately)?

More of the stacktrace:

9/02/08 16:28:55 ERROR StreamExecution: Query [id = 7b47f349-a947-49b2-81ab-c4de6eb928f9, runId = f1fea6a5-b2cb-4261-aa35-678970a323a4] terminated with error
java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.internalCreateDataFrame(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/sql/types/StructType;Z)Lorg/apache/spark/sql/Dataset;
        at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:301)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$7.apply(StreamExecution.scala:614)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$7.apply(StreamExecution.scala:610)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:610)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:610)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:609)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:306)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294)
        at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
        at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:290)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:206)
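
Not a definitive diagnosis, but this particular NoSuchMethodError usually points at a spark-sql-kafka-0-10 artifact built for a newer Spark than the 2.2.1 runtime; as far as I can tell, the three-argument internalCreateDataFrame only exists from Spark 2.3 onward. A hypothetical sbt sketch that keeps the connector and the runtime aligned:

// hypothetical build.sbt: the Kafka source must match the Spark version at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % "2.2.1" % Provided,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.1"
)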

supplying kafka key in toConfluentAvro

Hey.
Is there any way to specify the Kafka key when using the toConfluentAvro method?
As far as I can see, this method converts the DF into a DF with only a 'value' column, which makes it impossible to populate the Kafka key from one of the input columns. The same seems to happen in fromConfluentAvro - the key column isn't preserved.
Is there any workaround for this?
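
For what it's worth, with the column-based expressions the key and value can be serialized independently and kept as separate columns for the Kafka sink. A minimal sketch (topic, registry URL and column names are placeholders; the isKey flag is assumed to pick the key subject):

import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.functions.to_avro
import za.co.absa.abris.config.AbrisConfig

val keyConfig = AbrisConfig.toConfluentAvro
  .downloadSchemaByLatestVersion
  .andTopicNameStrategy("topic123", isKey = true)
  .usingSchemaRegistry("http://localhost:8081")

val valueConfig = AbrisConfig.toConfluentAvro
  .downloadSchemaByLatestVersion
  .andTopicNameStrategy("topic123")
  .usingSchemaRegistry("http://localhost:8081")

// both columns survive the select, so the Kafka writer can use the key for partitioning
val serialized = dataFrame.select(
  to_avro(col("myKey"), keyConfig)     as 'key,
  to_avro(col("myValue"), valueConfig) as 'value)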

`fromConfluentAvro` with RETAIN_ORIGINAL_SCHEMA not printing property/key names

Hello! I'm attempting to use this with confluent schema registry hosted on aiven.io, and I have everything working (using my PR here), but my output when using RETAIN_ORIGINAL_SCHEMA only contains the values of my record, not the keys/properties.

For example, sending a record (registered in schema registry) with:

{
  "meta": {
    "uuid": "uuid: 0.9538...",
    "emitted": "156484..."
  },
  "x": 3,
  "y": 5.0,
  "foo": null
}

yields

+----+----------------------------------------------------+-------------+---------+------+-----------------------+-------------+
|key |value                                               |topic        |partition|offset|timestamp              |timestampType|
+----+----------------------------------------------------+-------------+---------+------+-----------------------+-------------+
|null|[[uuid: 0.9538167490250689, 1564849980152], 3, 5.0,]|example-event|0        |0     |2019-08-03 12:33:01.307|0            |
+----+----------------------------------------------------+-------------+---------+------+-----------------------+-------------+
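
One note that may or may not apply here: Spark's show() always renders struct columns as bare value tuples without field names; the property names are still present in the schema and reachable by selecting nested fields. A minimal sketch, assuming df is the dataframe printed above and value is the deserialized struct column:

// the record's field names are visible in the schema even though show() hides them
df.printSchema()

// nested fields can be selected by name once the value has been deserialized
df.select("value.meta.uuid", "value.x", "value.y", "value.foo").show(false)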

confluent avro conversion failure

val commonRegistryConfig = Map(
  SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> "test",
  SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> "http://testkafkaschema.jganalytics.local",
  SchemaManager.PARAM_SCHEMA_NAME_FOR_RECORD_STRATEGY -> "test",
  SchemaManager.PARAM_SCHEMA_NAMESPACE_FOR_RECORD_STRATEGY -> "sensor.machine.data"
)

val keyRegistryConfig = commonRegistryConfig +
  (SchemaManager.PARAM_KEY_SCHEMA_NAMING_STRATEGY -> SchemaManager.SchemaStorageNamingStrategies.TOPIC_RECORD_NAME,
   SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest")

val valueRegistryConfig = commonRegistryConfig +
  (SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> SchemaManager.SchemaStorageNamingStrategies.TOPIC_RECORD_NAME,
   SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest")

import za.co.absa.abris.avro.functions.to_confluent_avro

SchemaManager.configureSchemaRegistry(commonRegistryConfig)

val result: DataFrame = df
  .selectExpr("serialno as key", "to_json(struct(*)) as value")
  .select(to_confluent_avro(col("key"), keyRegistryConfig) as 'key,
          to_confluent_avro(col("value"), valueRegistryConfig) as 'value)

ERROR 2020-02-26 01:01:25,156 7247 org.apache.spark.executor.Executor [Executor task launch worker for task 0] Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodError: org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema;
at org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185)
at za.co.absa.abris.avro.sql.SchemaProvider$$anonfun$apply$2.apply(SchemaProvider.scala:91)
at za.co.absa.abris.avro.sql.SchemaProvider$$anonfun$apply$2.apply(SchemaProvider.scala:87)
at za.co.absa.abris.avro.sql.SchemaProvider.lazyLoadSchemas(SchemaProvider.scala:41)
at za.co.absa.abris.avro.sql.SchemaProvider.originalSchema(SchemaProvider.scala:53)
at za.co.absa.abris.avro.sql.CatalystDataToAvro.serializer$lzycompute(CatalystDataToAvro.scala:41)
at za.co.absa.abris.avro.sql.CatalystDataToAvro.serializer(CatalystDataToAvro.scala:40)
at za.co.absa.abris.avro.sql.CatalystDataToAvro.nullSafeEval(CatalystDataToAvro.scala:44)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:256)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
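
Not a definitive diagnosis, but a NoSuchMethodError on the varargs org.apache.avro.Schema.createUnion usually means an older Avro (1.7.x) is on the classpath; as far as I can tell, spark-avro's SchemaConverters needs the Avro 1.8+ signature. A hypothetical sbt-level workaround:

// hypothetical build.sbt: force an Avro version that has Schema.createUnion(Schema...)
dependencyOverrides += "org.apache.avro" % "avro" % "1.8.2"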

java.util.NoSuchElementException: key not found: value.schema.id

I have tried ABRiS on spark-shell as follows (on Spark 2.3.1, Confluent 4.1.0):

Invoking spark-shell:

spark-shell \
--repositories "https://packages.confluent.io/maven/" \
--packages "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1,io.confluent:kafka-avro-serializer:4.1.1,za.co.absa:abris:2.0.0"

spark-shell script (formatted for reader):

import za.co.absa.abris.avro.AvroSerDe._
import za.co.absa.abris.avro.read.confluent.SchemaManager
import za.co.absa.abris.avro.schemas.policy.SchemaRetentionPolicies._

val schemaRegistryConfs = Map(
    SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> "http://schema-registry:8081/", 
    SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> "topic1"
)
val df = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka1:9092")
    .option("subscribe", "topic1")
    .fromConfluentAvro("value", None, Some(schemaRegistryConfs))(RETAIN_SELECTED_COLUMN_ONLY)

But it produces the following error:

java.util.NoSuchElementException: key not found: value.schema.id
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:59)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:59)
  at za.co.absa.abris.avro.schemas.SchemaLoader$.loadFromSchemaRegistry(SchemaLoader.scala:65)
  at za.co.absa.abris.avro.parsing.utils.AvroSchemaUtils$.load(AvroSchemaUtils.scala:50)
  at za.co.absa.abris.avro.serde.AvroToRowEncoderFactory$.createRowEncoder(AvroToRowEncoderFactory.scala:44)
  at za.co.absa.abris.avro.serde.AvroDecoder.fromConfluentAvroToRow(AvroDecoder.scala:58)
  at za.co.absa.abris.avro.AvroSerDe$StreamDeserializer.fromConfluentAvro(AvroSerDe.scala:154)
  ... 53 elided

What would be the reason for this problem?
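
Not certain this is the whole story, but the missing "value.schema.id" key in the stack trace looks like the configuration entry exposed as SchemaManager.PARAM_VALUE_SCHEMA_ID in other reports here; a sketch of the same map with that entry added ("latest" is an assumption):

val schemaRegistryConfs = Map(
  SchemaManager.PARAM_SCHEMA_REGISTRY_URL   -> "http://schema-registry:8081/",
  SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> "topic1",
  SchemaManager.PARAM_VALUE_SCHEMA_ID       -> "latest" // the entry the loader could not find
)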

Decode each individual payload using its respective schema id from registry

Is this possible with ABRiS? I am currently using the latest schema for payloads with differing schema ids, and the result is a "Malformed records" exception from the Spark deserializer.

Caused by: org.apache.spark.SparkException: Malformed records are detected in record parsing.
	at za.co.absa.abris.avro.sql.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:65)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:216)
	at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:108)
	at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

single field avro schemas being incorrectly unwrapped?

I have an avro schema that consists of a single field (not my choice, someone else's that I need to publish to) and I think it may be getting incorrectly "unwrapped" to a bare string field here: https://github.com/AbsaOSS/ABRiS/blob/master/src/main/scala/za/co/absa/abris/avro/sql/SchemaProvider.scala#L74

This gives me an error:

org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(some_field,StringType,false)) to Avro type "string".
at org.apache.spark.sql.avro.AvroSerializer.newStructConverter(AvroSerializer.scala:205)
{
  "type": "record",
  "name": "key",
  "fields": [
    {
      "name": "some_field",
      "type": "string"
    }
  ]
}

but this seems to work:

{
  "type": "record",
  "name": "wrapper",
  "fields": [
    {
      "name": "wrapped",
      "type": 
        {
          "type": "record",
          "name": "key",
          "fields": [
            {
              "name": "some_field",
              "type": "string"
            }
          ]
        }
      }
    ]
}

AvroSerdewithKeyColumn does not support NULL key

We are facing the following issue: we want to write data to Kafka based on a partitioning key that can be null, but the job crashes because it expects a non-nullable key.

Is this expected behavior, or can a pull request be made to change this?

A quick fix would be to filter out the records with a null key on the producer side, but then the consumer side has no information about how much data is being dropped.

Another fix would be to replace the null value with a wildcard or hash to keep everything evenly partitioned, but this is also not desired behavior.

Is there a possibility to change the toConfluentAvroWithPlainKey() in order for it to allow nullable keys?

Thanks!

Spark Streaming - terminated with exception: Could not get the id of the latest version for subject

Hi, I am facing an exception while trying to get the latest schema version for a subject, even though the subject exists, when using Spark Streaming. It does not happen always, but I believe it occurs when the schema registry server is momentarily down for just a few seconds/milliseconds.

I am thinking of adding retry logic to overcome the issue. Here are a few of the exceptions around it.

terminated with exception: Could not get the id of the latest version for subject "dummy_topic-value"

terminated with exception: Job aborted due to stage failure: Task 2 in stage 23483.0 failed 4 times, most recent failure: Lost task 2.3 in stage 23483.0 (TID 122887,, executor 105): avro.read.SchemaManagerException: Could not get the id of the latest version for subject 'dummy_topic-value'
My adjusted code change is below. Please suggest if there is a better way.

SchemaManager.Scala

    def getLatestVersionId(subject: String): Int = {
      logDebug(s"Trying to get latest schema version id for subject '$subject'")
      throwIfClientNotConfigured()

      getIDRetry(3) { schemaRegistryClient.getLatestSchemaMetadata(subject).getId }

/*      Try(schemaRegistryClient.getLatestSchemaMetadata(subject).getId) match {
        case Success(id) => id
        case Failure(e) => throw new SchemaManagerException(
          s"Could not get the id of the latest version for subject '$subject'", e)
      }*/
    }

  @annotation.tailrec
  def getIDRetry[T](n: Int)(fn: => T): T = {
    util.Try { fn } match {
      case util.Success(x) => x
      case _ if n > 1 =>
        // wait and retry with one fewer attempt remaining
        println("Tried fetching latest schema id, " + n + " attempts remaining")
        Thread.sleep(10000)
        getIDRetry(n - 1)(fn)
      case util.Failure(e) => throw e
    }
  }

OutOfMemory concern

Thank you for your implementation, which fills the gap in Spark-Kafka integration using Avro.

private def getBatchData() = dataframe.select("value").cache()

Just curious, is there any reason caching is needed here? This may lead to unexpected caching and/or memory issues.

We usually avoid caching RDDs/DataFrames, especially for large datasets, and only cache/persist the results of heavy computations when the resulting dataset is small or of a reasonable size.
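
Purely for illustration of the pattern being described (hypothetical names and data, not ABRiS code): persist only a small, reused result and release it explicitly:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// hypothetical batch and a small aggregation over it
val batch = spark.range(1000000).map(i => ("key" + (i % 10), i)).toDF("key", "value")
val summary = batch.groupBy("key").count().persist(StorageLevel.MEMORY_AND_DISK)

summary.show()      // first use materializes the cache
summary.count()     // second use is served from the cache
summary.unpersist() // release it once it is no longer needed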

ArrayIndexOutOfBoundsException with simple read

Hey, guys.

I've encountered the problem below; could anyone help me, please?

Code:

val kafkaDataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", topicName)
      .option("startingOffsets", "earliest")
      .option("checkpointLocation", "/tmp/")
      .option("maxOffsetsPerTrigger", 20)  //remove for prod      
      .fromAvro("value", schemaRegistryConfs)(RETAIN_SELECTED_COLUMN_ONLY) 

Avro Schema

{"type":"record","name":"VIP_TG8izMiYViu7Ftoq_0000000000","fields":[{"name":"f1","type":["null","string"]},{"name":"f21","type":["null","string"]},{"name":"f30","type":["null","string"]},{"name":"f33","type":["null","string"]},{"name":"f31","type":["null","string"]},{"name":"f34","type":["null","string"]},{"name":"f11","type":["null","string"]},{"name":"f16","type":["null","string"]},{"name":"f17","type":["null","string"]},{"name":"f18","type":["null","string"]},{"name":"f22","type":["null","string"]},{"name":"f28","type":["null","string"]},{"name":"f39","type":["null","string"]},{"name":"f14","type":["null","string"]},{"name":"f19","type":["null","string"]},{"name":"f20","type":["null","string"]},{"name":"f25","type":["null","string"]},{"name":"f29","type":["null","string"]},{"name":"f35","type":["null","string"]},{"name":"f5","type":["null","string"]},{"name":"f10","type":["null","string"]},{"name":"f13","type":["null","string"]},{"name":"f23","type":["null","string"]},{"name":"f24","type":["null","string"]},{"name":"f7","type":["null","string"]},{"name":"f36","type":["null","string"]},{"name":"f40","type":["null","string"]},{"name":"f6","type":["null","string"]},{"name":"f12","type":["null","string"]},{"name":"f15","type":["null","string"]},{"name":"f26","type":["null","string"]},{"name":"f37","type":["null","string"]},{"name":"f4","type":["null","string"]},{"name":"f9","type":["null","string"]},{"name":"f27","type":["null","string"]},{"name":"f38","type":["null","string"]},{"name":"f2","type":["null","string"]},{"name":"f3","type":["null","string"]},{"name":"f8","type":["null","string"]},{"name":"f32","type":["null","string"]},{"name":"raw_text","type":["null","string"]},{"name":"_appId","type":"string"},{"name":"_repo","type":"string"},{"name":"testdp_1380542074_timestamp","type":"string"}]}

Exception:

18/08/15 15:25:14 ERROR Utils: Aborting task
java.lang.ArrayIndexOutOfBoundsException: 22610
	at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:402)
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:178)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
	at za.co.absa.abris.avro.serde.AvroToRowConverter.convert(AvroToRowConverter.scala:41)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:362)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:361)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.foreach(WholeStageCodegenExec.scala:612)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:130)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:129)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2.scala:135)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
18/08/15 15:25:14 ERROR Utils: Aborting task
java.lang.ArrayIndexOutOfBoundsException: 6226
	at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:402)
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:178)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
	at za.co.absa.abris.avro.serde.AvroToRowConverter.convert(AvroToRowConverter.scala:41)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:362)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:361)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.foreach(WholeStageCodegenExec.scala:612)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:130)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:129)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2.scala:135)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
18/08/15 15:25:14 ERROR DataWritingSparkTask: Writer for partition 1 is aborting.
18/08/15 15:25:14 ERROR DataWritingSparkTask: Writer for partition 1 aborted.
18/08/15 15:25:14 ERROR DataWritingSparkTask: Writer for partition 0 is aborting.
18/08/15 15:25:14 ERROR DataWritingSparkTask: Writer for partition 0 aborted.
18/08/15 15:25:14 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.ArrayIndexOutOfBoundsException: 22610
	at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:402)
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:178)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
	at za.co.absa.abris.avro.serde.AvroToRowConverter.convert(AvroToRowConverter.scala:41)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:362)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:361)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.foreach(WholeStageCodegenExec.scala:612)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:130)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:129)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2.scala:135)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
18/08/15 15:25:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ArrayIndexOutOfBoundsException: 6226
	at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:402)
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:178)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
	at za.co.absa.abris.avro.serde.AvroToRowConverter.convert(AvroToRowConverter.scala:41)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:362)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:361)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.foreach(WholeStageCodegenExec.scala:612)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:130)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:129)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2.scala:135)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
18/08/15 15:25:14 ERROR TaskSetManager: Task 1 in stage 0.0 failed 1 times; aborting job
18/08/15 15:25:14 ERROR WriteToDataSourceV2Exec: Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@541f5db is aborting.
18/08/15 15:25:14 ERROR WriteToDataSourceV2Exec: Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@541f5db aborted.
18/08/15 15:25:14 ERROR MicroBatchExecution: Query [id = cfaf0142-ff0a-4664-a67a-f380267605d8, runId = 7a73c55b-0a49-4d3d-8e62-70a0efe958c0] terminated with error
org.apache.spark.SparkException: Writing job aborted.
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:112)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3273)
	at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
	at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:478)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:473)
	at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:472)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
	at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
	at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 22610
	at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:402)
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:178)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
	at za.co.absa.abris.avro.serde.AvroToRowConverter.convert(AvroToRowConverter.scala:41)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:362)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:361)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.foreach(WholeStageCodegenExec.scala:612)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:130)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:129)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2.scala:135)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:82)
	... 31 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 22610
	at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:402)
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:178)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
	at za.co.absa.abris.avro.serde.AvroToRowConverter.convert(AvroToRowConverter.scala:41)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:362)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:361)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.foreach(WholeStageCodegenExec.scala:612)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:130)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:129)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2.scala:135)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Writing job aborted.
=== Streaming Query ===
Identifier: [id = cfaf0142-ff0a-4664-a67a-f380267605d8, runId = 7a73c55b-0a49-4d3d-8e62-70a0efe958c0]
Current Committed Offsets: {}
Current Available Offsets: {KafkaSource[Subscribe[VIP_TG8izMiYViu7Ftoq_0000000000]]: {"VIP_TG8izMiYViu7Ftoq_0000000000":{"1":9,"0":10}}}

Current State: ACTIVE
Thread State: RUNNABLE

Logical Plan:
SerializeFromObject [if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, f1), StringType), true, false) AS f1#70, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, f21), StringType), true, false) AS f21#71, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 2, f30), StringType), true, false) AS f30#72, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 3, f33), StringType), true, false) AS f33#73, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 4, f31), StringType), true, false) AS f31#74, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 5, f34), StringType), true, false) AS f34#75, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 6, f11), StringType), true, false) AS f11#76, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 7, f16), StringType), true, false) AS f16#77, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 8, f17), StringType), true, false) AS f17#78, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 9, f18), StringType), true, false) AS f18#79, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 10, f22), StringType), true, false) AS f22#80, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 11, f28), StringType), true, false) AS f28#81, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 12, f39), StringType), true, false) AS f39#82, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 13, f14), StringType), true, false) AS f14#83, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 14, f19), StringType), true, false) AS f19#84, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 15, f20), StringType), true, false) AS f20#85, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 16, f25), StringType), true, false) AS f25#86, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 17, f29), StringType), true, false) AS f29#87, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 18, f35), StringType), true, false) AS f35#88, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 19, f5), StringType), true, false) AS f5#89, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 20, f10), StringType), true, false) AS f10#90, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 21, f13), StringType), true, false) AS f13#91, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 22, f23), StringType), true, false) AS f23#92, if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 23, f24), StringType), true, false) AS f24#93, ... 20 more fields]
+- MapPartitions <function1>, obj#69: org.apache.spark.sql.Row
   +- DeserializeToObject cast(value#8 as binary), obj#68: binary
      +- Project [value#8]
         +- StreamingExecutionRelation KafkaSource[Subscribe[VIP_TG8izMiYViu7Ftoq_0000000000]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]

	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.SparkException: Writing job aborted.
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:112)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3273)
	at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
	at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:478)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:473)
	at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:472)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
	at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
	at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
	... 1 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 22610
	at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:402)
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:178)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
	at za.co.absa.abris.avro.serde.AvroToRowConverter.convert(AvroToRowConverter.scala:41)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:362)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:361)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.foreach(WholeStageCodegenExec.scala:612)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:130)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:129)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2.scala:135)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:82)
	... 31 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 22610
	at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:402)
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:178)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
	at za.co.absa.abris.avro.serde.AvroToRowConverter.convert(AvroToRowConverter.scala:41)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:362)
	at za.co.absa.abris.avro.serde.AvroDecoder$$anonfun$fromAvroToRow$3$$anonfun$apply$12.apply(AvroDecoder.scala:361)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.foreach(WholeStageCodegenExec.scala:612)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:130)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2.scala:129)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2.scala:135)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Process finished with exit code 1

usage of toAvro method

I'm not sure I understood correctly: for the signature toAvro(schemaName: String, schemaNamespace: String): Dataset[Array[Byte]], the namespace should already exist in the Schema Registry, while schemaName is the name we give to our new schema (in the Spark application).
Unfortunately, when I tried that, the schema was not registered.
Also, for the signature toAvro(rows: Dataset[Row], schemas: SchemasProcessor), could you give me some insight into how to prepare the schemas parameter? The SchemasProcessor only has two getters. (See the sketch below.)
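
For context: in current ABRiS versions the old toAvro(schemaName, schemaNamespace) signatures have been replaced by the AbrisConfig configurator, which handles schema registration explicitly. Below is a minimal sketch, not the original API under discussion; the record name, namespace, topic and registry URL are placeholders, and the fluent method names follow the version-6 configurator.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}
import za.co.absa.abris.avro.functions.to_avro
import za.co.absa.abris.config.AbrisConfig

object RegisterAndSerializeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("RegisterAndSerializeSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Small in-memory DataFrame standing in for real data.
    val dataFrame = Seq((1L, "alice"), (2L, "bob")).toDF("id", "name")

    // Placeholder writer schema; "name" and "namespace" correspond to what the old
    // toAvro(schemaName, schemaNamespace) arguments used to control.
    val schemaString =
      """{
        |  "type": "record",
        |  "name": "test",
        |  "namespace": "mytest",
        |  "fields": [
        |    {"name": "id", "type": "long"},
        |    {"name": "name", "type": "string"}
        |  ]
        |}""".stripMargin

    val toAvroConfig = AbrisConfig
      .toConfluentAvro
      .provideAndRegisterSchema(schemaString)        // registers the schema if it is not in the registry yet
      .usingTopicNameStrategy("my-topic")            // placeholder topic (value subject under the topic-name strategy)
      .usingSchemaRegistry("http://localhost:8081")  // placeholder registry URL

    val avroDf = dataFrame.select(to_avro(struct(col("id"), col("name")), toAvroConfig) as "value")
    avroDf.show(false)
  }
}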

new possible feature: toConfluentAvro with Dataset[T] for user provided T

Hi, great work on this library!
Will you add the same API for an arbitrary Dataset[T]? I am willing to contribute if you are open to a pull request.

The problem with the DataFrame API is that I get a Dataset[T] from structured streaming and have to call .toDF before writing to Confluent Kafka (see the sketch below). When I do that, the job dies with

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

and no other log output.
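
Below is a minimal sketch of the .toDF workaround described above, assuming the current version-6 API; the topic, registry URL, broker and checkpoint path are placeholders, and the built-in rate source stands in for the real Dataset[T].

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}
import za.co.absa.abris.avro.functions.to_avro
import za.co.absa.abris.config.AbrisConfig

object TypedDatasetToKafkaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("TypedDatasetToKafkaSketch").getOrCreate()
    import spark.implicits._

    // Stand-in typed stream: the built-in "rate" source mapped to a tuple,
    // playing the role of the Dataset[T] mentioned in the issue.
    val typed = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "1")
      .load()
      .select(col("value").as("id"), col("timestamp").cast("string").as("ts"))
      .as[(Long, String)]

    // to_avro works on DataFrame columns, so the typed Dataset is converted
    // back with .toDF before serialization.
    val df = typed.toDF("id", "ts")

    val config = AbrisConfig
      .toConfluentAvro
      .downloadSchemaByLatestVersion
      .andTopicNameStrategy("readings")                // placeholder topic
      .usingSchemaRegistry("http://localhost:8081")    // placeholder registry URL

    val query = df
      .select(to_avro(struct(df.columns.map(col): _*), config) as "value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "readings")
      .option("checkpointLocation", "/tmp/checkpoints/readings")
      .start()

    query.awaitTermination()
  }
}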

ABRiS 3.x possible with Spark 2.3.0?

As of #48, ABRiS now requires Spark 2.4.4. Is there something inherent that needs this version? I'm trying to use ABRiS, specifically to_avro, but I'm bound to Spark 2.3.0 by my Hadoop cluster distribution, and when testing locally, moving from ABRiS 2.2.4 to 3.x pulls my local run up to Spark 2.4.x.
When I add exclusions to try to get back to 2.3.0, I get this error:

Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.spark.util.EventLoop.eventThread()Ljava/lang/Thread; from class org.apache.spark.sql.internal.SQLConf$$anonfun$14

I'm pretty sure that's related to my exclusions, but now I'm not sure how to get this to work with 2.3.0. Any suggestions?

Issue while calling pyspark

Hi, I am trying to deserialize Kafka data through PySpark, but I am getting the error below.

{TypeError}Invalid argument, not a string or column: from_avro(value) AS data of type <class 'py4j.java_gateway.JavaObject'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

Environment:
PySpark: 2.4.3
ABRiS: za.co.absa:abris_2.11:3.0.2

The code is below.

from pyspark.sql.dataframe import DataFrame
import logging, traceback
from confluent_kafka import Consumer as kafka_Consumer, KafkaError, TopicPartition
import time
import requests
from pyspark.sql import functions as F
from pyspark.sql import DataFrame, Column
from pyspark.sql.functions import udf

logger = logging.getLogger(__name__)


def deserialize_avro1(spark_context, sql_context, data_frame, schema_registry_url, topic):
    j = spark_context._gateway.jvm
    naming_strategy = getattr(
        getattr(j.za.co.absa.abris.avro.read.confluent.SchemaManager,
                "SchemaStorageNamingStrategies$"), "MODULE$").TOPIC_NAME()
    logger.info("SchemaStorageNamingStrategy: for topic " + topic + " is " + str(naming_strategy))
    conf = getattr(getattr(j.scala.collection.immutable.Map, "EmptyMap$"), "MODULE$")
    conf = getattr(conf, "$plus")(j.scala.Tuple2("schema.registry.url", schema_registry_url))
    conf = getattr(conf, "$plus")(j.scala.Tuple2("schema.registry.topic", topic))
    conf = getattr(conf, "$plus")(j.scala.Tuple2("value.schema.id", "latest"))
    conf = getattr(conf, "$plus")(j.scala.Tuple2("value.schema.naming.strategy", naming_strategy))

    return data_frame.select(j.za.co.absa.abris.avro.functions.from_confluent_avro(data_frame._jdf.col("value"), conf).alias("data")).select("data.*")

Can you please help address this, or let me know if I am missing anything here?

writing to kafka topic

I have tested .toAvro when writing to a Kafka topic; the code is below, followed by a sketch using the current API.
I believe toAvro is serializing correctly, but when the action runs, Spark just hangs and the job never completes. Kafka never sees the message.

import org.apache.spark.sql.SparkSession
import za.co.absa.abris.avro.AvroSerDe._

object oppStreamingMatching {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder
      .appName("oppStreamingMatching")
      .getOrCreate()

    val brokers = "IP:9092"
    val schemaRegistryURL = "http://IP:8081"
    var topic = "exampletopic"

    val path = ".. file.avro"
    val schema = spark.read.format("com.databricks.spark.avro").load(path).schema

    val df = spark
      .read
      .format("com.databricks.spark.avro")
      .schema(schema)
      .load(path)

    df
      .toAvro("test", "mytest")
      .write
      .format("kafka")
      .option("checkpointLocation", "/tmp")
      .option("kafka.bootstrap.servers", brokers)
      .option("topic", topic)
      .save()
  }
}
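
For reference, a rough sketch of the same batch write with the current to_avro / AbrisConfig API (version 6); the file path, broker, topic and registry URL are placeholders, and the schema is assumed to already exist in the registry under the topic-name strategy. Note that the Kafka sink expects the serialized payload in a column named value.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}
import za.co.absa.abris.avro.functions.to_avro
import za.co.absa.abris.config.AbrisConfig

object WriteAvroToKafkaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WriteAvroToKafkaSketch").getOrCreate()

    // Placeholder source: any DataFrame will do; here we read an Avro file with spark-avro.
    val df = spark.read.format("avro").load("/path/to/file.avro")

    // Serializer config: use the latest schema registered under the topic-name strategy.
    val toAvroConfig = AbrisConfig
      .toConfluentAvro
      .downloadSchemaByLatestVersion
      .andTopicNameStrategy("exampletopic")
      .usingSchemaRegistry("http://localhost:8081")

    // The Kafka sink expects the payload in a column named "value".
    val avroDf = df.select(to_avro(struct(df.columns.map(col): _*), toAvroConfig) as "value")

    avroDf.write
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "exampletopic")
      .save()
  }
}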

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to org.apache.avro.generic.GenericRecord

I tried ConfluentKafkaAvroReader with my schema registered in the Schema Registry and got the following error:

Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to org.apache.avro.generic.GenericRecord
at za.co.absa.abris.avro.sql.AvroDeserializer$$anonfun$za$co$absa$abris$avro$sql$AvroDeserializer$$newWriter$18.apply(AvroDeserializer.scala:172)
at za.co.absa.abris.avro.sql.AvroDeserializer$$anonfun$za$co$absa$abris$avro$sql$AvroDeserializer$$newWriter$18.apply(AvroDeserializer.scala:170)
at za.co.absa.abris.avro.sql.AvroDeserializer$$anonfun$8.apply(AvroDeserializer.scala:332)
at za.co.absa.abris.avro.sql.AvroDeserializer$$anonfun$8.apply(AvroDeserializer.scala:328)
at za.co.absa.abris.avro.sql.AvroDeserializer$$anonfun$getRecordWriter$1.apply(AvroDeserializer.scala:350)
at za.co.absa.abris.avro.sql.AvroDeserializer$$anonfun$getRecordWriter$1.apply(AvroDeserializer.scala:347)
at za.co.absa.abris.avro.sql.AvroDeserializer$$anonfun$3.apply(AvroDeserializer.scala:60)
at za.co.absa.abris.avro.sql.AvroDeserializer$$anonfun$3.apply(AvroDeserializer.scala:58)
at za.co.absa.abris.avro.sql.AvroDeserializer.deserialize(AvroDeserializer.scala:74)
at za.co.absa.abris.avro.sql.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:59)
... 17 more

Confluent Schema Registry with Basic Authentication

How can from_confluent_avro be used when the Schema Registry is secured? We need to pass the parameters below for authentication:

basic.auth.credentials.source=USER_INFO
schema.registry.basic.auth.user.info=<SR_API_KEY>:<SR_API_SECRET>
schema.registry.url=https://hostname

How do we pass these parameters through the schema manager?
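
One possible approach, sketched below under the assumption of the current (version 6) fluent configurator: usingSchemaRegistry can also take a map of Schema Registry client properties instead of a plain URL, so the basic-auth settings can be passed there. The key/secret values are placeholders, and the exact property names may vary with the Confluent client version in use.

import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.functions.from_avro
import za.co.absa.abris.config.AbrisConfig

// Schema Registry client properties; credentials are placeholders.
val registryConfig = Map(
  "schema.registry.url"           -> "https://hostname",
  "basic.auth.credentials.source" -> "USER_INFO",
  "basic.auth.user.info"          -> "<SR_API_KEY>:<SR_API_SECRET>"
)

val fromAvroConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("secured-topic")   // placeholder topic
  .usingSchemaRegistry(registryConfig)     // property map instead of a bare URL

// dataFrame here is whatever you read from Kafka (it must have a binary "value" column):
// val deserialized = dataFrame.select(from_avro(col("value"), fromAvroConfig) as "data")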

Support for enums in avro schema

The fromAvro function doesn't seem to work well if there are enums in the Avro schema.

Caused by: java.lang.RuntimeException: org.apache.avro.generic.GenericData$EnumSymbol is not a valid external type for schema of string
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.StaticInvoke_5$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_4_5$(Unknown Source)
