The dataframe extension method fromConfluentAvro has got it hard-coded that the schema

Cannot read key dataframe column using confluent schema registry

absaoss commented on May 26, 2024

Cannot read key dataframe column using confluent schema registry

Comments (11)

felipemmelo commented on May 26, 2024

Hi @andybryant, how are you? Thanks for reaching out, and I'm happy the library has been useful to you.

ABRiS does support keys from Confluent. Have you tried this?

andybryant commented on May 26, 2024

Hi @felipemmelo

Thanks for responding.

I did see that class, but it looks like reading Confluent keys was only added for Spark Streaming.
There are DataFrame extension methods for writing to Kafka with keys and values, but I couldn't see any for reading from Kafka, unless I'm missing something.

I managed to get it going locally by adding some extension methods of my own, but I had to put the new implicit class in your package to access all the handy utility methods.

Cheers
Andy

felipemmelo commented on May 26, 2024

Got it, just to check, have you gone through the examples available here?

You can try to use the Confluent reader and writer ones. Please let me know if it helps.

andybryant commented on May 26, 2024

Hi @felipemmelo

Thanks for the link. The reading example is just for reading streams, not for the DataFrame api.

Cheers
Andy

felipemmelo commented on May 26, 2024

Hi @andybryant, I'm afraid I'm not following :)

This class (https://github.com/AbsaOSS/ABRiS/blob/master/src/main/scala/za/co/absa/abris/examples/using_keys/ConfluentKafkaAvroReaderWithKey.scala), for instance, reads keys and values from a Confluent Kafka broker and retrieves the schema from Schema Registry. Isn't that what you're looking for?

Also, the DataFrame API is there for sure, as the whole library was built to support Structured Streaming, which is entirely based on DataFrames. In other words, the library helps you load your Avro payload into a Spark DataFrame. After the decoding, you have your DataFrame and the library has no further role.

andybryant commented on May 26, 2024

Hi Felipe

That example works on a DataStreamReader, which is used for reading from a stream. I needed a fromConfluentAvro extension method for a DataFrame used in a batch process.
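For readers following along, the batch versus streaming distinction can be sketched roughly as below. `spark.read` returns a bounded DataFrame through a DataFrameReader, while `spark.readStream` goes through a DataStreamReader and returns an unbounded one. In both cases the Kafka source exposes `key` and `value` as binary columns. The object and method names here (`KafkaReadModes`, `kafkaOptions`, `readBatch`, `readStream`) are made up for illustration, and the sketch assumes the `spark-sql-kafka-0-10` connector is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of ABRiS or Spark.
object KafkaReadModes {

  // Connection settings shared by both read modes; the values are supplied by the caller.
  def kafkaOptions(bootstrapServers: String, topic: String): Map[String, String] =
    Map("kafka.bootstrap.servers" -> bootstrapServers, "subscribe" -> topic)

  // Batch: a bounded DataFrame covering the offsets that exist when the job runs.
  def readBatch(spark: SparkSession, options: Map[String, String]): DataFrame =
    spark.read
      .format("kafka")
      .options(options)
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()

  // Streaming: an unbounded DataFrame obtained through a DataStreamReader.
  def readStream(spark: SparkSession, options: Map[String, String]): DataFrame =
    spark.readStream
      .format("kafka")
      .options(options)
      .load()
}
```

An extension method defined only for the streaming path would not help with the DataFrame returned by `readBatch`, which appears to be the gap described here.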

felipemmelo commented on May 26, 2024

I see now @andybryant, so the point is not actually about accessing the key but about processing Confluent-style Avro in batch mode.
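As background on what "Confluent-style Avro" means: Confluent serializers prefix each Avro payload with a zero magic byte followed by a 4-byte big-endian schema ID, which is why the key and value columns need a Schema Registry aware decoder rather than a plain Avro reader. A minimal sketch of splitting that header from the payload is below. `ConfluentFrame` and `parse` are hypothetical names used only for illustration and are not part of ABRiS.

```scala
import java.nio.ByteBuffer

// Hypothetical helper, not part of ABRiS: holds the schema ID and the raw Avro payload.
final case class ConfluentFrame(schemaId: Int, payload: Array[Byte])

object ConfluentFrame {
  private val MagicByte: Byte = 0
  private val HeaderSize = 5 // 1 magic byte + 4-byte schema ID

  // Returns None when the bytes do not carry a Confluent header.
  def parse(bytes: Array[Byte]): Option[ConfluentFrame] =
    if (bytes == null || bytes.length < HeaderSize || bytes(0) != MagicByte) None
    else {
      // ByteBuffer reads big-endian by default, which matches the wire format.
      val schemaId = ByteBuffer.wrap(bytes, 1, 4).getInt
      Some(ConfluentFrame(schemaId, bytes.drop(HeaderSize)))
    }
}
```

The schema ID is what a decoder would look up in Schema Registry before decoding the remaining bytes.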

I'll try to find some time to add this API entry. If you don't mind, I'll rename this issue to more closely reflect the problem.

Also, since this is an interesting situation, may I ask what your use case is for processing data from Confluent Kafka in batch mode?

Cheers.

andybryant commented on May 26, 2024

Hi Felipe

As you say, our use case is a little unusual. It's part of a machine learning pipeline. We're updating a store of events based on a stream of events captured in Kafka. We store the events in Parquet format, which works best with fairly large files, so we only update the store once a day.

We could have run this as a streaming job with a poll frequency of one day, but that would mean keeping the job running continuously. Instead, we kick off a batch job once a day in a dedicated EMR cluster and drop the cluster after the job is complete.

To get this going for our use case, I added the following patch class. I needed to put it in one of your packages to get access to the package-private internal classes.


import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.{ Dataset, Row }
import za.co.absa.abris.avro.format.SparkAvroConversions
import za.co.absa.abris.avro.parsing.utils.AvroSchemaUtils
import za.co.absa.abris.avro.read.confluent.ScalaConfluentKafkaAvroDeserializer
import za.co.absa.abris.avro.serde.{ AvroDecoder, AvroReaderFactory, AvroToRowConverter, AvroToRowEncoderFactory }

object AvroPatch {

  // no way to access the key using the standard absa library

  implicit class AvroDeserializer(dataframe: Dataset[Row]) extends AvroDecoder {

    /**
     * This method supports schema changes from Schema Registry. However, the conversion between Avro records and Spark
     * rows relies on RowEncoders, which are defined before the job starts. Thus, although the schema changes are supported
     * while reading, they are not translated to RowEncoders, which could lead to errors in the final data.
     *
     * Refer to the [[ScalaConfluentKafkaAvroDeserializer.deserialize()]] documentation to better understand how this
     * operation is performed.
     */
    def fromConfluentAvroForKey(destinationColumn: String, schemaRegistryConf: Map[String, String]): Dataset[Row] = {
      val dataSchema = AvroSchemaUtils.loadForKeyAndValue(schemaRegistryConf)._1

      val originalSchema = dataframe.schema

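      // locates the destination column (case-insensitive) and fails fast if it is missing
      val destinationIndex = originalSchema.fields.indexWhere(_.name.equalsIgnoreCase(destinationColumn))
      require(destinationIndex >= 0, s"Column '$destinationColumn' not found in the DataFrame schema")

      // sets the Avro schema into the destination field (this updates the schema's field array in place)
      originalSchema.fields(destinationIndex) = StructField(destinationColumn, SparkAvroConversions.toSqlType(dataSchema), nullable = false)

      implicit val rowEncoder = AvroToRowEncoderFactory.createRowEncoder(originalSchema)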

      dataframe
        .mapPartitions(partition => {

          val avroReader = AvroReaderFactory.createConfiguredConfluentAvroReader(None, Some(schemaRegistryConf))
          val avroToRowConverter = new AvroToRowConverter(None)

          partition.map(avroRecord => {

            // decodes the Confluent-framed Avro bytes held in the destination column
            val avroBytes = avroRecord.get(destinationIndex).asInstanceOf[Array[Byte]]
            val sparkType = avroToRowConverter.convert(avroReader.deserialize(avroBytes))

            // copies the row, replacing the raw bytes with the decoded value
            Row.fromSeq(avroRecord.toSeq.updated(destinationIndex, sparkType))
          })
        })
    }

  }

}

felipemmelo commented on May 26, 2024

Hi @andybryant, sorry for the late reply.

I've faced similar issues here and solved them using one-off triggers, as explained here (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers).

It seems the same would work for you. Since you're already launching the batch job, you could change it to a streaming query that runs once with Trigger.Once(), and the library would do the rest.
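A rough sketch of what such a run-once job might look like is below, assuming the `spark-sql-kafka-0-10` connector is available. `RunOnceJob`, `JobConfig`, and all field names are hypothetical, and the decoding step is left as a comment because it depends on the ABRiS version in use. The checkpoint location is what lets each daily run resume from the offsets where the previous run stopped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// Hypothetical job skeleton; names and settings are illustrative only.
object RunOnceJob {

  final case class JobConfig(
    bootstrapServers: String,
    topic: String,
    outputPath: String,
    checkpointPath: String
  )

  def runOnce(spark: SparkSession, cfg: JobConfig): Unit = {
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", cfg.bootstrapServers)
      .option("subscribe", cfg.topic)
      .option("startingOffsets", "earliest") // only applies before a checkpoint exists
      .load()

    // Decode the key and value columns here with the library's streaming reader.

    val query = stream.writeStream
      .format("parquet")
      .option("path", cfg.outputPath)
      .option("checkpointLocation", cfg.checkpointPath)
      .trigger(Trigger.Once()) // process what is available, then stop
      .start()

    query.awaitTermination()
  }
}
```

The cluster can then be torn down once `runOnce` returns. Newer Spark releases also provide `Trigger.AvailableNow()`, which may be worth checking against the Spark version in use.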

andybryant commented on May 26, 2024

Thanks for the tip Felipe - that looks like exactly what I need. I'll give it a try.

Cheers
Andy

felipemmelo commented on May 26, 2024

If you run into any issues, please feel free to reopen this.
