
Comments (11)

felipemmelo commented on May 26, 2024

Hi @andybryant, how are you? Thanks for reaching out, and I'm happy the library has been useful to you.

ABRiS does support keys from Confluent. Have you tried this?

from abris.

andybryant commented on May 26, 2024

Hi @felipemmelo

Thanks for responding.

I did see that class, but it looks like reading Confluent keys was only added for Spark Streaming.
There are extension methods on DataFrame for writing to Kafka with keys and values, but I couldn't see any for reading from Kafka, unless I'm missing something.

I managed to get it going locally by adding some additional extension methods, but I had to put a new implicit class in your package to see all the handy utility methods.

Cheers
Andy


felipemmelo commented on May 26, 2024

Got it, just to check, have you gone through the examples available here?

You can try to use the Confluent reader and writer ones. Please let me know if it helps.


andybryant commented on May 26, 2024

Hi @felipemmelo

Thanks for the link. The reading example only covers reading streams, not the batch DataFrame API.

Cheers
Andy


felipemmelo commented on May 26, 2024

Hi @andybryant, I'm afraid I'm not following :)

This class, for instance, reads keys and values from a Confluent Kafka broker and retrieves the schema from Schema Registry. Isn't that what you're looking for?

Also, the DataFrame API is there for sure, as the whole library was built to support Structured Streaming, which is entirely based on DataFrames. In other words, the library helps you load your Avro payload into a Spark DataFrame. After the decoding, you have your DataFrame and the library plays no further role.


andybryant commented on May 26, 2024


felipemmelo commented on May 26, 2024

I see now, @andybryant. So the point is not really about accessing the key, but about processing Confluent-like Avro in batch mode.

I'll try to find some time to add this API entry. If you don't mind, I'll rename this issue to reflect the problem more closely.

Also, since this is an interesting situation, may I ask what your use case is for processing data from Confluent Kafka in batch mode?

Cheers.


andybryant commented on May 26, 2024

Hi Felipe

As you say, our use case is a little unusual. It's part of a machine learning pipeline. We're updating a store of events based on a stream of events captured in Kafka. We store the events in Parquet format, which works best with fairly large files, so we only update the store once a day.

We could have run this as a streaming job with a poll frequency of one day, but that would mean having the job running continuously. Instead we kick off a batch job once a day in a dedicated EMR cluster and drop the cluster after the job is complete.

To get this going for our use case I added the following patch class. I needed to put it in one of your packages to get access to the package-private internal classes.


import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.{ Dataset, Row }
import za.co.absa.abris.avro.format.SparkAvroConversions
import za.co.absa.abris.avro.parsing.utils.AvroSchemaUtils
import za.co.absa.abris.avro.read.confluent.ScalaConfluentKafkaAvroDeserializer
import za.co.absa.abris.avro.serde.{ AvroDecoder, AvroReaderFactory, AvroToRowConverter, AvroToRowEncoderFactory }

object AvroPatch {

  // no way to access the key using the standard absa library

  implicit class AvroDeserializer(dataframe: Dataset[Row]) extends AvroDecoder {

    /**
     * This method supports schema changes from Schema Registry. However, the conversion between Avro records and Spark
     * rows relies on RowEncoders, which are defined before the job starts. Thus, although schema changes are supported
     * while reading, they are not translated to RowEncoders, which could lead to errors in the final data.
     *
     * Refer to the [[ScalaConfluentKafkaAvroDeserializer.deserialize()]] documentation to better understand how this
     * operation is performed.
     */
    def fromConfluentAvroForKey(destinationColumn: String, schemaRegistryConf: Map[String, String]): Dataset[Row] = {
      val dataSchema = AvroSchemaUtils.loadForKeyAndValue(schemaRegistryConf)._1

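      val originalSchema = dataframe.schema

      // locates the destination column (case-insensitive) and fails fast if it is missing
      val destinationIndex = originalSchema.fieldNames.indexWhere(_.equalsIgnoreCase(destinationColumn))
      require(destinationIndex >= 0, s"Column '$destinationColumn' not found in the DataFrame schema")

      // sets the Avro schema into the destination field of a copy, so the DataFrame's own schema array is not mutated
      val decodedSchema = org.apache.spark.sql.types.StructType(
        originalSchema.fields.updated(
          destinationIndex,
          StructField(destinationColumn, SparkAvroConversions.toSqlType(dataSchema), nullable = false)))

      implicit val rowEncoder = AvroToRowEncoderFactory.createRowEncoder(decodedSchema)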

      dataframe
        .mapPartitions(partition => {

          val avroReader = AvroReaderFactory.createConfiguredConfluentAvroReader(None, Some(schemaRegistryConf))
          val avroToRowConverter = new AvroToRowConverter(None)

          partition.map(avroRecord => {

            val sparkType = avroToRowConverter.convert(avroReader.deserialize(avroRecord.get(destinationIndex).asInstanceOf[Array[Byte]]))
            val array: Array[Any] = new Array(avroRecord.size)

            for (i <- 0 until avroRecord.size) {
              array(i) = avroRecord.get(i)
            }
            array(destinationIndex) = sparkType
            Row.fromSeq(array)
          })
        })
    }

  }

}
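
For readers trying this out, here is a rough sketch of how the input DataFrame for such a patch could be obtained in batch mode. It uses only Spark's built-in Kafka source (the spark-sql-kafka-0-10 module must be on the classpath). The object name, method names, broker address and topic are placeholders for illustration and are not part of ABRiS or of the patch above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of ABRiS.
object KafkaBatchRead {

  // Options for a bounded (batch) read of a whole topic.
  def kafkaBatchOptions(bootstrapServers: String, topic: String): Map[String, String] =
    Map(
      "kafka.bootstrap.servers" -> bootstrapServers,
      "subscribe"               -> topic,
      "startingOffsets"         -> "earliest",
      "endingOffsets"           -> "latest"
    )

  // spark.read (rather than spark.readStream) yields a plain batch DataFrame.
  // The Kafka source exposes `key` and `value` as binary columns, which is
  // what a decoder such as the patch's fromConfluentAvroForKey would consume.
  def readBatch(spark: SparkSession, bootstrapServers: String, topic: String): DataFrame =
    spark.read
      .format("kafka")
      .options(kafkaBatchOptions(bootstrapServers, topic))
      .load()
      .select("key", "value", "topic", "partition", "offset", "timestamp")
}
```

With the patch imported, the decoding step would presumably be applied to the result of `readBatch`, passing the name of the binary column to decode and the Schema Registry configuration.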


felipemmelo commented on May 26, 2024

Hi @andybryant, sorry for the late reply.

I've faced similar issues and solved them using one-off triggers, as explained here.

It seems the same would work for you. Since you're already launching the batch job, you could change it to a streaming query that runs once with Trigger.Once(), and the library would do the rest.
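
A minimal sketch of what such a run-once streaming job might look like, assuming Spark's built-in Kafka source and a Parquet sink. The object name, method names and all paths and addresses are hypothetical, and the Avro decoding step is left as a comment because the exact ABRiS call depends on the library version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

// Hypothetical job wrapper, not part of ABRiS.
object OnceADayJob {

  // Options for the streaming Kafka source. On the first run the query starts
  // from the earliest offsets; later runs resume from the checkpoint.
  def kafkaSourceOptions(bootstrapServers: String, topic: String): Map[String, String] =
    Map(
      "kafka.bootstrap.servers" -> bootstrapServers,
      "subscribe"               -> topic,
      "startingOffsets"         -> "earliest"
    )

  def runOnce(
      spark: SparkSession,
      bootstrapServers: String,
      topic: String,
      outputPath: String,
      checkpointPath: String): Unit = {

    val raw: DataFrame = spark.readStream
      .format("kafka")
      .options(kafkaSourceOptions(bootstrapServers, topic))
      .load()

    // The library's streaming Avro decoding of key/value would be applied to `raw` here.

    val query = raw.writeStream
      .format("parquet")
      .option("path", outputPath)
      .option("checkpointLocation", checkpointPath)
      // Process everything available since the last checkpoint, then stop.
      .trigger(Trigger.Once())
      .start()

    // Returns once the single micro-batch has finished, so the cluster can be dropped.
    query.awaitTermination()
  }
}
```

Because progress is tracked in the checkpoint location, each daily run picks up only the records that arrived since the previous run, which matches the once-a-day update described earlier in the thread.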


andybryant commented on May 26, 2024


felipemmelo commented on May 26, 2024

If there are any issues, please feel free to reopen it.

