Comments (11)
Hi @andybryant , how are you? Thanks for reaching out and happy the library has been useful to you.
ABRiS does support keys from Confluent, have you tried this?
from abris.
Hi @felipemmelo
Thanks for responding.
I did see that class, but it looks like the reading of confluent keys is just added for Spark Streaming.
There are extension methods for DataFrame for writing to Kafka with keys and values, but I couldn't see any for reading from Kafka, unless I'm missing something.
I managed to get it going locally by adding some additional extension methods, but I had to put the new implicit class in your package to get access to all the handy utility methods.
Cheers
Andy
from abris.
Got it, just to check, have you gone through the examples available here?
You can try to use the Confluent reader and writer ones. Please let me know if it helps.
from abris.
Hi @felipemmelo
Thanks for the link. The reading example is just for reading streams, not for the DataFrame api.
Cheers
Andy
from abris.
Hi @andybryant , I'm afraid I'm not following :)
This class, for instance, reads keys and values from a Confluent Kafka broker, retrieving the schema from Schema Registry. Isn't that what you're looking for?
Also, the DataFrame API is there for sure, as the whole library was built to support Structured Streaming, which is entirely based on DataFrames. In other words, the library helps you load your Avro payload into a Spark DataFrame. After the decoding, you have your DataFrame and the library plays no further role.
from abris.
I see now, @andybryant. So the point is not actually about accessing the key, but about processing Confluent-style Avro in batch mode.
I'll try to find some time to add this API entry. If you don't mind, I'll rename this issue to reflect the problem more closely.
Also, since this is an interesting situation, may I ask what the use case is for processing data from Confluent Kafka in batch mode?
Cheers.
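For readers unsure what "Confluent-style Avro" refers to: Confluent serializers prepend a 5-byte header (a zero magic byte followed by a 4-byte big-endian schema ID) to the Avro binary payload, which is why a plain Avro decoder cannot read these messages directly. The following is a minimal sketch of splitting that header off, using only the JDK. The object and method names are hypothetical and are not part of ABRiS.

```scala
import java.nio.ByteBuffer

// Hypothetical helper, not part of ABRiS: illustrates the Confluent framing only.
object ConfluentWireFormat {
  // Confluent framing: 1 magic byte (0x0), 4-byte big-endian schema ID, then the Avro payload.
  private val MagicByte: Byte = 0x0
  private val HeaderSize = 5

  /** Splits a Confluent-framed message into (schemaId, avroPayload). */
  def split(message: Array[Byte]): (Int, Array[Byte]) = {
    require(message.length >= HeaderSize, s"message too short: ${message.length} bytes")
    val buffer = ByteBuffer.wrap(message) // ByteBuffer reads big-endian by default
    require(buffer.get() == MagicByte, "unknown magic byte, not Confluent-framed Avro")
    val schemaId = buffer.getInt()
    val payload = new Array[Byte](buffer.remaining())
    buffer.get(payload)
    (schemaId, payload)
  }
}
```

The schema ID is what a reader would look up in Schema Registry before decoding the remaining bytes, whether the bytes come from a stream or from a batch read.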
from abris.
Hi Felipe
As you say, our use case is a little unusual. It's part of a machine learning pipeline. We're updating a store of events based on a stream of events captured in Kafka. We're storing the events in Parquet format, which works best for fairly large files, so we only update the store once a day.
We could have run this as a streaming job with a poll frequency of a day, but that would mean having the job running continuously. Instead we kick off a batch job once a day in a dedicated EMR cluster and drop the cluster after the job is complete.
To get this going for our use case I added the following patch class. I needed to put it in one of your packages to get access to the package private internal classes.
import org.apache.spark.sql.types.{ StructField, StructType }
import org.apache.spark.sql.{ Dataset, Row }
import za.co.absa.abris.avro.format.SparkAvroConversions
import za.co.absa.abris.avro.parsing.utils.AvroSchemaUtils
import za.co.absa.abris.avro.read.confluent.ScalaConfluentKafkaAvroDeserializer
import za.co.absa.abris.avro.serde.{ AvroDecoder, AvroReaderFactory, AvroToRowConverter, AvroToRowEncoderFactory }

object AvroPatch {

  // no way to access the key using the standard absa library
  implicit class AvroDeserializer(dataframe: Dataset[Row]) extends AvroDecoder {

    /**
     * This method supports schema changes from Schema Registry. However, the conversion between Avro records and Spark
     * rows relies on RowEncoders, which are defined before the job starts. Thus, although the schema changes are supported
     * while reading, they are not translated to RowEncoders, which could lead to errors in the final data.
     *
     * Refer to the [[ScalaConfluentKafkaAvroDeserializer.deserialize()]] documentation to better understand how this
     * operation is performed.
     */
    def fromConfluentAvroForKey(destinationColumn: String, schemaRegistryConf: Map[String, String]): Dataset[Row] = {
      val dataSchema = AvroSchemaUtils.loadForKeyAndValue(schemaRegistryConf)._1
      val originalSchema = dataframe.schema

      // locate the destination field (case-insensitive) and fail early if it is missing
      val destinationIndex = originalSchema.fields.indexWhere(_.name.toLowerCase == destinationColumn.toLowerCase)
      require(destinationIndex >= 0, s"Column '$destinationColumn' not found in the DataFrame schema")

      // sets the Avro schema into the destination field, on a copy so the DataFrame's own schema is not mutated
      val updatedSchema = StructType(
        originalSchema.fields.updated(
          destinationIndex,
          StructField(destinationColumn, SparkAvroConversions.toSqlType(dataSchema), nullable = false)))

      implicit val rowEncoder = AvroToRowEncoderFactory.createRowEncoder(updatedSchema)

      dataframe
        .mapPartitions(partition => {
          // the reader and converter are created once per partition, on the executor
          val avroReader = AvroReaderFactory.createConfiguredConfluentAvroReader(None, Some(schemaRegistryConf))
          val avroToRowConverter = new AvroToRowConverter(None)

          partition.map(avroRecord => {
            val sparkType = avroToRowConverter.convert(avroReader.deserialize(avroRecord.get(destinationIndex).asInstanceOf[Array[Byte]]))

            // copy the row, replacing only the destination column with the decoded value
            val array: Array[Any] = new Array(avroRecord.size)
            for (i <- 0 until avroRecord.size) {
              array(i) = avroRecord.get(i)
            }
            array(destinationIndex) = sparkType
            Row.fromSeq(array)
          })
        })
    }
  }
}
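For context, the DataFrame that the patch above extends would typically come from a bounded Kafka read rather than a stream. Below is a small sketch of the reader options such a batch query might use. The option keys are the ones documented for Spark's Kafka source, while the helper object, its name, and the broker and topic values are hypothetical.

```scala
// Hypothetical helper: builds the options for a bounded (batch) read of one Kafka topic.
object KafkaBatchOptions {
  def forTopic(bootstrapServers: String, topic: String): Map[String, String] = Map(
    "kafka.bootstrap.servers" -> bootstrapServers,
    "subscribe"               -> topic,
    // A batch query reads a fixed offset range; these two values cover the whole topic.
    "startingOffsets"         -> "earliest",
    "endingOffsets"           -> "latest"
  )
}
```

With Spark and its Kafka connector on the classpath, `spark.read.format("kafka").options(KafkaBatchOptions.forTopic("broker:9092", "events")).load()` should return a DataFrame with binary `key` and `value` columns, which is the shape the patch expects. A daily job would probably pass explicit per-partition offsets as JSON instead of `earliest`, so that each run picks up where the previous one stopped.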
from abris.
Hi @andybryant , sorry for the late reply.
I've faced similar issues here and solved them using one-off triggers, as explained here.
It seems the same would work for you. Since you're already launching the batch job, you could change it to a streaming query that runs once with Trigger.Once(), and the library would do the rest.
from abris.
If you run into any issues, please feel free to reopen it.
from abris.