Comments (10)
You need to use to_avro twice, once for the key and once for the value.
The key needs its own AbrisConfig that points to the correct key schema. For some schema naming strategies, you need to set isKey to true so that the proper schema name is created.
Examples of the config are here: https://github.com/AbsaOSS/ABRiS/blob/master/documentation/confluent-avro-documentation.md
from abris.
Do you have a full and working example?
Indeed, I was following https://github.com/AbsaOSS/ABRiS/blob/master/documentation/confluent-avro-documentation.md and also for the schema of the key:
val keySchema = AvroSchemaUtils.toAvroSchema(df, "key")
schemaManager.register(SchemaSubject.usingTopicNameStrategy("t", true), schema)
but it then fails with:
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert SQL top-level record to Avro top-level record because schema is incompatible (sqlType = STRING, avroType = {"type":"record","name":"topLevelRecord","fields":[{"name":"brand","type":["string","null"]},{"name":"rating_mean","type":["double","null"]},{"name":"duration_mean","type":["double","null"]}]})
when trying to use the Avro string type for the string key column.
When I instead call to_avro only once, I get a nulled-out key field (but no exception).
from abris.
When using .andTopicNameStrategy("t") for a topic t whose schema subjects are suffixed with either -key or -value, I would expect a single AbrisConfig to work.
from abris.
If it was one config, how would the to_avro function know whether it is a key or a value?
from abris.
For example from https://github.com/AbsaOSS/ABRiS/blob/master/documentation/confluent-avro-documentation.md:
// Use latest version of already existing schema
val toAvroConfig4: ToAvroConfig = AbrisConfig
  .toConfluentAvro
  .downloadSchemaByLatestVersion
  .andTopicNameStrategy("fooTopic")
  .usingSchemaRegistry("http://registry-url")
If one follows the (default) naming convention for the topic fooTopic, there are two schemas: fooTopic-key and fooTopic-value.
I therefore assumed that .andTopicNameStrategy("fooTopic") would work in both cases.
Where would you specify here whether it is a key or a value?
from abris.
As mentioned elsewhere in the documentation, the value schema is the default. If you look at the code, you will see that the method looks like this:
def andTopicNameStrategy(topicName: String, isKey: Boolean = false)
So if you want to use the key schema, you must set isKey = true.
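For illustration, the effect of the isKey flag under the topic name strategy can be sketched in plain Scala. This is a minimal sketch of the subject naming convention discussed in this thread (topic name plus -key or -value), not ABRiS source code; SubjectNaming and subjectFor are hypothetical names.

```scala
// Illustrative sketch only: mirrors the topic name strategy convention
// (topic name plus "-key" or "-value"); this is not ABRiS source code.
object SubjectNaming {
  // Hypothetical helper: derives the schema registry subject for a topic.
  // As in andTopicNameStrategy, the value subject is the default.
  def subjectFor(topicName: String, isKey: Boolean = false): String =
    if (isKey) s"$topicName-key" else s"$topicName-value"

  def main(args: Array[String]): Unit = {
    println(subjectFor("fooTopic"))               // fooTopic-value
    println(subjectFor("fooTopic", isKey = true)) // fooTopic-key
  }
}
```

A config built without isKey = true therefore resolves to the value subject, which is why the key needs a second config.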
from abris.
Thanks.
But even when setting two different configurations, I still get the error:
Cannot convert SQL top-level record to Avro top-level record because schema is incompatible (sqlType = STRING, avroType = {"type":"record","name":"topLevelRecord","fields":[{"name":"brand","type":["string","null"]},{"name":"rating_mean","type":["double","null"]},{"name":"duration_mean","type":["double","null"]}]})
This is for an input data frame of:
|-- key_brand: binary (nullable = true)
|-- value: binary (nullable = false)
As you can see, the original:
root
|-- brand: string (nullable = true)
|-- rating_mean: double (nullable = true)
|-- duration_mean: double (nullable = true)
is transformed into a single key column and a single value column.
from abris.
Here is a full, self-contained example.
I have tried to follow your suggestions; however, the key that is written to Kafka is still nulled out!
import spark.implicits._
val aggedDf = Seq(("foo", 1.0, 1.0), ("bar", 2.0, 2.0)).toDF("brand", "rating_mean", "duration_mean")
aggedDf.printSchema
aggedDf.show
+-----+-----------+-------------+
|brand|rating_mean|duration_mean|
+-----+-----------+-------------+
|  foo|        1.0|          1.0|
|  bar|        2.0|          2.0|
+-----+-----------+-------------+
import za.co.absa.abris.avro.parsing.utils.AvroSchemaUtils
import za.co.absa.abris.avro.read.confluent.SchemaManagerFactory
import org.apache.avro.Schema
import za.co.absa.abris.avro.read.confluent.SchemaManager
import za.co.absa.abris.avro.registry.SchemaSubject
import za.co.absa.abris.avro.functions.to_avro
import org.apache.spark.sql._
import za.co.absa.abris.config.{AbrisConfig, ToAvroConfig}
import org.apache.spark.sql.functions.struct
// generate schema for all columns in a dataframe
val valueSchema = AvroSchemaUtils.toAvroSchema(aggedDf)
val keySchema = AvroSchemaUtils.toAvroSchema(aggedDf.select($"brand".alias("key_brand")), "key_brand")
val schemaRegistryClientConfig = Map(AbrisConfig.SCHEMA_REGISTRY_URL -> "http://localhost:8081")
val t = "metrics_per_brand_spark222xx"
val schemaManager = SchemaManagerFactory.create(schemaRegistryClientConfig)
// register schema with topic name strategy
def registerSchema1(schemaKey: Schema, schemaValue: Schema, schemaManager: SchemaManager, schemaName: String): Int = {
  schemaManager.register(SchemaSubject.usingTopicNameStrategy(schemaName, true), schemaKey)
  schemaManager.register(SchemaSubject.usingTopicNameStrategy(schemaName, false), schemaValue)
}
registerSchema1(keySchema, valueSchema, schemaManager, t)
val toAvroConfig4 = AbrisConfig
  .toConfluentAvro
  .downloadSchemaByLatestVersion
  .andTopicNameStrategy(t)
  .usingSchemaRegistry("http://localhost:8081")
val toAvroConfig4Key = AbrisConfig
  .toConfluentAvro
  .downloadSchemaByLatestVersion
  .andTopicNameStrategy(t, isKey = true)
  .usingSchemaRegistry("http://localhost:8081")
def writeDfToAvro(keyAvroConfig: ToAvroConfig, toAvroConfig: ToAvroConfig)(dataFrame: DataFrame) = {
  // this is the key! need to keep the key to guarantee temporal ordering
  val availableCols = dataFrame.columns//.drop("brand").columns
  val allColumns = struct(availableCols.head, availableCols.tail: _*)
  dataFrame.select(to_avro($"brand", keyAvroConfig).alias("key_brand"), to_avro(allColumns, toAvroConfig) as 'value)
  // dataFrame.select($"brand".alias("key_brand"), to_avro(allColumns, toAvroConfig) as 'value)
}
val aggedAsAvro = aggedDf.transform(writeDfToAvro(toAvroConfig4Key, toAvroConfig4))
aggedAsAvro.printSchema
root
|-- key_brand: binary (nullable = true)
|-- value: binary (nullable = false)
aggedAsAvro.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", t)
  .save()
from abris.
Oh, OK, now I understand. Your problem is not Abris.
The value column is the only required option. If a key column is not specified then a null valued key column will be automatically added.
You need to rename the key_brand column to key.
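Applied to the example in this thread, the fix might look like the following sketch. It assumes the two ToAvroConfig instances from the example (toAvroConfig4Key and toAvroConfig4) and relies on Spark's Kafka sink reading the record key from a column named key and the payload from a column named value; writeDfToKafkaColumns is a hypothetical name, not part of ABRiS.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct}
import za.co.absa.abris.avro.functions.to_avro
import za.co.absa.abris.config.ToAvroConfig

// Sketch: same idea as writeDfToAvro from the example, except that the
// serialized key column is named "key" so that the Kafka sink picks it up.
def writeDfToKafkaColumns(keyAvroConfig: ToAvroConfig, valueAvroConfig: ToAvroConfig)(
    dataFrame: DataFrame): DataFrame = {
  // Pack every input column into one struct that becomes the Avro value.
  val allColumns = struct(dataFrame.columns.map(col): _*)
  dataFrame.select(
    to_avro(col("brand"), keyAvroConfig).as("key"),   // was "key_brand"
    to_avro(allColumns, valueAvroConfig).as("value")
  )
}
```

It would be used the same way as before, for instance aggedDf.transform(writeDfToKafkaColumns(toAvroConfig4Key, toAvroConfig4)), followed by the unchanged write to the kafka format.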
from abris.
Indeed. Many thanks.
from abris.