Comments (7)
Hello,
Could you print d2.schema() and post it here?
What version of the org.apache.spark:spark-avro library is on your classpath?
from abris.
Hi @joserivera1990 Unfortunately, your Avro schema is invalid. Per the Avro spec, "Scale must be [...] less than or equal to the precision.", see https://avro.apache.org/docs/current/spec.html#Decimal, which is not the case here (scale 127 > precision 64).
What happens is that the logical type (decimal) is not validated when the schema is parsed and instead falls back to null. So, Spark doesn't see the Avro logical type decimal and interprets the field as a BinaryType instead of a DecimalType.
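The spec rule can be sketched in a few lines of plain Python (this is only an illustration, not Abris or Avro library code): a decimal logical type is valid only when the precision is positive and the scale is between zero and the precision, so a field with precision 64 and scale 127 is rejected, and Avro parsers then silently ignore the logical type.

```python
def is_valid_avro_decimal(precision: int, scale: int) -> bool:
    # Avro spec: precision must be positive; scale must be zero or positive
    # and less than or equal to the precision.
    return precision > 0 and 0 <= scale <= precision

print(is_valid_avro_decimal(64, 127))  # False: scale exceeds precision
print(is_valid_avro_decimal(38, 10))   # True
```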
To solve the issue, you could
- fix the schema generation process,
- or change the Avro schema on the schema registry,
- or provide a fixed reader schema to Abris using the .provideReaderSchema config method (see https://github.com/AbsaOSS/ABRiS/blob/master/documentation/confluent-avro-documentation.md#avro-to-spark for examples).
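As a sketch of the last option, one could load the field schema from the registry, patch the decimal attributes to values Spark accepts, and pass the resulting JSON string to .provideReaderSchema. The field name below is taken from this thread; the patched precision/scale values are assumptions, and the real schema contains more fields.

```python
import json

# Invalid field schema as reported in the thread (scale 127 > precision 64)
field = {
    "name": "TEST_NUMBER",
    "type": ["null", {
        "type": "bytes",
        "logicalType": "decimal",
        "precision": 64,
        "scale": 127,
    }],
    "default": None,
}

# Patch to values Spark can handle (Spark's DecimalType supports precision <= 38)
decimal_branch = field["type"][1]
decimal_branch["precision"] = 38
decimal_branch["scale"] = 10

reader_field = json.dumps(field)
print(reader_field)
```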
Hi @cerveada,
I added a print of d2.schema():
StructType(StructField(STFAMPRO,StringType,true), StructField(CHFAMPRO,StringType,true), StructField(TEST_NUMBER,BinaryType,true), StructField(TEST_NUMBER_DECIMAL,BinaryType,true), StructField(table,StringType,true), StructField(SCN_CMD,StringType,true), StructField(OP_TYPE_CMD,StringType,true), StructField(op_ts,StringType,true), StructField(current_ts,StringType,true), StructField(row_id,StringType,true), StructField(username,StringType,true))
Checking my external libraries, I have org.apache.spark:spark-avro_2.12:2.4.8 on the classpath.
Regards.
Hi @kevinwallimann,
I got your points. About changing the schema generation process: this schema is generated by Confluent's Oracle CDC connector, see
https://docs.confluent.io/kafka-connect-oracle-cdc/current/troubleshooting.html#numeric-data-type-with-no-precision-or-scale-results-in-unreadable-output
I tested the .provideReaderSchema function, setting precision: 38 and scale: 10 in the schema:
{"name":"TEST_NUMBER","type":["null",{"type":"bytes","scale":10,"precision":38,"connect.version":1,"connect.parameters":{"scale":"10"},"connect.name":"org.apache.kafka.connect.data.Decimal","logicalType":"decimal"}],"default":null}
It throws the following error: Decimal precision 128 exceeds max precision 38
Finally, I think there should be some way to get the number with a scale of 127 and a precision of 64, maybe as a string instead of a decimal. For example, I'm using the connector com.snowflake.kafka.connector.SnowflakeSinkConnector with value.converter set to io.confluent.connect.avro.AvroConverter.
In the Snowflake database, that row is saved this way:
{
"STFAMPRO": "AA",
"TEST_NUMBER": "5.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
"TEST_NUMBER_DECIMAL": "5.1500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
"current_ts": "1651059748133"
}
The problem is that I don't know how Snowflake implements the com.snowflake.kafka.connector.SnowflakeSinkConnector connector.
Thanks for your time!
Hi @joserivera1990 I see, the problem now is that your data has decimal precision 128 which is larger than the maximum that Spark supports (38). In this case, Spark uses the BinaryType as a fallback. You could try to convert the BinaryType to a human-readable format after it's already in a Spark Dataframe. I think this problem should be solved outside of Abris.
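For the conversion step, the raw bytes of an Avro decimal are simply the unscaled value in two's-complement big-endian form, so the decoding could look like the following plain-Python sketch (something one might wrap in a Spark UDF; this is not part of Abris):

```python
from decimal import Decimal

def decode_avro_decimal(raw: bytes, scale: int) -> Decimal:
    # Avro stores a decimal's unscaled value as two's-complement big-endian bytes
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    # Shift the decimal point left by `scale` digits
    return Decimal(unscaled).scaleb(-scale)

# 5.15 with scale 2 has unscaled value 515, encoded as b'\x02\x03'
raw = (515).to_bytes(2, byteorder="big", signed=True)
print(decode_avro_decimal(raw, 2))  # 5.15
```

Python's Decimal handles arbitrary precision, so this works even for scale 127, where a float would not.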
Just for the sake of completeness, there is a way to have your own custom logic to convert from Avro to a Spark Dataframe in Abris, see https://github.com/AbsaOSS/ABRiS#custom-data-conversions. However, it's quite involved and I wouldn't recommend that approach.
Hi @kevinwallimann, I will review your advice. Thanks for your time!
Hi everyone,
This issue was solved in Oracle CDC connector version 2.0.0 by adding the property numeric.mapping with the value best_fit_or_decimal.
https://docs.confluent.io/kafka-connect-oracle-cdc/current/configuration-properties.html
Explanation from the Confluent connector documentation:
"Use best_fit_or_decimal if NUMERIC columns should be cast to Connect's primitive type based upon the column's precision and scale. If the precision and scale exceed the bounds for any primitive type, Connect's DECIMAL logical type will be used instead."
This way, when an Oracle column is NUMERIC without precision or scale, the schema is registered with the field as double. The connector only falls back to the decimal type when the value exceeds what a double can represent.
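A quick illustration of why double is only a "best fit": an IEEE 754 double represents every integer exactly only up to 2^53 (roughly 15-16 decimal digits), after which distinct values collapse onto the same double. The sketch below uses plain Python floats, which are binary64 doubles.

```python
# Doubles (IEEE 754 binary64) represent every integer exactly only up to 2**53.
# Beyond that, distinct values round to the same double, which is why wide
# NUMERIC columns need a decimal logical type instead of a primitive double.
exact_limit = 2 ** 53

print(float(exact_limit) == float(exact_limit + 1))  # True: precision lost
print(float(exact_limit - 1) == float(exact_limit))  # False: still exact
```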
My new schema registry version is:
{"type":"record","name":"ConnectDefault","namespace":"io.confluent.connect.avro","fields":[{"name":"STFAMPRO","type":["null","string"],"default":null},{"name":"CHFAMPRO","type":["null","string"],"default":null},{"name":"TEST_NUMBER","type":["null","double"],"default":null},{"name":"TEST_NUMBER_DECIMAL","type":["null","double"],"default":null},{"name":"table","type":["null","string"],"default":null},{"name":"SCN_CMD","type":["null","string"],"default":null},{"name":"OP_TYPE_CMD","type":["null","string"],"default":null},{"name":"op_ts","type":["null","string"],"default":null},{"name":"current_ts","type":["null","string"],"default":null},{"name":"row_id","type":["null","string"],"default":null},{"name":"username","type":["null","string"],"default":null}]}
I will close this issue. Thanks everyone!