Comments (14)
I think I am mixing up two things here:
- I want a Dataset API.
- This library doesn't seem to work with Structured Streaming, even with the DataFrame API:
Exception in thread "main" org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.Dataset.write(Dataset.scala:3103)
from abris.
Hi Zhen-hao,
Foremost, thank you very much for your message.
Things are definitely getting mixed up here, since working with the structured APIs is exactly what this library is about. :)
Four points:
- Have you tried to run the example presented here?
- Your message says that write cannot be called on a streaming Dataset/DataFrame, which means it is being called after the DataFrame has been returned from the library.
- Can you attach your code?
- Of course we're open to PRs. That would be a pleasure.
Best regards.
Hi Felip,
thanks for your fast response!
Yes, the error was gone when I replaced .save with .start. It seems you can't use save in Structured Streaming.
I think I was confused by Structured Streaming rather than by your library. For example, I didn't see the usual Spark logs after I ran the job, so I wasn't sure whether it had started.
When I add logging and write to the console, sometimes I get the logs and sometimes I don't.
I will look into it further after my holiday and look forward to using this library in our production jobs.
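For reference, the save-vs-start distinction described above can be sketched as follows. This is a sketch, not a runnable job: df stands for an already Avro-encoded DataFrame, and the broker address and topic name are placeholders.

```scala
// Batch DataFrame: DataFrameWriter, terminated by save().
df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("topic", "my-topic")                         // placeholder topic
  .save()

// Streaming DataFrame: DataStreamWriter, terminated by start().
// Calling .write on a streaming DataFrame raises the AnalysisException quoted above.
val query = df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "my-topic")
  .option("checkpointLocation", "/tmp/checkpoint") // required by the Kafka sink
  .start()
query.awaitTermination()
```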
Great to read this, Zhen-hao. Just let me know if you stumble upon any issue.
Hi Felip,
After a deeper look at it, I realized that the DataFrame API is not working for me because after the .toDF operation the serializer picks up each row as bytes.
The generated schema will always be the following, no matter what I put into the Dataset; I tried a case class and org.apache.avro.generic.GenericRecord.
{"subject":"topic_name","version":4,"id":2,"schema":"{\"type\":\"record\",\"name\":\"MyCaseClass\",\"namespace\":\"io.connecterra\",\"fields\":[{\"name\":\"value\",\"type\":[\"bytes\",\"null\"]}]}"}
I think I will have to implement something generic for Dataset[T].
Hey Zhen-hao, thanks for pinging. May I have a look at your code? Luckily I've already stumbled upon something similar.
Essentially, my code is
val output =
  myDS
    .toDF
    .toConfluentAvro(TOPIC, "myTopic", "io.connecterra")(SchemaRegistryConfs)
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)
    .option("topic", TOPIC)
    .option("checkpointLocation", "./checkpoint")
    .start()

output.awaitTermination()
I see that your library relies on the createConverterToAvro function from Databricks. I am not sure I want to build anything on that. I think I will try to use the avro4s library and some type classes to make it work for Dataset[T].
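The avro4s route mentioned above could be sketched like this, assuming avro4s on the classpath; MyCaseClass and its fields are stand-ins, not anything from the thread:

```scala
import com.sksamuel.avro4s.RecordFormat
import org.apache.avro.generic.GenericRecord

case class MyCaseClass(name: String, value: Long)

// avro4s derives the Avro schema and both conversions from the case class
// at compile time, so no createConverterToAvro-style runtime reflection is needed.
val format = RecordFormat[MyCaseClass]

val record: GenericRecord = format.to(MyCaseClass("a", 1L)) // case class -> GenericRecord
val back: MyCaseClass = format.from(record)                 // GenericRecord -> case class
```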
Hi Zhen-hao,
Not sure I got your issue. When you run myDS.toDF, you should be able to retain the original schema of myDS. On the library side, it will only be converted to Avro after toConfluentAvro.
Regarding createConverterToAvro from Databricks, what are the issues you see?
I am using Spark 2.3.
In general, I saw lots of anti-patterns and bad coding styles in the Databricks code, by my standards of course.
Ok, I see. Thanks, then.
My approach would be fairly simple. Because I already get the transformation from Dataset[T] to Dataset[GenericRecord] for free thanks to the avro4s library, all I need is a module that talks to the schema registry and prepends the schema id bytes to the Array[Byte]. I think you already have that code ready for me to use :)
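The "id bytes" mentioned above refer to Confluent's documented wire format: one zero magic byte, a 4-byte big-endian schema id, then the Avro payload. A minimal sketch of that framing in plain Scala follows; withSchemaId and readSchemaId are hypothetical helper names, not the library's actual API.

```scala
import java.nio.ByteBuffer

// Frame an Avro payload in Confluent's wire format:
// [0x00 magic byte][4-byte big-endian schema id][Avro-encoded payload]
def withSchemaId(schemaId: Int, avroPayload: Array[Byte]): Array[Byte] = {
  val buf = ByteBuffer.allocate(1 + 4 + avroPayload.length)
  buf.put(0x0.toByte)  // magic byte
  buf.putInt(schemaId) // schema id from the registry (ByteBuffer is big-endian by default)
  buf.put(avroPayload)
  buf.array()
}

// Recover the schema id from a framed message.
def readSchemaId(message: Array[Byte]): Int = {
  require(message(0) == 0x0, "not a Confluent-framed message")
  ByteBuffer.wrap(message, 1, 4).getInt
}

// Example: frame a dummy 3-byte payload with schema id 2 (the id seen in the schema above).
val framed = withSchemaId(2, Array[Byte](1, 2, 3))
```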
Cool, happy it's solved :)
Closing since it seems to be solved for now.