
Comments (14)

Zhen-hao commented on May 26, 2024

I think I am mixing up two things here.

  1. I want a Dataset API.
  2. This library doesn't seem to work with Structured Streaming, even with the DataFrame API.
Exception in thread "main" org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame;
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.Dataset.write(Dataset.scala:3103)

from abris.

felipemmelo commented on May 26, 2024

Hi Zhen-hao,

Foremost, thank you very much for your message.

Things are definitely getting mixed here, since working with structured APIs is exactly what this library is about. :)

Four points:

  1. Have you tried to run the example presented here?

  2. Your message says that write cannot be called on a streaming Dataset/DataFrame, which means it is being called on the DataFrame after it has been returned from the library.

  3. Can you attach your code?

  4. Of course we're open to PRs. That will be a pleasure.

Best regards.


Zhen-hao commented on May 26, 2024

Hi Felipe,
thanks for your fast response!
Yes, the error went away when I replaced .save with .start; it seems you can't use save in Structured Streaming.
I think I was confused by Structured Streaming itself rather than by your library. For example, I didn't see the normal Spark logs after I ran the job, so I wasn't sure whether it had started or not.
When I add logging and write to the console, I sometimes get the logs and sometimes don't.

I will look into it further after my holiday and look forward to using this library in our production jobs.
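The .save vs .start distinction discussed above can be sketched as follows (a minimal sketch, not runnable on its own: df, the broker address, topic name, and checkpoint path are placeholders, not from this thread):

```scala
// Batch write: uses DataFrameWriter and ends with save().
// Calling this on a streaming DataFrame raises
// "'write' can not be called on streaming Dataset/DataFrame".
df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "my-topic")
  .save()

// Streaming write: uses DataStreamWriter and ends with start(),
// which returns a StreamingQuery that can be awaited.
val query = df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "my-topic")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()

query.awaitTermination()
```

The streaming variant also requires a checkpoint location for the Kafka sink, which the batch write does not.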


felipemmelo commented on May 26, 2024

Great to read this, Zhen-hao. Just let me know if you stumble upon any issue.


Zhen-hao commented on May 26, 2024

Hi Felipe,
After a deeper look at it, I realized that the DataFrame API is not working for me because, after the .toDF operation, the serializer picks up each row as bytes.
The generated schema will always be the following, no matter what I put in the Dataset; I tried a case class and org.apache.avro.generic.GenericRecord.

{"subject":"topic_name","version":4,"id":2,"schema":"{\"type\":\"record\",\"name\":\"MyCaseClass\",\"namespace\":\"io.connecterra\",\"fields\":[{\"name\":\"value\",\"type\":[\"bytes\",\"null\"]}]}"}

I think I will have to implement something generic for Dataset[T].


felipemmelo commented on May 26, 2024

Hey Zhen-hao, thanks for pinging. May I have a look at your code? Luckily I've already stumbled upon something similar.


Zhen-hao commented on May 26, 2024

Essentially, my code is:

val output =
    myDS
      .toDF
      .toConfluentAvro(TOPIC, "myTopic", "io.connecterra")(SchemaRegistryConfs)
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)
      .option("topic", TOPIC)
      .option("checkpointLocation", "./checkpoint")
      .start()

output.awaitTermination()


Zhen-hao commented on May 26, 2024

I see that your library relies on the createConverterToAvro function from Databricks. I'm not sure I want to build anything on top of that. I think I'll try the avro4s library and some type classes to make it work for Dataset[T].


felipemmelo commented on May 26, 2024

Hi Zhen-hao,

Not sure I got your issue. When you run myDS.toDF, the original schema of myDS should be retained. On the library side, the data is only converted to Avro by toConfluentAvro.

Regarding createConverterToAvro from Databricks, what are the issues you see?


Zhen-hao commented on May 26, 2024

I am using Spark 2.3.
In general, I've seen lots of anti-patterns and bad coding style in the Databricks code, by my standards, of course.


felipemmelo commented on May 26, 2024

OK, I see. Thanks, then.


Zhen-hao commented on May 26, 2024

My approach would be fairly simple. Because I already get the transformation from Dataset[T] to Dataset[GenericRecord] for free thanks to the avro4s library, all I need is a module that talks to the Schema Registry and adds the schema id bytes to the Array[Byte]. I think you already have that code ready for me to use :)
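For context, "adding the id bytes" refers to Confluent's wire format: each Kafka message body is a 5-byte header (one magic byte 0x00, then the schema id as a 4-byte big-endian integer) followed by the Avro-encoded payload. A minimal sketch of that framing step (the helper name is my own, not an ABRiS API):

```scala
import java.nio.ByteBuffer

// Prepend the Confluent wire-format header (magic byte 0x00 +
// 4-byte big-endian schema id) to an Avro-encoded payload.
def withConfluentHeader(schemaId: Int, avroPayload: Array[Byte]): Array[Byte] = {
  val buf = ByteBuffer.allocate(1 + 4 + avroPayload.length)
  buf.put(0.toByte)    // magic byte, always 0
  buf.putInt(schemaId) // schema id as assigned by the Schema Registry
  buf.put(avroPayload) // the Avro binary body
  buf.array()
}
```

A consumer reverses this by reading the first 5 bytes, looking up the schema by id in the registry, and decoding the rest as Avro.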


felipemmelo commented on May 26, 2024

Cool, happy it's solved :)


felipemmelo commented on May 26, 2024

Closing since it seems to be solved for now.

