Comments (5)

Leomrlin commented on September 22, 2024

@qingfei1994
Hi there,
Thank you for highlighting this important point. The current limitation of having only StringDeserializer indeed constrains the flexibility in processing data from Kafka topics. Directly supporting JSON deserialization in the Kafka source would be a valuable enhancement.

The use case you mentioned, where strings need to be parsed to identify vertices or edges, is quite common, and having structured data upstream could greatly simplify the processing workflow. It would make data ingestion more robust, and it fits with the wide adoption of JSON as a data interchange format across systems and services.

If you're interested in contributing this feature, I believe it would be a very welcome improvement. Should you have any design ideas, feel free to outline the design and implementation details, and engage with the community to discuss and refine the concept.

Once again, thank you for your initiative and for contributing to the TuGraph project. If you need any further information or have additional questions, please do not hesitate to reach out.

Looking forward to your proposal!
Best wishes!

from tugraph-analytics.

qingfei1994 commented on September 22, 2024

Thanks! @Leomrlin
My rough idea is to add some options to the Kafka source.

1. "geaflow.dsl.kafka.format": it can be set to text or json. If it is set to "json", each Kafka message will be deserialized as JSON, and the fetch function of KafkaSourceTable will return a collection of Row objects according to the table schema.

@Override
public <T> FetchData<T> fetch(Partition partition, Optional<Offset> startOffset,
                              long newWindowSize) throws IOException {

Errors may occur when deserializing JSON, so more options are needed, such as 'geaflow.dsl.kafka.format.json.fail-on-missing-field' (true/false) and 'geaflow.dsl.kafka.format.json.ignore-parse-error' (true/false).

2. In KafkaSourceTable, getDeserializer() will return a RowTableDeserializer for the "json" case, so that it can return a Row.

@Override
public <IN> TableDeserializer<IN> getDeserializer(Configuration conf) {
    // For the "json" format, return a RowTableDeserializer here instead.
    return (TableDeserializer<IN>) new TextDeserializer();
}
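
To make the two error-handling options concrete, here is a minimal sketch of how they might behave. Everything in it is an assumption for discussion: the class name JsonRowSketch, the Object[] row representation, and the pluggable parser function (a stand-in for a real JSON library) are hypothetical and are not part of the GeaFlow API.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of the proposed options; not the GeaFlow TableDeserializer API. */
class JsonRowSketch {
    private final List<String> fieldNames;   // table schema, in column order
    private final boolean failOnMissingField;
    private final boolean ignoreParseError;
    // Stand-in for a real JSON library call; expected to throw on malformed input.
    private final Function<String, Map<String, Object>> parser;

    JsonRowSketch(List<String> fieldNames, boolean failOnMissingField,
                  boolean ignoreParseError,
                  Function<String, Map<String, Object>> parser) {
        this.fieldNames = fieldNames;
        this.failOnMissingField = failOnMissingField;
        this.ignoreParseError = ignoreParseError;
        this.parser = parser;
    }

    /** Returns zero or one row for a single message value. */
    List<Object[]> deserialize(String message) {
        Map<String, Object> json;
        try {
            json = parser.apply(message);
        } catch (RuntimeException e) {
            if (ignoreParseError) {
                return Collections.emptyList(); // skip the malformed record
            }
            throw new IllegalArgumentException("Cannot parse JSON message: " + message, e);
        }
        Object[] row = new Object[fieldNames.size()];
        for (int i = 0; i < row.length; i++) {
            String name = fieldNames.get(i);
            if (json.containsKey(name)) {
                row[i] = json.get(name);
            } else if (failOnMissingField) {
                throw new IllegalArgumentException("Missing field: " + name);
            }
            // Otherwise the column stays null.
        }
        return Collections.singletonList(row);
    }
}
```

In this sketch, ignore-parse-error only covers failures raised by the parser, while a missing field is handled separately by fail-on-missing-field. Whether the two options should be allowed together is still open.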

Leomrlin commented on September 22, 2024

@qingfei1994
Hello,

Your proposal for enhancing the JSON deserialization capabilities within our Kafka source is superb! I see it as not just a solution to the immediate needs but also as laying the groundwork for a more robust deserialization framework.

Our existing system's TableDeserializer interface is perfectly suited to integrate different parsers, similar to the current TextDeserializer. By incorporating the JSON deserializer at this level, we ensure that it can be utilized across different connectors within the TuGraph project, not just limited to Kafka.
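
As a rough illustration of sharing parsers across connectors, the sketch below keeps a registry keyed by format name. The FormatRegistry class and the use of Function<String, Object> as a stand-in for a deserializer are assumptions made for this example; a real version would hand out TableDeserializer instances instead.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical registry; Function<String, Object> stands in for a deserializer. */
class FormatRegistry {
    private final Map<String, Supplier<Function<String, Object>>> factories = new HashMap<>();

    /** Registers a factory under a case-insensitive format name. */
    void register(String format, Supplier<Function<String, Object>> factory) {
        factories.put(format.toLowerCase(Locale.ROOT), factory);
    }

    /** Creates a new deserializer for the format, or fails if none is registered. */
    Function<String, Object> create(String format) {
        Supplier<Function<String, Object>> factory =
            factories.get(format.toLowerCase(Locale.ROOT));
        if (factory == null) {
            throw new IllegalArgumentException("Unsupported format: " + format);
        }
        return factory.get();
    }
}
```

With something like this, any connector could ask for the "json" or "text" deserializer by name without knowing how it is built.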

Regarding the configuration options, I concur that the deserializer may have numerous parameters in the future that could further define its functionalities. However, I suggest that adjusting the configuration to a more general level, such as geaflow.dsl.connector.format.json, could be a strategic approach. This would enable future JSON parsers to maintain configuration consistency across different connectors and simplify the management of settings related to JSON deserialization. Perhaps in the future, we can provide a uniform and flexible JSON deserialization error handling strategy for the entire TuGraph system.

Once again, thank you for your proactive attitude and for contributing such thoughtful ideas to the TuGraph project. We are very much looking forward to your detailed design and subsequent implementation. Should you need any assistance or have further questions as you develop this feature, please do not hesitate to reach out to the community.

qingfei1994 commented on September 22, 2024

Thanks @Leomrlin!
As you said, there may be numerous parameters to further define the deserialization configuration. So my idea is that the configuration for the JSON format itself could sit at a more general level, since other systems like Pulsar may also need a JSON deserializer configuration.
Something like this:

geaflow.dsl.connector.format: json/text
geaflow.dsl.connector.format.json.ignore-parse-error: true/false
geaflow.dsl.connector.format.json.fail-on-missing-field: true/false
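
A small sketch of how these keys might be read and validated is below. The ConnectorFormatOptions class is hypothetical, and the defaults ("text" for the format, false for both flags) are my assumptions, chosen so that existing jobs would keep today's behavior.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical holder for the proposed options; the defaults are assumptions. */
class ConnectorFormatOptions {
    static final String FORMAT = "geaflow.dsl.connector.format";
    static final String IGNORE_PARSE_ERROR =
        "geaflow.dsl.connector.format.json.ignore-parse-error";
    static final String FAIL_ON_MISSING_FIELD =
        "geaflow.dsl.connector.format.json.fail-on-missing-field";

    final String format;
    final boolean ignoreParseError;
    final boolean failOnMissingField;

    private ConnectorFormatOptions(String format, boolean ignoreParseError,
                                   boolean failOnMissingField) {
        this.format = format;
        this.ignoreParseError = ignoreParseError;
        this.failOnMissingField = failOnMissingField;
    }

    /** Reads the options from a flat key/value configuration. */
    static ConnectorFormatOptions from(Map<String, String> conf) {
        String format = conf.getOrDefault(FORMAT, "text").trim().toLowerCase(Locale.ROOT);
        if (!format.equals("json") && !format.equals("text")) {
            throw new IllegalArgumentException(FORMAT + " must be json or text, got: " + format);
        }
        return new ConnectorFormatOptions(format,
            readBoolean(conf, IGNORE_PARSE_ERROR, false),
            readBoolean(conf, FAIL_ON_MISSING_FIELD, false));
    }

    // Accepts only "true" or "false" so that typos fail fast instead of silently becoming false.
    private static boolean readBoolean(Map<String, String> conf, String key,
                                       boolean defaultValue) {
        String raw = conf.get(key);
        if (raw == null) {
            return defaultValue;
        }
        String value = raw.trim().toLowerCase(Locale.ROOT);
        if (value.equals("true")) {
            return true;
        }
        if (value.equals("false")) {
            return false;
        }
        throw new IllegalArgumentException(key + " must be true or false, got: " + raw);
    }
}
```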

What do you think?

Leomrlin commented on September 22, 2024

Good!

I fully agree with moving the configuration to a more general level, under a prefix such as geaflow.dsl.connector.format.json. This would ensure that future JSON parsers can maintain consistent configuration across different connectors and make it easier to manage settings related to JSON deserialization.

Such a design would allow us to integrate these configuration options into each of the connectors within TuGraph, not just Kafka or Pulsar.

I support your continued efforts to advance this proposal. If you need more feedback or encounter challenges in implementing these features, please do not hesitate to share with us immediately. The TuGraph community is always eager to help and move forward together.

Looking forward to seeing your further progress!
