
Comments (11)

tsreaper commented on June 7, 2024

Hi! Thanks for reporting this issue. However, I'd like to discuss your questions a bit more.

In Q1 you mentioned that

when the snapshot corresponding to the timestamp expires, an error will be reported

What error is reported under what condition? Could you explain what you're doing in detail when the error occurs? Could you add a complete exception stack in the issue description?

You also mentioned that

scan.timestamp-millis > latest snapshot timestamp, then scan.mode = 'latest'

As far as I remember, our current implementation has exactly the same effect. Is the actual behavior different? If yes, what is the actual behavior?

In Q2 you said that

The parallelism is greater than the number of buckets during stream read, and the excess slots are not used

This is expected for Flink sources. Users should either decrease the source parallelism or change bucket number to a larger value. Or are you suggesting something different? What would you like to suggest for this scenario?
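The idle-slot behavior described above can be sketched as follows. This is an illustrative model only, not Paimon's actual assignment code: if splits are keyed by bucket alone, any subtask index beyond the bucket count never receives a split.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch (not Paimon's actual code): when splits are keyed
// by bucket only, a source with more subtasks than buckets leaves the
// extra subtasks idle.
public class BucketParallelism {
    /** Count how many of numSubtasks receive at least one bucket. */
    static int busySubtasks(int numBuckets, int numSubtasks) {
        Set<Integer> busy = new HashSet<>();
        for (int bucket = 0; bucket < numBuckets; bucket++) {
            busy.add(bucket % numSubtasks); // bucket-only assignment
        }
        return busy.size();
    }
}
```

With 4 buckets and parallelism 8, only 4 subtasks are busy; hence the advice to lower the source parallelism or raise the bucket number.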

from paimon.

wxplovecc commented on June 7, 2024

Hi @tsreaper. In Q1, the user starts a streaming read with a fixed timestamp in the SQL, but if the SQL task restarts, that is when the error occurs.
In Q2, this is expected for a non-partitioned table, but for a partitioned table the stream read should perhaps shuffle by partition.


FangYongs commented on June 7, 2024

Thanks @wxplovecc. Which version are you using? Can you give the exception stack for Q1?
If I understand correctly, you submit a streaming job with a paimon source which reads incremental data. If the source tasks of the job restarted due to a failover, I think they will be restored with the snapshot and splits from state instead of the fixed timestamp.

Of course, jobs may use in-memory state, and source tasks will read from the fixed timestamp after restarting. But as @tsreaper mentioned above, the source subtask should read the specific snapshot for that timestamp from paimon without an exception.


wxplovecc commented on June 7, 2024

@FangYongs We are using the master branch. You are right, we restarted the job without a checkpoint. Do you think we should offer some strategy for this scenario (restart without state)?


FangYongs commented on June 7, 2024

@wxplovecc What strategy do you have in mind for restarting without state? Could you describe it in detail? If a job has no state, I think it will calculate the snapshot id from the fixed timestamp without throwing an exception after restart.


wxplovecc commented on June 7, 2024

@FangYongs Maybe something like this, when the fixed timestamp is invalid:
scan.timestamp-millis < earliest snapshot timestamp, then read from EARLIEST
scan.timestamp-millis > latest snapshot timestamp, then scan.mode = 'latest'

Or should we introduce a consumer group like Kafka's, maintaining consumption offsets independently?


FangYongs commented on June 7, 2024

As @tsreaper mentioned above, I think the current strategy of scan.timestamp-millis in paimon is just as you described:

scan.timestamp-millis < earliest snapshot timestamp, then read from EARLIEST
scan.timestamp-millis > latest snapshot timestamp, then scan.mode = 'latest'

If it doesn't behave as above, I think it's a bug.
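The fallback strategy discussed above can be sketched in a few lines. This is a hypothetical illustration; the class and method names are not Paimon's actual API, and it only models the clamping behavior being proposed.

```java
// Hypothetical sketch of the fallback discussed above; names are
// illustrative, not Paimon's actual API.
public class TimestampFallback {
    /**
     * Pick an effective start timestamp for a streaming read.
     * If the requested timestamp is older than the earliest retained
     * snapshot, fall back to the earliest; if newer than the latest
     * snapshot, fall back to the latest (like scan.mode = 'latest').
     */
    static long effectiveStartMillis(long requested, long earliest, long latest) {
        if (requested < earliest) {
            return earliest; // read from EARLIEST
        }
        if (requested > latest) {
            return latest;   // behave like scan.mode = 'latest'
        }
        return requested;
    }
}
```

With this clamping, a restart without state never resolves a fixed timestamp to an expired snapshot.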


tsreaper commented on June 7, 2024

it's expected for a non-partitioned table, but for a partitioned table the stream read should perhaps shuffle by partition

Paimon Flink connector already supports this feature. If you set sink.partition-shuffle to true, then the records will be shuffled both by bucket and by partition. See https://paimon.apache.org/docs/master/maintenance/configurations/#flinkconnectoroptions for more details.

@JingsongLi maybe we should remove this option and always shuffle by both bucket and partition. I don't see any disadvantage if we do so.


wxplovecc commented on June 7, 2024

@FangYongs I think so. It is not in the genPlan phase. Below is the stack:
[screenshot of the exception stack]

The root cause was that the snapshot expired while the streaming read task was starting up.


JingsongLi commented on June 7, 2024

it's expected for a non-partitioned table, but for a partitioned table the stream read should perhaps shuffle by partition

Paimon Flink connector already supports this feature. If you set sink.partition-shuffle to true, then the records will be shuffled both by bucket and by partition. See https://paimon.apache.org/docs/master/maintenance/configurations/#flinkconnectoroptions for more details.

@JingsongLi maybe we should remove this option and always shuffle by both bucket and partition. I don't see any disadvantage if we do so.

The sink already supports it, but the source does not.
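For the source side, the idea being requested can be sketched as hashing splits by both partition and bucket rather than bucket alone, so that readers beyond the bucket count still get work on a partitioned table. This is an illustrative sketch under that assumption; the names are hypothetical, not Paimon's internal API.

```java
import java.util.Objects;

// Illustrative sketch only: assign a source split to a reader subtask
// by hashing both partition and bucket, rather than bucket alone.
// Names are hypothetical, not Paimon's internal API.
public class SplitAssigner {
    /** Assign a (partition, bucket) pair to one of numSubtasks readers. */
    static int assignSubtask(String partition, int bucket, int numSubtasks) {
        int hash = Objects.hash(partition, bucket);
        return Math.floorMod(hash, numSubtasks); // floorMod keeps it non-negative
    }
}
```

Because different partitions of the same bucket now hash to different subtasks, a partitioned-table stream read can use parallelism greater than the bucket count.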


siyangzeng commented on June 7, 2024

done

