Comments (11)
Hi! Thanks for reporting this issue. However, I'd like to discuss your questions a bit more.
In Q1 you mentioned that
when the snapshot corresponding to the timestamp expires, an error will be reported
What error is reported under what condition? Could you explain what you're doing in detail when the error occurs? Could you add a complete exception stack in the issue description?
You also mentioned that
scan.timestamp-millis > snapshot latest timestamp, then scans-mode = 'latest'
As far as I remember, our current implementation has exactly the same effect. Is the actual behavior different? If yes, what is the actual behavior?
In Q2 you said that
The parallelism is greater than the number of buckets during stream read, and the excess slots are not used
This is expected for Flink sources. Users should either decrease the source parallelism or increase the number of buckets. Or are you suggesting something different? What would you like to suggest for this scenario?
from paimon.
Hi @tsreaper, in Q1 the user starts a streaming read with a fixed timestamp in SQL, but if the SQL task restarts, the error occurs.
In Q2, this is expected for a non-partitioned table, but for a partitioned table, maybe the streaming read should shuffle by partition.
from paimon.
Thanks @wxplovecc. Which version are you using? Can you provide the exception stack for Q1?
If I understand correctly, you submitted a streaming job with a Paimon source that reads incremental data. If the job's source tasks restart due to failover, I think they will be restored with the snapshot and splits from state instead of the fixed timestamp.
Of course, jobs may use memory state, and the source tasks will then read from the fixed timestamp after restarting. But as @tsreaper mentioned above, the source subtask should read the specific snapshot that matches the timestamp from Paimon without an exception.
from paimon.
@FangYongs We are using the master branch. You are right, we restart the job without a checkpoint. Do you think we should offer some strategy for this scenario (restart without state)?
from paimon.
@wxplovecc What strategy do you mean for restarting without state? Could you describe it in detail? If a job has no state, I think it will calculate the snapshot id from the fixed timestamp after restart, without throwing an exception.
from paimon.
@FangYongs Maybe like this, when the fixed timestamp is invalid:
scan.timestamp-millis < snapshot earliest timestamp, then read from EARLIEST
scan.timestamp-millis > snapshot latest timestamp, then scans-mode = 'latest'
Or should we introduce consumer groups like Kafka, with independently maintained consumption offsets?
from paimon.
As @tsreaper mentioned above, I think the current strategy of scan.timestamp-millis in Paimon is just as you described:
scan.timestamp-millis < snapshot earliest timestamp, then read from EARLIEST
scan.timestamp-millis > snapshot latest timestamp, then scans-mode = 'latest'
If it doesn't act as above, I think it's a bug
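The fallback rule described above can be sketched in a few lines. This is a language-neutral illustration in Python, not Paimon's actual implementation: the `Snapshot` type and `resolve_start` function are hypothetical names, snapshot ids are assumed to be contiguous, and the choice of "first snapshot committed at or after the timestamp" for the in-range case is an assumption.

```python
from bisect import bisect_left
from typing import NamedTuple, Sequence


class Snapshot(NamedTuple):
    """Hypothetical stand-in for a table snapshot."""
    snapshot_id: int
    commit_millis: int


def resolve_start(snapshots: Sequence[Snapshot], timestamp_millis: int) -> int:
    """Return the id of the first snapshot a streaming read should consume.

    `snapshots` must be sorted by commit time (oldest first).
    """
    if not snapshots:
        raise ValueError("table has no snapshots")

    earliest, latest = snapshots[0], snapshots[-1]

    # Timestamp is older than every retained snapshot: read from the earliest.
    if timestamp_millis < earliest.commit_millis:
        return earliest.snapshot_id

    # Timestamp is newer than every snapshot: behave like 'latest' and
    # wait for the next snapshot (assumes contiguous snapshot ids).
    if timestamp_millis > latest.commit_millis:
        return latest.snapshot_id + 1

    # Otherwise pick the first snapshot committed at or after the timestamp.
    commit_times = [s.commit_millis for s in snapshots]
    return snapshots[bisect_left(commit_times, timestamp_millis)].snapshot_id
```

With this shape, a timestamp that has fallen out of the retained range never raises; it is clamped to one end of the snapshot list.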
from paimon.
In Q2, this is expected for a non-partitioned table, but for a partitioned table, maybe the streaming read should shuffle by partition.
Paimon Flink connector already supports this feature. If you set sink.partition-shuffle to true, then the records will be shuffled both by bucket and by partition. See https://paimon.apache.org/docs/master/maintenance/configurations/#flinkconnectoroptions for more detail.
@JingsongLi maybe we should remove this option and always shuffle by both bucket and partition. I don't see any disadvantage if we do so.
from paimon.
@FangYongs I think so. It is not in the genPlan phase. Below is the stack.
The root cause was snapshot expiration happening while the streaming read task was starting up.
from paimon.
In Q2, this is expected for a non-partitioned table, but for a partitioned table, maybe the streaming read should shuffle by partition.
Paimon Flink connector already supports this feature. If you set sink.partition-shuffle to true, then the records will be shuffled both by bucket and by partition. See https://paimon.apache.org/docs/master/maintenance/configurations/#flinkconnectoroptions for more detail. @JingsongLi maybe we should remove this option and always shuffle by both bucket and partition. I don't see any disadvantage if we do so.
The sink already supports it, but the source does not.
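To illustrate the difference for the source side, here is a rough sketch in Python. It is not the connector's real assignment logic: the function names and the modulo and round-robin schemes are assumptions chosen only to show why keying on the bucket alone leaves subtasks idle when parallelism exceeds the bucket count.

```python
from itertools import product
from typing import Sequence, Set


def busy_by_bucket(partitions: Sequence[str], num_buckets: int, parallelism: int) -> Set[int]:
    """Subtasks that receive data when splits are keyed by bucket only."""
    return {
        bucket % parallelism
        for _, bucket in product(partitions, range(num_buckets))
    }


def busy_by_partition_and_bucket(
    partitions: Sequence[str], num_buckets: int, parallelism: int
) -> Set[int]:
    """Subtasks that receive data when each (partition, bucket) pair is its own key.

    Pairs are spread round-robin in sorted order, a deliberately simple scheme.
    """
    keys = sorted(product(partitions, range(num_buckets)))
    return {index % parallelism for index, _ in enumerate(keys)}


partitions = ["dt=01", "dt=02", "dt=03"]
print(busy_by_bucket(partitions, num_buckets=2, parallelism=6))
print(busy_by_partition_and_bucket(partitions, num_buckets=2, parallelism=6))
```

With 3 partitions, 2 buckets, and a parallelism of 6, the bucket-only scheme keeps just 2 subtasks busy, while the (partition, bucket) scheme can occupy all 6.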
from paimon.
done
from paimon.