Comments (2)
I tried to reproduce exactly the scenario you described, and initially had the following files:
content | file_path | file_format | record_count | file_size_in_bytes | column_sizes >
---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------+--------------------+---------------------->
0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-27602614-48a2-46eb-9baa-b6db886c2e23.parquet | PARQUET | 874421 | 1752503 | {1=889537, 2=862660} >
0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-3c85df99-c2ab-47d5-9253-e330fcb9c711.parquet | PARQUET | 2262365 | 4518053 | {1=2301435, 2=2216305>
0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-fc9ace1d-b5c5-4fc1-83f8-0f41ecdf966c.parquet | PARQUET | 4015604 | 8040134 | {1=4084631, 2=3955190>
0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-252522f7-225d-47ab-b119-8f9ccc692251.parquet | PARQUET | 6275254 | 12568807 | {1=6382707, 2=6185785>
0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-cb981908-7b63-4a96-a209-6882586fb0b2.parquet | PARQUET | 9906127 | 19865711 | {1=10076133, 2=978926>
0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-0bdc5507-edec-4ebd-8e05-6a4f5c09e142.parquet | PARQUET | 16666230 | 33443003 | {1=16946156, 2=164965>
(6 rows)
After running optimize, however, there was only one file:
trino> SELECT * from iceberg.default."t1$files";
content | file_path | file_format | record_count | file_size_in_bytes | column_sizes >
---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------+--------------------+---------------------->
0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081901_00002_t8p79-03481c56-03ef-46da-9482-0bdc1c8f7eb4.parquet | PARQUET | 40000001 | 80273843 | {1=40660676, 2=396128>
(1 row)
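A quick way to confirm that the rewrite only compacted files and did not drop rows is to aggregate the same metadata table before and after optimize. This is a minimal sketch that uses only the `record_count` and `file_size_in_bytes` columns shown in the output above; the column aliases are illustrative names, not part of the original thread.

```sql
-- Summarize the data files currently tracked for iceberg.default.t1.
-- Run once before and once after optimize and compare the results:
-- file_count should shrink while total_records stays the same.
SELECT
    count(*)                AS file_count,
    sum(record_count)       AS total_records,
    sum(file_size_in_bytes) AS total_bytes
FROM iceberg.default."t1$files";
```

For the listings above, the six original `record_count` values add up to 40000001, which matches the `record_count` of the single file left after optimize.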
from trino.
After trying some things, I think it might be related to the number of workers in the cluster. When I scaled down to a single worker, I was able to optimize down to one file, but with 3 workers I could only get it down to 3 files.
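To check this on a given cluster, one option is to run optimize and count the resulting data files at each cluster size. The sketch below assumes the same `iceberg.default.t1` table as in the earlier comment; the link between worker count and output file count is only the hypothesis stated above, not confirmed behavior.

```sql
-- Compact the table's data files with the Iceberg connector's optimize procedure.
ALTER TABLE iceberg.default.t1 EXECUTE optimize;

-- Count the data files left after compaction.
-- Repeat with 1 worker and with 3 workers to compare the results.
SELECT count(*) AS file_count
FROM iceberg.default."t1$files";
```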