Cluster: 1 coordinator, 3 workers. Trino version: 4

Many small data files created for a table, unable to optimize the number of data files

jhatcher1 commented on August 16, 2024

Many small data files created for a table, unable to optimize the number of data files


Comments (2)

findinpath commented on August 16, 2024

I tried to reproduce the exact scenario you described and initially had the following files:

content |                                                                            file_path                                                                            | file_format | record_count | file_size_in_bytes |       column_sizes   >
---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------+--------------------+---------------------->
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-27602614-48a2-46eb-9baa-b6db886c2e23.parquet | PARQUET     |       874421 |            1752503 | {1=889537, 2=862660} >
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-3c85df99-c2ab-47d5-9253-e330fcb9c711.parquet | PARQUET     |      2262365 |            4518053 | {1=2301435, 2=2216305>
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-fc9ace1d-b5c5-4fc1-83f8-0f41ecdf966c.parquet | PARQUET     |      4015604 |            8040134 | {1=4084631, 2=3955190>
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-252522f7-225d-47ab-b119-8f9ccc692251.parquet | PARQUET     |      6275254 |           12568807 | {1=6382707, 2=6185785>
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-cb981908-7b63-4a96-a209-6882586fb0b2.parquet | PARQUET     |      9906127 |           19865711 | {1=10076133, 2=978926>
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-0bdc5507-edec-4ebd-8e05-6a4f5c09e142.parquet | PARQUET     |     16666230 |           33443003 | {1=16946156, 2=164965>
(6 rows)

After running optimize, however, there was only one file:

trino> SELECT * from iceberg.default."t1$files";
 content |                                                                            file_path                                                                            | file_format | record_count | file_size_in_bytes |       column_sizes   >
---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------+--------------------+---------------------->
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081901_00002_t8p79-03481c56-03ef-46da-9482-0bdc1c8f7eb4.parquet | PARQUET     |     40000001 |           80273843 | {1=40660676, 2=396128>
(1 row)
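
For reference, compaction in the Trino Iceberg connector is triggered with the `optimize` table procedure. The exact command used for the run above is not shown in this thread, so the statements below are only a sketch, assuming the same `iceberg.default.t1` table; the `128MB` threshold is an illustrative value.

```sql
-- Rewrite the table's data files into fewer, larger files.
ALTER TABLE iceberg.default.t1 EXECUTE optimize;

-- Variant: only files smaller than the given threshold are rewritten
-- (128MB is an illustrative value, not one taken from this thread).
ALTER TABLE iceberg.default.t1 EXECUTE optimize(file_size_threshold => '128MB');
```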


jhatcher1 commented on August 16, 2024

After trying some things, I think it might be related to the number of workers in the cluster. When I scaled down to a single worker, I was able to optimize down to one file, but with 3 workers I could only get it down to 3 files.
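
One way to compare optimize runs across cluster sizes is to summarize the `$files` metadata table before and after each run. This is a minimal sketch, assuming a table named `iceberg.default.t1` as in the earlier comment (substitute your own table); the column aliases are made up for illustration.

```sql
-- Summarize the current data files of the table.
SELECT
    count(*)                AS data_file_count,
    sum(record_count)       AS total_records,
    sum(file_size_in_bytes) AS total_bytes
FROM iceberg.default."t1$files";
```

If compaction only merges files, `total_records` should stay the same while `data_file_count` drops. In the listing from the earlier comment, the six original files add up to 40000001 records, which matches the single file produced by optimize.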
