duckdb_iceberg's Introduction

DuckDB

DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs, maps), and several extensions designed to make SQL easier to use.

DuckDB is available as a standalone CLI application and has clients for Python, R, Java, Wasm, etc., with deep integrations with packages such as pandas and dplyr.

For more information on using DuckDB, please refer to the DuckDB documentation.

Installation

If you want to install DuckDB, please see our installation page for instructions.

Data Import

For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:

SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';
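
The same one-liners work from the Python client mentioned above. A minimal sketch (the file names are placeholders):

import duckdb

# Query a CSV or Parquet file directly and pull the result into a pandas DataFrame.
print(duckdb.sql("SELECT * FROM 'myfile.csv'").df())
print(duckdb.sql("SELECT * FROM 'myfile.parquet'").df())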

Refer to our Data Import section for more information.

SQL Reference

The documentation contains a SQL introduction and reference.

Development

For development, DuckDB requires CMake, Python3 and a C++11 compliant compiler. Run make in the root directory to compile the sources. For development, use make debug to build a non-optimized debug version. You should run make unit and make allunit to verify that your version works properly after making changes. To test performance, you can run BUILD_BENCHMARK=1 BUILD_TPCH=1 make and then perform several standard benchmarks from the root directory by executing ./build/release/benchmark/benchmark_runner. The details of benchmarks are in our Benchmark Guide.

Please also refer to our Build Guide and Contribution Guide.

Support

See the Support Options page.

duckdb_iceberg's People

Contributors

carlopi, dpgrev, fokko, mytherin, rustyconover, samansmink, szarnyasg, teaguesterling, tishj

duckdb_iceberg's Issues

Unable to build project using vcpkg

I am trying to build the project using vcpkg (first time, so I'm probably missing something), but I don't get past the avrocpp_s stage.
Any idea what I might be doing wrong? I also tried vcpkg install in both the root and the vcpkg_ports/avro-cpp directory.

ramon@ramon-ThinkPad-X1-Carbon-Gen-9~/Desktop/PROJECTS/duckdb_iceberg: VCPKG_TOOLCHAIN_PATH='/home/ramon/Desktop/PROJECTS/duckdb_iceberg/vcpkg/scripts/buildsystems/vcpkg.cmake' make
mkdir -p build/release && \
cmake  -DEXTENSION_STATIC_BUILD=1 -DDUCKDB_EXTENSION_CONFIGS='/home/ramon/Desktop/PROJECTS/duckdb_iceberg/extension_config.cmake'  -DOSX_BUILD_ARCH=  -DVCPKG_MANIFEST_DIR='/home/ramon/Desktop/PROJECTS/duckdb_iceberg/' -DVCPKG_BUILD=1 -DCMAKE_TOOLCHAIN_FILE='/home/ramon/Desktop/PROJECTS/duckdb_iceberg/vcpkg/scripts/buildsystems/vcpkg.cmake' -DDUCKDB_EXPLICIT_PLATFORM=''  -DCMAKE_BUILD_TYPE=Release -S ./duckdb/ -B build/release && \
cmake --build build/release --config Release
-- git hash 1f98600c2c, version v1.0.0, extension folder v1.0.0
-- Extensions will be deployed to: /home/ramon/Desktop/PROJECTS/duckdb_iceberg/build/release/repository
-- Load extension 'iceberg' from '/home/ramon/Desktop/PROJECTS/duckdb_iceberg' @ 3f6d753
-- Load extension 'httpfs' from '/home/ramon/Desktop/PROJECTS/duckdb_iceberg/duckdb/extensions' @ 1f98600c2c
-- Load extension 'parquet' from '/home/ramon/Desktop/PROJECTS/duckdb_iceberg/duckdb/extensions' @ 1f98600c2c
-- Load extension 'jemalloc' from '/home/ramon/Desktop/PROJECTS/duckdb_iceberg/duckdb/extensions' @ 1f98600c2c
CMake Error at /home/ramon/Desktop/PROJECTS/duckdb_iceberg/CMakeLists.txt:32 (find_library):
  Could not find AVROCPP_LIBRARY_DEBUG using the following names: avrocpp_s


-- Configuring incomplete, errors occurred!
See also "/home/ramon/Desktop/PROJECTS/duckdb_iceberg/build/release/CMakeFiles/CMakeOutput.log".
make: *** [extension-ci-tools/makefiles/duckdb_extension.Makefile:56: release] Error 1

ORC Files

Iceberg tables can use ORC files for storage; this extension should detect that case and report a clear error.

Can't get AWS Athena Iceberg metadata because there's no `metadata/version-hint.text` file

I went ahead and gave this still-very-early extension a try, using my own Iceberg data set that I built with AWS Athena. No dice; I get this:

D SELECT *  FROM iceberg_metadata('s3://my-bucket/path/to/table/folder');
Error: Invalid Error: HTTP Error: Unable to connect to URL "https://my-bucket.s3.amazonaws.com/path/to/table/folder/metadata/version-hint.text": 404 (Not Found)

Looking at the current source code I see that it comes from here.

Searching the web a bit for relevant info I find this passage in this article:

Iceberg catalog

Within the catalog, there is a reference or pointer for each table to that table’s current metadata file. For example, in the diagram shown above, there are 2 metadata files. The value for the table’s current metadata pointer in the catalog is the location of the metadata file on the right.

What this data looks like is dependent on what Iceberg catalog is being used. A few examples:

  • With HDFS as the catalog, there’s a file called version-hint.text in the table’s metadata folder whose contents is the version number of the current metadata file.
  • With Hive metastore as the catalog, the table entry in the metastore has a table property which stores the location of the current metadata file.
  • With Nessie as the catalog, Nessie stores the location of the current metadata file for the table.

Obviously Athena is somehow storing this metadata in the AWS Glue Catalog (perhaps accessible with the same protocol as a Hive metastore?), and the extension doesn't integrate that yet.
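
As a possible stopgap (not a fix for the missing catalog integration), the extension can be pointed at a specific metadata JSON file instead of relying on version-hint.text, as shown in other issues here. A minimal sketch, assuming httpfs and S3 credentials are already configured, with a placeholder file name that you would copy from the Glue table's metadata_location property:

import duckdb

duckdb.sql("INSTALL iceberg; LOAD iceberg;")
# The metadata file name below is hypothetical; use the value of the table's
# 'metadata_location' property from the Glue/Athena console.
duckdb.sql("""
    SELECT *
    FROM iceberg_scan('s3://my-bucket/path/to/table/folder/metadata/00001-abc.metadata.json')
""").show()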

I do realize this is an extremely early stage for the extension, just holler in this ticket when you think it might work (or just if you're puzzled whether it might) and I'm happy to test this out some other time.

Correct field projection not working

In Iceberg the tables are projected using field-IDs. Even if a column is renamed (and Iceberg is lazy, so existing data will not be rewritten), the scan should still map the name correctly.

Example:

from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog("test_sql_catalog", uri="sqlite:///:memory:", warehouse=f"/{warehouse}")
catalog.create_namespace("default")

import pyarrow as pa

pa_schema = pa.schema([
    pa.field('animal', pa.string(), nullable=False),
    pa.field('n_legs', pa.int32(), nullable=False),
])

# Empty table
tbl = catalog.create_table("default.test_id_projection", schema=pa_schema)

# Write some data
tbl.append(
    df=pa.Table.from_arrays([pa.array(["Flamingo", "Horse"]), pa.array([2, 4])], schema=pa_schema),
)

# Rename the column
with tbl.update_schema() as upd:
    upd.rename_column("n_legs", "number_of_legs")

# Write more data
tbl.append(
    df=pa.Table.from_arrays(
        [pa.array(["Brittle stars", "Centipede"]), pa.array([5, 100])],
        schema=pa.schema([
            pa.field('animal', pa.string(), nullable=False),
            pa.field('number_of_legs', pa.int32(), nullable=False),
        ]),
    ),
)

location = tbl.metadata_location

import duckdb

duckdb.sql('INSTALL iceberg; LOAD iceberg;')
result = duckdb.sql(
    f"""
SELECT *
FROM iceberg_scan('{location}')
"""
).fetchall()

assert result == [
    ('Flamingo', 2),
    ('Horse', 4),
    ('Brittle stars', 5),
    ('Centipede', 100),
]

This fails with:

>       ).fetchall()
E       duckdb.duckdb.IOException: IO Error: Failed to read file "//private/var/folders/22/yb9h2zd55ql37h4_50xkmg7r0000gn/T/pytest-of-fokkodriesprong/pytest-8/test_sql0/default.db/test_id_projection/data/00000-0-d24c398f-f93f-420c-82d4-2a66b72c580e.parquet": schema mismatch in glob: column "number_of_legs" was read from the original file "//private/var/folders/22/yb9h2zd55ql37h4_50xkmg7r0000gn/T/pytest-of-fokkodriesprong/pytest-8/test_sql0/default.db/test_id_projection/data/00000-0-f2b7f973-f7df-4cd6-aa7a-78176b3c3d7f.parquet", but could not be found in file "//private/var/folders/22/yb9h2zd55ql37h4_50xkmg7r0000gn/T/pytest-of-fokkodriesprong/pytest-8/test_sql0/default.db/test_id_projection/data/00000-0-d24c398f-f93f-420c-82d4-2a66b72c580e.parquet".
E       Candidate names: animal, n_legs
E       If you are trying to read files with different schemas, try setting union_by_name=True

0.10.0 Regression - Cannot parse metadata

Starting with DuckDB 0.10.0, I can no longer use the extension to scan iceberg files

select * from iceberg_scan('s3://zones/iceberg/test/zones_43_1b7d1e67-8f53-42ab-ba55-a66f438f344f/metadata/00006-0d6ca258-8039-446f-843a-69e21344f272.metadata.json');

Error: IO Error: Invalid field found while parsing field: required

The above works fine with DuckDB 0.9.2

I attached the metadata file, which was created by Iceberg.
00006-0d6ca258-8039-446f-843a-69e21344f272.metadata.json

Can't iceberg_scan specific manifest version from S3: No such file or directory

I'm trying to use the iceberg extension to read Iceberg data from S3.
As test data, I'm using the example attached to the iceberg extension doc page.
As S3 storage, I'm using MinIO in Docker.
I execute my query from Java code, using org.duckdb:duckdb_jdbc:1.0.0.

  1. When I query the whole table with allow_moved_paths = true, the tool works fine.

  2. When I query a specific metadata version (e.g. v1), following the example from the docs, I get an error:

java.sql.SQLException: IO Error: Cannot open file "lineitem_iceberg/metadata/snap-3776207205136740581-1-cf3d0be5-cf70-453d-ad8f-48fdc412e608.avro": No such file or directory
java.sql.SQLException: java.sql.SQLException: IO Error: Cannot open file "lineitem_iceberg/metadata/snap-3776207205136740581-1-cf3d0be5-cf70-453d-ad8f-48fdc412e608.avro": No such file or directory

Query looks like this:
SELECT * FROM iceberg_scan('s3://bucketname/lineitem_iceberg/metadata/v1.metadata.json');

  3. If I try to use another example:
    SELECT * FROM iceberg_scan('s3://bucketname/lineitem_iceberg/metadata/02701-1e474dc7-4723-4f8d-a8b3-b5f0454eb7ce.metadata.json', allow_moved_paths = true);

I get the exception "Enabling allow_moved_paths is not enabled for directly scanning metadata files.", which comes from this line.

The questions are:

  1. Why is allow_moved_paths prohibited when querying a specific version of the metadata? Maybe we can remove this check and allow allow_moved_paths?
  2. Any other ideas about what's wrong here and how we can fix it?

Support for the `file://` prefix

Most IO implementations allow the file:// prefix to indicate that the data is local. It would be great if DuckDB could support this as well 👍
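
For illustration, a minimal sketch of the requested behaviour (the paths are placeholders); the plain local path already works when pointing at a metadata file, while the explicit file:// form is what this issue asks for:

import duckdb

duckdb.sql("INSTALL iceberg; LOAD iceberg;")

# Works today: a plain local path to the table's metadata file.
duckdb.sql("SELECT * FROM iceberg_scan('/tmp/warehouse/db/tbl/metadata/v1.metadata.json')")

# The ask in this issue: accept the explicit file:// scheme as well
# (currently this scheme is not recognized and raises an error).
duckdb.sql("SELECT * FROM iceberg_scan('file:///tmp/warehouse/db/tbl/metadata/v1.metadata.json')")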

Write Support

Is there write support on the roadmap? If so, is there an ETA? I'll have to plan a future project and would like to keep using duckdb.

Table schema evolution support

First of all, thank you for this useful extension.
I can use the extension to read iceberg tables just fine.
As soon as the schema changes, the extension throws an error

Use case:

Using Trino to manage Iceberg on AWS/Glue/S3, I issued the following:

create table test(a int);
insert into test values(1);
alter table test add column b int;
insert into test values(2,5);

select * from test;
a | b
---+------
1 | NULL
2 | 5

Using the last metadata file, I issued the following in DuckDB (after loading the aws and iceberg extensions)

select * from ICEBERG_SCAN('s3:///test-0fc6feb5c66b4915b39dd2d0511105b7/metadata/00005-5a37e1af-dbf3-48c7-b7c7-11309ecc6279.metadata.json');

Error: IO Error: Failed to read file "s3:///test-0fc6feb5c66b4915b39dd2d0511105b7/data/20231204_085221_00020_8b7ws-30eb76d3-2f3e-4a01-a030-976f7640d26a.parquet": schema mismatch in glob: column "b" was read from the original file "s3:///test-0fc6feb5c66b4915b39dd2d0511105b7/data/20231204_085314_00023_8b7ws-1e49b4ae-9d23-469b-8e00-6dd48da1d0a0.parquet", but could not be found in file "s3:///test-0fc6feb5c66b4915b39dd2d0511105b7/data/20231204_085221_00020_8b7ws-30eb76d3-2f3e-4a01-a030-976f7640d26a.parquet".
Candidate names: a
If you are trying to read files with different schemas, try setting union_by_name=True

As I mentioned above, reading the data before the schema change worked just fine.

Thank you,
Harel

Compiler warnings

/__w/duckdb/duckdb/build/release/_deps/iceberg_extension_fc-src/src/include/iceberg_types.hpp: In function ‘std::string duckdb::IcebergManifestContentTypeToString(duckdb::IcebergManifestContentType)’:
Warning: /__w/duckdb/duckdb/build/release/_deps/iceberg_extension_fc-src/src/include/iceberg_types.hpp:30:1: warning: control reaches end of non-void function [-Wreturn-type]
   30 | }
      | ^
/__w/duckdb/duckdb/build/release/_deps/iceberg_extension_fc-src/src/include/iceberg_types.hpp: In function ‘std::string duckdb::IcebergManifestEntryStatusTypeToString(duckdb::IcebergManifestEntryStatusType)’:
Warning: /__w/duckdb/duckdb/build/release/_deps/iceberg_extension_fc-src/src/include/iceberg_types.hpp:43:1: warning: control reaches end of non-void function [-Wreturn-type]
   43 | }
      | ^
/__w/duckdb/duckdb/build/release/_deps/iceberg_extension_fc-src/src/include/iceberg_types.hpp: In function ‘std::string duckdb::IcebergManifestEntryContentTypeToString(duckdb::IcebergManifestEntryContentType)’:
Warning: /__w/duckdb/duckdb/build/release/_deps/iceberg_extension_fc-src/src/include/iceberg_types.hpp:56:1: warning: control reaches end of non-void function [-Wreturn-type]
   56 | }

Probably want to add a default: throw InternalException(...) to silence these warnings

Can't use the extension if my data catalog did not create a version-hint.text file

My s3 bucket with iceberg (picture below) cannot be queried with

iceberg_scan('s3://bucket/iceberg', ALLOW_MOVED_PATHS=true)

nor

iceberg_scan('s3://bucket/iceberg/*', ALLOW_MOVED_PATHS=true)

In particular the system is trying to find a very specific file (so the * pattern gets ignored):

duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL https://bucket.s3.amazonaws.com/iceberg/metadata/version-hint.text

Unfortunately that file does not exist in my iceberg/ folder, nor in any of the iceberg/sub/metadata folders. Compared to the data zip in duckdb docs about iceberg, it is clear "my iceberg tables" are missing that file, which is important for the current implementation.

That said, version-hint.text seems like something we do not really need: the version could default to the latest available, or be passed as an additional parameter, instead of failing when the file is not found.

The original discussion with @Alex-Monahan in the dbt Slack is here; note that I was originally pointed to this as a possible cause, so perhaps reading a table that is formally Iceberg is not really independent of the data catalog it belongs to?

(screenshot of the S3 bucket structure omitted)

support `ATTACH` statement for Iceberg catalog (similar to UC_CATALOG)

The uc_catalog extension supports the following:

INSTALL uc_catalog from core_nightly;
INSTALL delta from core;
LOAD delta;
LOAD uc_catalog;
CREATE SECRET (
	TYPE UC,
	TOKEN '${UC_TOKEN}',
	ENDPOINT '${UC_ENDPOINT}',
	AWS_REGION '${UC_AWS_REGION}'
)
ATTACH 'test_catalog' AS test_catalog (TYPE UC_CATALOG)
SHOW ALL TABLES;
SELECT * FROM test_catalog.test_schema.test_table;

This would be extremely helpful to have for Iceberg.

Internal Error When Querying Empty Tables With List Columns

Hey,
I have the following empty iceberg table:

table_name (
1: col1: optional string,
2: col2: optional timestamptz,
3: col3: optional string,
4: col4: optional timestamptz,
5: col5: optional timestamptz,
6: col6: optional list,
7: col7: optional timestamptz,
8: col8: optional list,
9: col9: optional list,
10: col10: optional list,
11: col11: optional string,
12: col12: optional string,
13: col13: optional string,
14: col14: optional list,
15: col15: optional string,
16: col16: optional list,
17: col17: optional list,
18: col18: optional list,
19: col19: optional list,
20: col20: optional list,
21: col21: optional list,
22: col22: optional list,
23: col23: optional boolean,
24: col24: optional list,
25: col25: optional list,
26: col26: optional list,
27: col27: optional timestamptz,
28: col28: optional timestamptz,
29: col29: optional timestamptz
),
partition by: [],
sort order: [],
snapshot: Operation.APPEND: id=3172179944265825688, schema_id=0

When using PyIceberg to scan it into a DuckDB connection, it works and returns an empty df as expected.
However, when trying to query it directly with duckdb like this:

duckdb.sql(f"""
INSTALL iceberg; 
LOAD iceberg; 
SELECT * FROM iceberg_scan('{table.metadata_location}', skip_schema_inference=True)""").fetchdf()

I get the following error:
InternalException: INTERNAL Error: Value::LIST without providing a child-type requires a non-empty list of values. Use Value::LIST(child_type, list) instead.
This error signals an assertion failure within DuckDB. This usually occurs due to unexpected conditions or errors in the program's logic.
For more information, see https://duckdb.org/docs/dev/internal_errors

If I try to use schema inference (i.e., without skip_schema_inference), I get this error:
IOException: IO Error: Invalid field found while parsing field: type

Thanks!

Predicate Pushdown for scans

It currently appears that predicate pushdown isn't applied before calculating what Parquet files are included in scanning the table's data.

The AVRO manifest files include statistics about each Parquet file. Those should be leveraged to reduce the number of files that need to be read. Additionally, there is the byte offset for each row group in each Parquet file. That can remove the need to read the footer of the Parquet file and allow better concurrent scanning.
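
For reference, this is roughly the pruning that PyIceberg already performs at planning time; a minimal sketch, assuming pyiceberg >= 0.6, with a placeholder metadata path and column name:

from pyiceberg.table import StaticTable

# Placeholder metadata location; point this at a real *.metadata.json file.
tbl = StaticTable.from_metadata("/tmp/warehouse/db/tbl/metadata/v3.metadata.json")

# plan_files() evaluates the row filter against the partition summaries in the
# manifest list and the per-column lower/upper bounds in the manifest entries,
# so only data files whose value ranges can possibly match are returned.
tasks = tbl.scan(row_filter="l_shipdate >= '1997-01-01'").plan_files()
print([task.file.file_path for task in tasks])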

Hidden partition pruning

Iceberg has support for hidden partitioning. Data written to a partitioned table will be split up based on the column and the applied transform:

(diagram of partition transforms omitted)

A partitioned table can tremendously speed up queries, since each manifest list keeps a summary of the partition range, so files can already be pruned at a high level.

It is important to note that partitions can evolve over time with Iceberg. For example, when a table's partitioning is changed from daily to hourly due to increased data volume, new data will be written using the new partitioning strategy. The old data will not change and will remain partitioned on a daily basis until it is rewritten.

Pruning of manifest files

Manifest list:

{
	"manifest_path": "s3://warehouse/nyc/taxis/metadata/65085354-ca41-4ae0-b454-04bc1a669f19-m0.avro",
	"manifest_length": 16797,
	"partition_spec_id": 0,  # field-id: 502
	"content": 0,
	"sequence_number": 3,
	"min_sequence_number": 3,
	"added_snapshot_id": 864191924052698687,
	"added_data_files_count": 42,
	"existing_data_files_count": 0,
	"deleted_data_files_count": 0,
	"added_rows_count": 3627882,
	"existing_rows_count": 0,
	"deleted_rows_count": 0,
	"partitions": {  # field-id: 507
		"array": [{
			"contains_null": false,
			"contains_nan": {
				"boolean": false
			},
			"lower_bound": {
				"bytes": "¤7\u0000\u0000"
			},
			"upper_bound": {
				"bytes": "·J\u0000\u0000"
			}
		}]
	}
}

Since there can be many different partition strategies over time, the evaluators are constructed lazily in PyIceberg. Based on the partition_spec_id (field-id 502) we construct an evaluator:

        manifest_evaluators: Dict[int, Callable[[ManifestFile], bool]] = KeyDefaultDict(self._build_manifest_evaluator)

This looks up the partition spec, and creates the evaluator:

    def _build_partition_projection(self, spec_id: int) -> BooleanExpression:
        project = inclusive_projection(self.table.schema(), self.table.specs()[spec_id])
        return project(self.row_filter)

This requires the original schema, since the filter references the original fields. Consider the following schema:

schema {
  1: id int
}

For example, take id = 10, and a bucket(id) partitioning on the table. Then the spec looks like:

[
     1000: id: bucket[22](1)
]

The PartitionSpec has a single field, whose field-ids start from 1000. It references field-id 1, which corresponds to id. This way the filtering is decoupled from the naming.

The most important part where it is filtered:

class InclusiveProjection(ProjectionEvaluator):
    def visit_bound_predicate(self, predicate: BoundPredicate[Any]) -> BooleanExpression:
        parts = self.spec.fields_by_source_id(predicate.term.ref().field.field_id)

        result: BooleanExpression = AlwaysTrue()
        for part in parts:
            # consider (d = 2019-01-01) with bucket(7, d) and bucket(5, d)
            # projections: b1 = bucket(7, '2019-01-01') = 5, b2 = bucket(5, '2019-01-01') = 0
            # any value where b1 != 5 or any value where b2 != 0 cannot be the '2019-01-01'
            #
            # similarly, if partitioning by day(ts) and hour(ts), the more restrictive
            # projection should be used. ts = 2019-01-01T01:00:00 produces day=2019-01-01 and
            # hour=2019-01-01-01. the value will be in 2019-01-01-01 and not in 2019-01-01-02.
            incl_projection = part.transform.project(name=part.name, pred=predicate)
            if incl_projection is not None:
                result = And(result, incl_projection)

        return result

Here we start with AlwaysTrue(), so a manifest is included until there is evidence that it is not relevant. This returns a callable that can be bound to the row_filter:

    def project(self, expr: BooleanExpression) -> BooleanExpression:
        #  projections assume that there are no NOT nodes in the expression tree. to ensure that this
        #  is the case, the expression is rewritten to push all NOT nodes down to the expression
        #  leaf nodes.
        #  this is necessary to ensure that the default expression returned when a predicate can't be
        #  projected is correct.
        return visit(bind(self.schema, rewrite_not(expr), self.case_sensitive), self)

Pruning of the manifests

Manifest file:

When a manifest is considered relevant based on the range, its entries are evaluated:

[{
  "status": 1,
  "snapshot_id": {
    "long": 4554453935356538000
  },
  "data_file": {
    "file_path": "s3://warehouse/nyc/taxis/data/tpep_pickup_datetime_day=2021-04-29/00003-10-989122fb-e6e6-449e-8290-57871f57028f-00001.parquet",
    "file_format": "PARQUET",
    "partition": {
      "tpep_pickup_datetime_day": {
        "int": 18746 // Partition: Thursday, April 29, 2021
      }
    },
...
}

This re-uses the same partition projection:

    @cached_property
    def partition_filters(self) -> KeyDefaultDict[int, BooleanExpression]:
        return KeyDefaultDict(self._build_partition_projection)

And the meat is in this function:

    def _build_partition_evaluator(self, spec_id: int) -> Callable[[DataFile], bool]:
        spec = self.table.specs()[spec_id]
        partition_type = spec.partition_type(self.table.schema())
        partition_schema = Schema(*partition_type.fields)
        partition_expr = self.partition_filters[spec_id]

        evaluator = visitors.expression_evaluator(partition_schema, partition_expr, self.case_sensitive)
        return lambda data_file: evaluator(data_file.partition)

We create a schema from the partition spec. We create a struct from the filter struct<int>, and we create an evaluator on top of it to see if there are rows that might match, based on the partition specification.

I hope this helps. I'm happy to provide more context, and when I get the time, I'll see if I can brush up on my cpp skills.

Question: how to speed up iceberg reads

I'm currently reading from my glue-cataloged iceberg table using the following:

duckdb.sql(
    f"""
           INSTALL httpfs;
           LOAD httpfs;
           set s3_region = 'us-west-2';
           set s3_access_key_id = '{settings.AWS_ACCESS_KEY_ID}';
           set s3_secret_access_key = '{settings.AWS_SECRET_ACCESS_KEY}';
           INSTALL iceberg;
           LOAD iceberg;
           """
)
res = duckdb.execute(
	  "SELECT * FROM iceberg_scan('s3://foopath) LIMIT 100"
)

The execution is very slow compared to just reading from the .parquet files at the same path (e.g. 2 minutes vs 2 seconds).

res = duckdb.execute(
	  "SELECT * FROM parquet_scan('s3://foopath/*.parquet) LIMIT 100"
)

I'd like to know what I'm doing wrong, or whether someone has a solution.
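
One way to see how much planning work iceberg_scan has to do is to list the manifest entries it resolves before reading any Parquet. A minimal sketch, assuming the same extensions and S3 settings as above are applied on the default connection:

import duckdb

# iceberg_metadata() returns one row per manifest entry (data file), which
# shows how many objects a full scan has to touch before reading any data.
duckdb.sql("SELECT * FROM iceberg_metadata('s3://foopath')").show()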

IOException when querying table with a list<int> column.

Hello, thanks for your great work on duckdb-iceberg!

I'm encountering an issue where attempting to query an iceberg table with iceberg_scan fails if the table has a column which is a list of ints (list<int>).

table_with_lists # a pyiceberg.table.Table object
table_without_lists # another pyiceberg.table.Table object

duckdb.sql("INSTALL ICEBERG; LOAD ICEBERG;")

# querying the table that doesn't feature any lists works fine
duckdb.sql(f"SELECT * FROM iceberg_scan('{table_without_lists.metadata_location}') LIMIT 10")

# querying the table that *does* have lists throws an error
duckdb.sql(f"SELECT * FROM iceberg_scan('{table_with_lists.metadata_location}') LIMIT 10")
# IOException: IO Error: Invalid field found while parsing field: type

This seems to be related to an error parsing the table schema. I was wondering if there maybe was no logic for complex types in the duckdb-iceberg extension, but it appears that there is.

binder error when directly scanning metadata file

Hey,

I am encountering an issue while trying to scan metadata files directly from an Iceberg table located in S3.

Here are the details:

  • Data Source: Iceberg table in S3, written by Flink with AWS Glue Catalog.
  • DuckDB Version: v0.9.1

Query

select * from  iceberg_scan('s3://<...>/metadata/00136-4330ee25-0db5-42a4-8230-035d739f76e7.metadata.json') limit 1;

Error:

Binder Error: Table "iceberg_scan_deletes" does not have a column named "file_path"

Would appreciate any insights or suggestions on what could be causing this issue. Thank you!

Unable to do a JOIN between two iceberg-based tables

Example of what I'm trying to do:

select * from iceberg_scan('s3://<url1>.metadata.json') as t1 join iceberg_scan('s3://t2.metadata.json') as t2 on 1=1

When running the above I get this error:
Binder Error: Duplicate alias "iceberg_scan_data" in query!

Although I'm passing an alias for each iceberg_scan operation, it's still assigning the same iceberg_scan_data alias to both.

A suggestion would be to add the alias as a parameter to the function.

Could not read iceberg table from s3 "...metadata/version-hint.text": 404 (Not Found)"

What happens?

Can't read iceberg table created with Trino on s3.

D SELECT count(*) FROM iceberg_scan('s3://test/iceberg_p2/orders');
Error: HTTP Error: Unable to connect to URL "https://test.gateway.storjshare.io/iceberg_p2/orders/metadata/version-hint.text": 404 (Not Found)

Trying the same thing on a local drive:

D SELECT count(*) FROM iceberg_scan('/Users/yuriygavrilov/Downloads/orders', ALLOW_MOVED_PATHS=true);
Error: IO Error: Cannot open file "/Users/yuriygavrilov/Downloads/orders/metadata/version-hint.text": No such file or directory

It seems the problem is the metadata/version-hint.text file, but why is it mandatory? Trino didn't require it.

To Reproduce

Try to open the attached orders.zip with the iceberg format.

OS:

14.0 (23A344)

DuckDB Version:

v0.9.1

DuckDB Client:

cli

Full Name:

Yuriy Gavrilov

Affiliation:

S7 Airlines

Have you tried this on the latest main branch?

I have not tested with any build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

Iceberg REST Catalog Support

Hey, team!

Very excited about the duckdb v0.9 support for iceberg!

I currently use a rest catalog for my iceberg tables and was hoping to be able to wire up duckdb to that rather than point it to the actual underlying data/metadata files.

If this is available, I'd love to use it -- otherwise, I'd be happy to jump in and start coding if this feature is new.

Thanks!

Not able to query iceberg using duckdb extension for hdfs warehouse

We have tables with a hadoop catalog and a jdbc-catalog saved in HDFS file systems. When querying the data by pointing directly at the metadata.json file, it throws a "no such file or directory" error: IO Error: Cannot open file "hdfs:///warehouse/user_journey_analytics/engagement_analytics/metadata/00065-83cdaa3e-0d01-40e2-acd4-431f560f0aab.metadata.json": No such file or directory.

This is my query for the table with the jdbc-catalog:

SELECT count(*) FROM iceberg_scan('hdfs:///warehouse/user_journey_analytics/engagement_analytics/metadata/00065-83cdaa3e-0d01-40e2-acd4-431f560f0aab.metadata.json');

and for the hadoop catalog:

SELECT count(*) FROM iceberg_scan('hdfs:///warehouse/broadcast/aggregated_analytics/metadata/v20.metadata.json');

I installed duckdb on the hadoop box to avoid a network call and query the table directly, which is why the paths start with hdfs:///.

Querying table fails with ` Value::LIST without providing a child-type requires a non-empty list of values. Use Value::LIST(child_type, list) instead.`

Querying an Iceberg table throws the error

SELECT * from iceberg_scan('gs://path/to/metadata.json');

INTERNAL Error: Value::LIST without providing a child-type requires a non-empty list of values. Use Value::LIST(child_type, list) instead.
This error signals an assertion failure within DuckDB. This usually occurs due to unexpected conditions or errors in the program's logic.
For more information, see https://duckdb.org/docs/dev/internal_errors

The iceberg table is stored in GCS. I've verified that querying the underlying Iceberg parquet tables works, so I believe the issue is coming from the iceberg scan.

Has anyone seen this before?

Support AWS Glue Catalog

I'd like to add support for "schemes" that prefix the location supplied for Iceberg tables.

For example I'd like to support AWS Glue with a scheme like:

iceberg_scan('aws-glue://[catalog id].[database_name].[table_name]').

The use of this scheme would allow DuckDB to access Iceberg tables where the current snapshot version is stored in a Glue data catalog. The Iceberg extension would call Glue's describe_table() API method, get the current metadata_url and parse the JSON file.

I'd also like to add a companion function called iceberg_scan_tables('aws-glue://[catalog id]') that would scan all databases and tables under the schema and create views that are backed by iceberg_scan(), so this can expose the databases and tables in glue to the end user.
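
Until such a scheme exists, a possible Python stopgap for the flow this issue describes (look up the current metadata pointer in Glue, then hand it to iceberg_scan) might look like the sketch below. The region, database, and table names are placeholders, and it assumes the Glue table entry stores the pointer in its metadata_location parameter, which is how Athena/Glue-registered Iceberg tables are typically set up:

import boto3
import duckdb

# Placeholders: use your own region, database, and table names.
glue = boto3.client("glue", region_name="us-east-1")
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]
# Assumption: the Glue table entry carries the current Iceberg metadata pointer
# in its 'metadata_location' table parameter.
metadata_location = table["Parameters"]["metadata_location"]

duckdb.sql("INSTALL httpfs; LOAD httpfs; INSTALL iceberg; LOAD iceberg;")
duckdb.sql(f"SELECT * FROM iceberg_scan('{metadata_location}')").show()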

How to build iceberg extension

I tried to compile the iceberg extension following the steps in the readme.md, but I ran into the following problem:

mkdir -p build/release && \
cmake   -DDUCKDB_EXTENSION_NAMES="iceberg" -DDUCKDB_EXTENSION_ICEBERG_PATH="/home/gpadmin/duckdb_iceberg/" -DDUCKDB_EXTENSION_ICEBERG_SHOULD_LINK=1 -DDUCKDB_EXTENSION_ICEBERG_INCLUDE_PATH="/home/gpadmin/duckdb_iceberg/src/include"  -DEXTENSION_STATIC_BUILD=1 -DCMAKE_BUILD_TYPE=Release -DEXTENSION_STATIC_BUILD=1 -DBUILD_EXTENSIONS="httpfs" -DOSX_BUILD_ARCH=   -DVCPKG_MANIFEST_DIR='/home/gpadmin/duckdb_iceberg/' -DVCPKG_BUILD=1 -DCMAKE_TOOLCHAIN_FILE='/home/gpadmin/vcpkg/scripts/buildsystems/vcpkg.cmake' -S ./duckdb/ -B build/release && \
cmake --build build/release --config Release
-- Running vcpkg install
warning: Embedding `vcpkg-configuration` in a manifest file is an EXPERIMENTAL feature.
Detecting compiler hash for triplet x64-linux...
The following packages will be built and installed:
    avro-cpp[core,snappy]:x64-linux -> 2022-11-07 -- /home/gpadmin/duckdb_iceberg/./vcpkg_ports/avro-cpp
  * boost-algorithm:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-algorithm/6a184d48bcf1a89a12f66255625340ce9cbfda92
  * boost-align:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-align/6dcfc6f69223f8fffff4c82db31be7495f17062d
  * boost-any:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-any/798ddb57d8d3bb7a77e424cd5d60f88cc4e6bee7
  * boost-array:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-array/8503faaf38485a913d4daf3efd2076b67bebdc0b
  * boost-assert:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-assert/f07d2980139502309087821ca9a78a9a43a766ec
  * boost-atomic:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-atomic/ed19c1a3b7f57d00917b260e35cd662b28c1f6eb
  * boost-bind:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-bind/1d9372948d3c0fe1da8965e76ded151513f16a52
  * boost-build:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-build/9d4a35930b7414144551dc9c2e00b31bcb0ca8b2
  * boost-concept-check:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-concept-check/c266d37e02d494dcf4c3623bae07775c2a7baa45
  * boost-config:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-config/af2f41ef14357f44466182d783e8cb218d677b1f
  * boost-container:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-container/54ac399bdf18e9c6ab8c9df4d81df4bd86a547d9
  * boost-container-hash:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-container-hash/c5bde693ed8971a4e0ba6e22854386d4054898a8
  * boost-conversion:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-conversion/e99972dc970d3a2229996b240d5dc4362204b6dc
  * boost-core:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-core/f406b6bfcf32cb14d17d1df5e57a1162a2f2d703
  * boost-crc:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-crc/7146ce31ceb0cdce252cdf556502fbd358eb30d5
  * boost-describe:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-describe/631c0781e5719b44d906954a57f3a9a42bb37e8a
  * boost-detail:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-detail/ea40bd02426af4c5ce99995762c452e3a70b5189
  * boost-dynamic-bitset:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-dynamic-bitset/d350f9d70da182a7abddd0cadc649394fb144113
  * boost-exception:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-exception/ef4cfcde16100435562ccebf90edca8cfc4e204c
  * boost-filesystem:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-filesystem/9e51e733f0ddb9e287a6ee03bf9cbb1f3f65f081
  * boost-format:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-format/9562adeb23d81f0bfb89adb5c920ead09b302ec5
  * boost-function:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-function/dce4e73b054f0b611a8d1e0f9efc9dd8cd79071f
  * boost-function-types:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-function-types/1da109d6c53f82fdac9bdf6efa17af3d47dacb54
  * boost-functional:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-functional/3c428ee1b1238a5fbf337fd549d3b3c3afc17930
  * boost-fusion:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-fusion/0c32122b9244e041424874f6657451169ab9522d
  * boost-integer:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-integer/ed95b549c8719f65d4599fb6a612db94043c033f
  * boost-intrusive:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-intrusive/c6ef0a6523433599c5a6e0bebedd4b2caa6b56cf
  * boost-io:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-io/21da7534c21a1dfd1ff5833e330837ad8fa26341
  * boost-iostreams[bzip2,core,lzma,zlib,zstd]:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-iostreams/9a39f5835d29b23ad73b036797cb4f6f000e83c2
  * boost-iterator:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-iterator/06e57442234e51fd436cc31ac285319baaf83e26
  * boost-lexical-cast:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-lexical-cast/54b82af182eb01789c463148630e7e897ae3f3b7
  * boost-math:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-math/61b0ddfd8a1a5e18f97584b57143de200770bd56
  * boost-modular-build-helper:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-modular-build-helper/5c1a238cc20c885e785bc05a36b974deb5e02db7
  * boost-move:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-move/1ce5f627ac0c30d041b9f2981fd3c11e4b504825
  * boost-mp11:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-mp11/80e104d697b7ff7a283c5f7ee83ae25fe1192bd7
  * boost-mpl:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-mpl/e08d5a2eb5d5edfc5a62469dbaa283cf56570d07
  * boost-numeric-conversion:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-numeric-conversion/97842146c6a87912141dd1d4654e1576d12e5cef
  * boost-optional:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-optional/100ff3094e97d23516718679d59d9f86821074b8
  * boost-predef:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-predef/3a0c5194d7a4d8825efff8d67e0c75cae0b058d7
  * boost-preprocessor:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-preprocessor/85819cdd004323b0afe1a361b9e05fdb55ab6838
  * boost-program-options:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-program-options/c39f818a3734e996eb777117eb8d1492d7b1b26f
  * boost-random:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-random/0c97951b0f38d6c9b469b367ac136249e1fa23fd
  * boost-range:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-range/fc4c6c85df6efb4afe27a53499abdcd17da71a69
  * boost-regex:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-regex/d31af2d45fe496fd2d89b734027e72c92d23f1d1
  * boost-smart-ptr:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-smart-ptr/a61a6534966bd4a70b051538a287de38be4345ba
  * boost-static-assert:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-static-assert/e6e55a999a301eb62d7c4da853a6aeb296713943
  * boost-system:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-system/48fb125e52a85de87f4ff03475742d79f84b42cb
  * boost-throw-exception:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-throw-exception/c51c9fa76f925c5824688c6ca9f451716373b5d5
  * boost-tokenizer:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-tokenizer/4a53c730b20cacfbd8e6bbbd8faaeee0720c9fc1
  * boost-tuple:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-tuple/c9173d5394c4b31fdf698b156b542c29031d03fe
  * boost-type-index:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-type-index/d02bcbc9c4ee9a4b27e90ab03010bea7b6c70eb3
  * boost-type-traits:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-type-traits/34773fec432e120d3f2d9dd482461da7c36ef897
  * boost-typeof:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-typeof/ab6e440573d5f3ac964c50b3efc24055500a772f
  * boost-uninstall:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-uninstall/d96807810fa612b038f9268c8df545d421fda1a6
  * boost-unordered:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-unordered/ee2a3d19df6c91036cb772b4747bc7244307cda1
  * boost-utility:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-utility/b94486558c58f29bb5165d952196d448719caae5
  * boost-variant2:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-variant2/3acfc977da209b860fd7cf1365180af919be0eeb
  * boost-vcpkg-helpers:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-vcpkg-helpers/f99477ea8dcfb463d2f0a2cce3c60914c2e5adb4
  * boost-winapi:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-winapi/db66242e796c29abd3622cf41d506bee622b4d10
  * bzip2[core,tool]:x64-linux -> 1.0.8#4 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/bzip2/6165360d15e6de08dff3a5f079d51e69908cc55d
  * liblzma:x64-linux -> 5.4.1#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/liblzma/9a3ebbf75e1c3adb76d7e71717c1f03bb291036a
    openssl:x64-linux -> 3.0.8 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/openssl/9cd36489fc65db87c4cd94ac606abd05c1db8f2d
  * snappy:x64-linux -> 1.1.9#5 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/snappy/a97f6aac039dc2b5e6fdac753d66b94cfb408d30
  * vcpkg-cmake:x64-linux -> 2022-12-22 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/vcpkg-cmake/1913f86defd2140d0a6751be2d51952e4090efa4
  * vcpkg-cmake-config:x64-linux -> 2022-02-06#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/vcpkg-cmake-config/8d54cc4f487d51b655abec5f9c9c3f86ca83311f
  * vcpkg-cmake-get-vars:x64-linux -> 2023-03-02 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/vcpkg-cmake-get-vars/2e624c2cf12a97a7a802e31ff1d28b9fa6ba9bde
  * zlib:x64-linux -> 1.2.13 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/zlib/ad5a49006f73b45b715299515f31164131b51982
  * zstd:x64-linux -> 1.5.4#2 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/zstd/fcc4fe2738d903bc347d20d2d4c836699e273509
Additional packages (*) will be modified to complete this operation.
Restored 24 package(s) from /home/gpadmin/.cache/vcpkg/archives in 55.1 ms. Use --debug to see more details.
Installing 1/69 vcpkg-cmake-config:x64-linux...
Elapsed time to handle vcpkg-cmake-config:x64-linux: 333 us
Installing 2/69 vcpkg-cmake:x64-linux...
Elapsed time to handle vcpkg-cmake:x64-linux: 249 us
Installing 3/69 snappy:x64-linux...
Elapsed time to handle snappy:x64-linux: 484 us
Installing 4/69 zlib:x64-linux...
Elapsed time to handle zlib:x64-linux: 369 us
Installing 5/69 boost-uninstall:x64-linux...
Elapsed time to handle boost-uninstall:x64-linux: 262 us
Installing 6/69 boost-vcpkg-helpers:x64-linux...
Elapsed time to handle boost-vcpkg-helpers:x64-linux: 283 us
Installing 7/69 boost-config:x64-linux...
Elapsed time to handle boost-config:x64-linux: 2.84 ms
Installing 8/69 boost-static-assert:x64-linux...
Elapsed time to handle boost-static-assert:x64-linux: 395 us
Installing 9/69 boost-type-traits:x64-linux...
Elapsed time to handle boost-type-traits:x64-linux: 3.48 ms
Installing 10/69 boost-assert:x64-linux...
Elapsed time to handle boost-assert:x64-linux: 580 us
Installing 11/69 boost-throw-exception:x64-linux...
Elapsed time to handle boost-throw-exception:x64-linux: 526 us
Installing 12/69 boost-core:x64-linux...
Elapsed time to handle boost-core:x64-linux: 1.39 ms
Installing 13/69 boost-tuple:x64-linux...
Elapsed time to handle boost-tuple:x64-linux: 493 us
Installing 14/69 boost-preprocessor:x64-linux...
Elapsed time to handle boost-preprocessor:x64-linux: 4.41 ms
Installing 15/69 boost-io:x64-linux...
Elapsed time to handle boost-io:x64-linux: 667 us
Installing 16/69 boost-utility:x64-linux...
Elapsed time to handle boost-utility:x64-linux: 837 us
Installing 17/69 boost-predef:x64-linux...
Elapsed time to handle boost-predef:x64-linux: 2.06 ms
Installing 18/69 boost-winapi:x64-linux...
Elapsed time to handle boost-winapi:x64-linux: 1.56 ms
Installing 19/69 boost-mp11:x64-linux...
Elapsed time to handle boost-mp11:x64-linux: 863 us
Installing 20/69 boost-variant2:x64-linux...
Elapsed time to handle boost-variant2:x64-linux: 668 us
Installing 21/69 vcpkg-cmake-get-vars:x64-linux...
Elapsed time to handle vcpkg-cmake-get-vars:x64-linux: 685 us
Installing 22/69 boost-modular-build-helper:x64-linux...
Elapsed time to handle boost-modular-build-helper:x64-linux: 704 us
Installing 23/69 boost-build:x64-linux...
Elapsed time to handle boost-build:x64-linux: 9.99 ms
Installing 24/69 boost-system:x64-linux...
Building boost-system:x64-linux...
-- Installing port from location: /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-system/48fb125e52a85de87f4ff03475742d79f84b42cb
-- Downloading https://github.com/boostorg/system/archive/boost-1.82.0.tar.gz -> boostorg-system-boost-1.82.0.tar.gz...
-- Cleaning sources at /home/gpadmin/vcpkg/buildtrees/boost-system/src/ost-1.82.0-6fabddb1d7.clean. Use --editable to skip cleaning for the packages you specify.
-- Extracting source /home/gpadmin/vcpkg/downloads/boostorg-system-boost-1.82.0.tar.gz
-- Using source at /home/gpadmin/vcpkg/buildtrees/boost-system/src/ost-1.82.0-6fabddb1d7.clean
-- Getting CMake variables for x64-linux
-- Configuring x64-linux
-- Building x64-linux-dbg
CMake Error at scripts/cmake/vcpkg_execute_build_process.cmake:134 (message):
    Command failed: /home/gpadmin/vcpkg/downloads/tools/cmake-3.27.1-linux/cmake-3.27.1-linux-x86_64/bin/cmake --build . --config Debug --target install -- -v -j17
    Working Directory: /home/gpadmin/vcpkg/buildtrees/boost-system/x64-linux-dbg
    See logs for more information:
      /home/gpadmin/vcpkg/buildtrees/boost-system/install-x64-linux-dbg-out.log

Call Stack (most recent call first):
  /home/gpadmin/duckdb_iceberg/build/release/vcpkg_installed/x64-linux/share/vcpkg-cmake/vcpkg_cmake_build.cmake:74 (vcpkg_execute_build_process)
  /home/gpadmin/duckdb_iceberg/build/release/vcpkg_installed/x64-linux/share/vcpkg-cmake/vcpkg_cmake_install.cmake:16 (vcpkg_cmake_build)
  /home/gpadmin/duckdb_iceberg/build/release/vcpkg_installed/x64-linux/share/boost-build/boost-modular-build.cmake:140 (vcpkg_cmake_install)
  buildtrees/versioning_/versions/boost-system/48fb125e52a85de87f4ff03475742d79f84b42cb/portfile.cmake:14 (boost_modular_build)
  scripts/ports.cmake:168 (include)


error: building boost-system:x64-linux failed with: BUILD_FAILED
Elapsed time to handle boost-system:x64-linux: 3.1 s
Please ensure you're using the latest port files with `git pull` and `vcpkg update`.
Then check for known issues at:
  https://github.com/microsoft/vcpkg/issues?q=is%3Aissue+is%3Aopen+in%3Atitle+boost-system
You can submit a new issue at:
  https://github.com/microsoft/vcpkg/issues/new?title=[boost-system]+Build+error&body=Copy+issue+body+from+%2Fhome%2Fgpadmin%2Fduckdb_iceberg%2Fbuild%2Frelease%2Fvcpkg_installed%2Fvcpkg%2Fissue_body.md

-- Running vcpkg install - failed
CMake Error at /home/gpadmin/vcpkg/scripts/buildsystems/vcpkg.cmake:899 (message):
  vcpkg install failed.  See logs for more information:
  /home/gpadmin/duckdb_iceberg/build/release/vcpkg-manifest-install.log
Call Stack (most recent call first):
  /usr/local/share/cmake-3.18/Modules/CMakeDetermineSystem.cmake:93 (include)
  CMakeLists.txt:19 (project)


CMake Error: CMake was unable to find a build program corresponding to "Unix Makefiles".  CMAKE_MAKE_PROGRAM is not set.  You probably need to select a different build tool.
CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!
make: *** [Makefile:52: release] Error 1

Binder Error: Table "iceberg_scan_deletes" does not have a column named "file_path"

I am using Flink to write data into an Iceberg table; here's the table definition and test data:

CREATE TABLE test (
id INT,
name STRING,
PRIMARY KEY(`id`) NOT ENFORCED
) WITH (
'connector' = 'iceberg',
'catalog-name'='hadoop_catalog',
'catalog-type'='hadoop',
'warehouse'='s3a://umbrella/hadoop/warehouse',
'table-name' = 'test',
'format-version'='2',
'write.upsert.enabled'='true'
);

INSERT INTO test VALUES (1, 'A'), (2, 'B'), (3, 'C');

On the duckdb side, it can successfully read the Iceberg data from S3:

D select * FROM iceberg_scan('s3://umbrella/hadoop/warehouse/default_database/test');
┌───────┬─────────┐
│  id   │  name   │
│ int32 │ varchar │
├───────┼─────────┤
│     1 │ A       │
│     2 │ B       │
│     3 │ C       │
└───────┴─────────┘

Then I update the record whose id = 3 using upsert in Flink:

INSERT INTO test /*+ OPTIONS('upsert-enabled'='true') */ VALUES (3, 'C-C')

But I got ' Binder Error: Table "iceberg_scan_deletes" does not have a column named "file_path" ' when querying it again in duckdb

D select * FROM iceberg_scan('s3://umbrella/hadoop/warehouse/default_database/test');
Binder Error: Table "iceberg_scan_deletes" does not have a column named "file_path"

I am using flink-1.18.1 + iceberg-1.5.2 + duckdb-1.0.0. I suppose this is caused by version compatibility? What's the recommended Iceberg version for duckdb-1.0.0? Thanks!
