
Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.

Home Page: https://xtable.apache.org/

License: Apache License 2.0

Java 90.95% JavaScript 1.02% CSS 6.22% HTML 1.52% MDX 0.29%
apache-hudi apache-iceberg delta-lake

incubator-xtable's Introduction

Apache XTable™ (Incubating)

Apache XTable™ (Incubating) is a cross-table converter for table formats that facilitates omni-directional interoperability across data processing systems and query engines. Currently, Apache XTable™ supports widely adopted open-source table formats such as Apache Hudi, Apache Iceberg, and Delta Lake.

Apache XTable™ simplifies data lake operations by leveraging a common model for table representation. This allows users to write data in one format while still benefiting from integrations and features available in other formats. For instance, Apache XTable™ enables existing Hudi users to seamlessly work with Databricks's Photon Engine or query Iceberg tables with Snowflake. Creating transformations from one format to another is straightforward and only requires the implementation of a few interfaces, which we believe will facilitate the expansion of supported source and target formats in the future.

Building the project and running tests

  1. Use Java 11 for building the project. If you are using a different Java version, you can use jenv to manage multiple Java versions locally.
  2. Build the project using mvn clean package. Use mvn clean package -DskipTests to skip tests while building.
  3. Use mvn clean test or mvn test to run all unit tests. If you need to run only a specific test, you can do so with, for example, mvn test -Dtest=TestDeltaSync -pl core.
  4. Similarly, use mvn clean verify or mvn verify to run integration tests.

Style guide

  1. We use the Maven Spotless plugin and Google Java Format for code style.
  2. Use mvn spotless:check to find code style violations and mvn spotless:apply to fix them. The code style check is tied to the compile phase by default, so style violations will lead to build failures.

Running the bundled jar

  1. Get a pre-built bundled jar or create the jar with mvn install -DskipTests
  2. Create a YAML file that follows the format below:
sourceFormat: HUDI
targetFormats:
  - DELTA
  - ICEBERG
datasets:
  -
    tableBasePath: s3://tpc-ds-datasets/1GB/hudi/call_center
    tableDataPath: s3://tpc-ds-datasets/1GB/hudi/call_center/data
    tableName: call_center
    namespace: my.db
  -
    tableBasePath: s3://tpc-ds-datasets/1GB/hudi/catalog_sales
    tableName: catalog_sales
    partitionSpec: cs_sold_date_sk:VALUE
  -
    tableBasePath: s3://hudi/multi-partition-dataset
    tableName: multi_partition_dataset
    partitionSpec: time_millis:DAY:yyyy-MM-dd,type:VALUE
  -
    tableBasePath: abfs://[email protected]/multi-partition-dataset
    tableName: multi_partition_dataset
  • sourceFormat is the format of the source table that you want to convert
  • targetFormats is a list of formats you want to create from your source tables
  • tableBasePath is the basePath of the table
  • tableDataPath is an optional field specifying the path to the data files. If not specified, the tableBasePath will be used. For Iceberg source tables, you will need to specify the /data path.
  • namespace is an optional field specifying the namespace of the table and will be used when syncing to a catalog.
  • partitionSpec is a spec that allows us to infer partition values. This is only required for Hudi source tables. If the table is not partitioned, leave it blank. If it is partitioned, specify a comma-separated list of entries in the format path:type:format (see the example after this list), where:
    • path is a dot-separated path to the partition field
    • type describes how the partition value was generated from the column value
      • VALUE: an identity transform of field value to partition value
      • YEAR: data is partitioned by a field representing a date and year granularity is used
      • MONTH: same as YEAR but with month granularity
      • DAY: same as YEAR but with day granularity
      • HOUR: same as YEAR but with hour granularity
    • format: if your partition type is YEAR, MONTH, DAY, or HOUR, specify the format for the date string as it appears in your file paths
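For example (the bucket, table, and field names below are hypothetical), a Hudi table partitioned by a nested timestamp field at day granularity and by a type column could declare its partition spec like this:
datasets:
  -
    tableBasePath: s3://my-bucket/hudi/events
    tableName: events
    partitionSpec: event_info.created_at:DAY:yyyy-MM-dd,event_info.type:VALUE
Here event_info.created_at is the dot-separated path to the nested partition field, DAY with yyyy-MM-dd describes how the date string appears in the file paths, and VALUE marks an identity partition.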
  3. The default implementations of table format converters can be replaced with custom implementations by specifying a converter configs yaml file in the format below:
# conversionSourceProviderClass: The class name of a table format's converter factory, where the converter is
#     used for reading from a table of this format. All user configurations, including hadoop config
#     and converter specific configuration, will be available to the factory for instantiation of the
#     converter.
# conversionTargetProviderClass: The class name of a table format's converter factory, where the converter is
#     used for writing to a table of this format.
# configuration: A map of configuration values specific to this converter.
tableFormatConverters:
    HUDI:
      conversionSourceProviderClass: org.apache.xtable.hudi.HudiConversionSourceProvider
    DELTA:
      conversionTargetProviderClass: org.apache.xtable.delta.DeltaConversionTarget
      configuration:
        spark.master: local[2]
        spark.app.name: xtable
  4. A catalog can be used when reading and updating Iceberg tables. The catalog can be specified in a yaml file and passed in with the --icebergCatalogConfig option. The format of the catalog config file is:
catalogImpl: io.my.CatalogImpl
catalogName: name
catalogOptions: # all other options are passed through in a map
  key1: value1
  key2: value2
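As an illustration only (the class and option names come from Apache Iceberg's AWS module, and the paths are placeholders rather than values taken from this project), a catalog config for a Glue-backed Iceberg catalog could look like:
catalogImpl: org.apache.iceberg.aws.glue.GlueCatalog
catalogName: glue
catalogOptions:
  warehouse: s3://my-bucket/iceberg-warehouse
  io-impl: org.apache.iceberg.aws.s3.S3FileIO
Any options under catalogOptions are passed straight through to the catalog implementation.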
  5. Run with java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml [--hadoopConfig hdfs-site.xml] [--convertersConfig converters.yaml] [--icebergCatalogConfig catalog.yaml]. The bundled jar includes Hadoop dependencies for AWS, Azure, and GCP. Sample Hadoop configurations for configuring the converters can be found in the xtable-hadoop-defaults.xml file. Custom Hadoop configurations can be passed in with the --hadoopConfig [custom-hadoop-config-file] option and will override the default Hadoop configurations. For an example of a custom Hadoop config file, see hadoop.xml.

Contributing

Setup

For setting up the repo in IntelliJ, open the project and change the Java version to Java 11 in File -> Project Structure.

Have you found a bug, or do you have a cool idea you want to contribute to the project? Please file a GitHub issue here.

Adding a new target format

Adding a new target format requires a developer to implement the ConversionTarget interface. Once you have implemented that interface, you can integrate it into the ConversionController. If you think others may find that target useful, please raise a Pull Request to add it to the project.

Overview of the sync process

[Diagram: overview of the sync process]

incubator-xtable's People

Contributors

alberttwong, arifazmidd, ashvina, danielscamacho, daragu, dependabot[bot], dipankarmazumdar, hussein-awala, iemejia, jackwener, jcamachor, kywe665, lmccay, lordozb, poojanilangekar, sagarlakshmipathy, the-other-tim-brown, vamshigv, wuchunfu, yihua, zabetak


incubator-xtable's Issues

Reduce cases in ITOnetableClient

Relates to #20 and #21

Once the above are completed, we can reduce the number of cases we're covering in the integration tests and focus on just a handful of key flows as sanity checks that components are working together as expected.

Iterative execution of OneTable failed in deserializing OneSchema

I am reporting this error for awareness. At the moment I am unsure if the test setup is causing this failure. In my test I copied an existing Hudi table from one Az Storage account to another. Then, I executed OneTable conversion twice. The first execution completed successfully. The second attempt failed in deserializing the schema.
Below is the stack trace for reference. As I mentioned earlier, this may be a setup issue.

2023-08-29 09:52:41 INFO  io.onetable.client.OneTableClient:128 - OneTable Sync is successful for the following formats [DELTA, ICEBERG]

$ java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --configFilePath my_config.yaml  -h onetable-hadoop.xml
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-08-29 09:52:52 INFO  io.onetable.utilities.RunSync:96 - Running sync for basePath abfs://[email protected]/call_center for following table formats [DELTA, ICEBERG]
2023-08-29 09:52:52 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:130 - Loading HoodieTableMetaClient from abfs://[email protected]/call_center
2023-08-29 09:52:52 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-08-29 09:52:52 WARN  org.apache.hadoop.fs.azurebfs.utils.SSLSocketFactoryEx:131 - Failed to load OpenSSL. Falling back to the JSSE default.
2023-08-29 09:52:54 INFO  org.apache.hudi.common.table.HoodieTableConfig:268 - Loading table properties from abfs://[email protected]/call_center/.hoodie/hoodie.properties
2023-08-29 09:52:54 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:149 - Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from abfs://[email protected]/call_center
2023-08-29 09:52:54 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:152 - Loading Active commit timeline for abfs://[email protected]/call_center
2023-08-29 09:52:54 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline:171 - Loaded instants upto : Option{val=[20230702230621688__replacecommit__COMPLETED]}
2023-08-29 09:52:54 ERROR io.onetable.utilities.RunSync:116 - Error running sync for abfs://[email protected]/call_center
io.onetable.exception.OneIOException: Failed to get one table state
	at io.onetable.client.OneTableClient.getSyncState(OneTableClient.java:302) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
	at io.onetable.client.OneTableClient.sync(OneTableClient.java:103) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
	at io.onetable.utilities.RunSync.main(RunSync.java:114) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
Caused by: com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Invalid type definition for type `io.onetable.model.schema.OneSchema`: Argument #0 of constructor [constructor for `io.onetable.model.schema.OneSchema` (6 args), annotations: {interface com.fasterxml.jackson.annotation.JsonCreator=@com.fasterxml.jackson.annotation.JsonCreator(mode=DEFAULT)} has no property name (and is not Injectable): can not use as property-based Creator
 at [Source: (StringReader); line: 1, column: 1]
	at com.fasterxml.jackson.databind.exc.InvalidDefinitionException.from(InvalidDefinitionException.java:62) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
...
	at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
	at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:642) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
...
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3597) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
	at io.onetable.persistence.OneTableStateStorage.readOneTableState(OneTableStateStorage.java:142) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
	at io.onetable.persistence.OneTableStateStorage.read(OneTableStateStorage.java:107) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]

BigLake Metastore Testing

Validate that the outputs of onetable can still be synced with BigLake metastore.

Write up a guide for how to sync the outputs to BigLake referencing existing documentation where appropriate.

Perform testing with large number of commits and partitions

We need to run testing for each source and target with a large number of partitions and commits in a single table (on the order of 10k commits and partitions for starters). This will help us understand any bottlenecks or improvements that can be made in our initial sync flow.

Fabric Query Testing

Validate that all formats can be read by the Fabric query engine after being written with Onetable.

Produce a user guide for how to integrate with this engine referencing existing documentation when appropriate.

Redshift spectrum query testing

Validate that all formats can be read by the Redshift query engine after being written with Onetable.

Produce a user guide for how to integrate with this engine referencing existing documentation when appropriate.

Enable CI for macOS

5d60018 set up CI for Linux (build and run tests) by relying on Azure Pipelines. I'm creating this issue to 1) validate that it is working as intended, and 2) enable CI for macOS as well.

Add Hudi Target Client

With the 0.14.0 release, there is support for handling arbitrary file names in Hudi. This was previously a blocker for making Hudi a target for tables written in the other formats.

Properly reflect rollbacks/restores in target tables

Right now when we see a rollback or restore in the source table, we just treat it as files being removed from the table. We should update this to instead issue a rollback command in the target tables so that the histories are more consistent between the source and target.

Mandatory tableName not set in main method and missing in sample config

tableName is mandatory in PerTableConfig for launching the OneTable application. However, it is not set within the main method. Additionally, it is missing from the sample config file provided, so the example as presented will fail.

Exception in thread "main" java.lang.NullPointerException: tableName is marked non-null but is null
	at io.onetable.client.PerTableConfig.<init>(PerTableConfig.java:33)
	at io.onetable.client.PerTableConfig$PerTableConfigBuilder.build(PerTableConfig.java:33)
	at io.onetable.utilities.RunSync.main(RunSync.java:119)

Generalize handling of concurrent writes during sync

OneTableClient has logic to handle the case where archival has already kicked in on the source, which requires a snapshot sync instead of an incremental sync to catch up to the active timeline. However, the logic is coupled to the Hudi API and had to be deactivated. This issue tracks re-enabling it so that it works for any format.

Originally posted by @the-other-tim-brown in #9 (comment)

AWS Glue testing

Validate that the outputs of onetable can still be synced with AWS Glue.

Write up a guide for how to sync the outputs to Glue referencing existing documentation where appropriate.

Improve Iceberg catalog support

In the IcebergClient we use HadoopTables for creating and loading existing tables. We can provide better support for Iceberg catalogs by using Iceberg's CatalogUtil#loadCatalog to create a catalog instance from a user-provided configuration and then using that catalog for our create and load operations.

Source client specific testing

We should create a suite of tests that we can run for all 3 sources that validate that we can read commits of all different types into the common model.

Currently we are doing this as part of the ITOneTableClient but these tests are long and will only get longer as more sources are added.

Target client specific testing

Relates to #20

Similar to sources, we should also test that the targets are properly persisting the data provided from the common in-memory model.

All conversion paths tested

We should have basic test flows for:

  1. Hudi -> Iceberg and Delta (already exists)
  2. Iceberg -> Delta and Hudi
  3. Delta -> Hudi and Iceberg

Round trip conversions

For each pair of formats, we should have a test that can write in format A, sync to B, write an update in B, sync to A, and read the results in A.

Snowflake query testing

Validate that all formats can be read by the Snowflake query engine after being written with Onetable.

Produce a user guide for how to integrate with this engine referencing existing documentation when appropriate.

[Enhancement] Add ability to extend Hadoop FS configs and ADLS Support

Currently, a subset of Hadoop FS config keys, supporting only S3 and GCS, is hardcoded in OneTable. This approach limits the ability to add new HDFS-compliant file systems or to customize configs for the currently supported storage clients. I propose decoupling the configuration settings from the main application logic in RunSync. This involves:

  • Extracting the current configuration into an XML file with default keys and values.
  • Taking a custom Hadoop config file as an input which could add new keys and/or override existing defaults.

By decoupling configurations, we can quickly test with new FS clients.

Presto Query Testing

Validate that all formats can be read by the Presto query engine after being written with Onetable.

Produce a user guide for how to integrate with this engine referencing existing documentation when appropriate.

Concurrency Testing

We should set up a test suite of cases that simulate a multi-writer scenario in a given source format and make sure that we can properly sync to the target tables.

Real world example: Hudi can run its clustering asynchronously and perform a clustering for some partitions while doing new updates on others.

Open Telemetry Integration

Instrument the project with OpenTelemetry or something similar to get a better understanding of the performance characteristics of Onetable.

Replace TableFormat enum with extensible config file

The TableFormat enum should be removed from OneTable as it restricts extensibility: currently it forces a rebuild whenever a new format needs to be supported. In practice, there could be multiple versions of the open formats, and users may also have formats of their own.
One option is to externalize the list of clients in a config file, which contains a default list of client names associated with default implementations and configurations.
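As a rough sketch of that option (the keys and custom class names below are illustrative, not an agreed design), such a file could mirror the tableFormatConverters config shown in the README above:
tableFormats:
  HUDI:
    conversionSourceProviderClass: org.apache.xtable.hudi.HudiConversionSourceProvider
  MY_CUSTOM_FORMAT:
    conversionSourceProviderClass: com.example.MyFormatConversionSourceProvider
    conversionTargetProviderClass: com.example.MyFormatConversionTargetProvider
A new format would then be added by dropping in a jar and an entry in this file, with no rebuild required.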

HMS Testing

Validate that the outputs of onetable can still be synced with Hive metastore.

Write up a guide for how to sync the outputs to HMS referencing existing documentation where appropriate.

BigQuery query testing

Validate that all formats can be read by the BigQuery query engine after being written with Onetable.

Produce a user guide for how to integrate with this engine referencing existing documentation when appropriate.

Remove Onetable state

The current implementation relies on Onetable's own state file which causes increased IO. We want to transition the source to only rely on the source format to avoid this extra overhead and improve performance.

Trino Query Testing

Validate that all formats can be read by the Trino query engine after being written with Onetable.

Produce a user guide for how to integrate with this engine referencing existing documentation when appropriate.

Unity Catalog Testing

Validate that the outputs of onetable can still be synced with Unity catalog.

Write up a guide for how to sync the outputs to Unity referencing existing documentation where appropriate.

Inconsistency in Incremental Sync when processing Clean commits

There will be an inconsistency when doing incremental sync of commits from a Hudi source in the following scenario. We should recognize this case and fall back to a snapshot sync to avoid inconsistencies.

MOR table

  1. commit1
  2. synced to target instant (so last synced instant = commit1)
  3. commit2 (upsert records in commit1)
  4. commit3 (compaction) replaces the base file in the file slice.
  5. commit4 (inserts)
  6. commit5 (clean, which removes the base file generated in commit1).
  7. Incremental sync will never replace the older base file from commit1 in the target, so we should avoid incremental sync here. More specifically, if the cleaner acts on a commit newer than the last synced instant in the target client, we should fall back to a snapshot sync.
