
Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.

Home Page: https://xtable.apache.org/

License: Apache License 2.0

Java 90.95% JavaScript 1.02% CSS 6.22% HTML 1.52% MDX 0.29%
apache-hudi apache-iceberg delta-lake

incubator-xtable's Introduction

Apache XTable™ (Incubating)

Apache XTable™ (Incubating) is a cross-table converter for table formats that facilitates omni-directional interoperability across data processing systems and query engines. Currently, Apache XTable™ supports widely adopted open-source table formats such as Apache Hudi, Apache Iceberg, and Delta Lake.

Apache XTable™ simplifies data lake operations by leveraging a common model for table representation. This allows users to write data in one format while still benefiting from integrations and features available in other formats. For instance, Apache XTable™ enables existing Hudi users to seamlessly work with Databricks's Photon Engine or query Iceberg tables with Snowflake. Creating transformations from one format to another is straightforward and only requires the implementation of a few interfaces, which we believe will facilitate the expansion of supported source and target formats in the future.

Building the project and running tests

  1. Use Java 11 for building the project. If you are using a different Java version, you can use jenv to manage multiple Java versions locally.
  2. Build the project using mvn clean package. Use mvn clean package -DskipTests to skip tests while building.
  3. Use mvn clean test or mvn test to run all unit tests. If you need to run only a specific test, you can do so with, for example, mvn test -Dtest=TestDeltaSync -pl core.
  4. Similarly, use mvn clean verify or mvn verify to run integration tests.

Style guide

  1. We use the Maven Spotless plugin and Google Java Format for code style.
  2. Use mvn spotless:check to find code style violations and mvn spotless:apply to fix them. The code style check is tied to the compile phase by default, so style violations will lead to build failures.

Running the bundled jar

  1. Get a pre-built bundled jar or create the jar with mvn install -DskipTests
  2. Create a YAML file that follows the format below:
sourceFormat: HUDI
targetFormats:
  - DELTA
  - ICEBERG
datasets:
  -
    tableBasePath: s3://tpc-ds-datasets/1GB/hudi/call_center
    tableDataPath: s3://tpc-ds-datasets/1GB/hudi/call_center/data
    tableName: call_center
    namespace: my.db
  -
    tableBasePath: s3://tpc-ds-datasets/1GB/hudi/catalog_sales
    tableName: catalog_sales
    partitionSpec: cs_sold_date_sk:VALUE
  -
    tableBasePath: s3://hudi/multi-partition-dataset
    tableName: multi_partition_dataset
    partitionSpec: time_millis:DAY:yyyy-MM-dd,type:VALUE
  -
    tableBasePath: abfs://[email protected]/multi-partition-dataset
    tableName: multi_partition_dataset
  • sourceFormat is the format of the source table that you want to convert
  • targetFormats is a list of formats you want to create from your source tables
  • tableBasePath is the basePath of the table
  • tableDataPath is an optional field specifying the path to the data files. If not specified, the tableBasePath will be used. For Iceberg source tables, you will need to specify the /data path.
  • namespace is an optional field specifying the namespace of the table and will be used when syncing to a catalog.
  • partitionSpec is a spec that allows us to infer partition values. This is only required for Hudi source tables. If the table is not partitioned, leave it blank. If it is partitioned, specify a comma-separated list of entries in the format path:type:format (see the example after this list), where:
    • path is a dot-separated path to the partition field
    • type describes how the partition value was generated from the column value
      • VALUE: an identity transform of field value to partition value
      • YEAR: data is partitioned by a field representing a date and year granularity is used
      • MONTH: same as YEAR but with month granularity
      • DAY: same as YEAR but with day granularity
      • HOUR: same as YEAR but with hour granularity
    • format: if your partition type is YEAR, MONTH, DAY, or HOUR, specify the format for the date string as it appears in your file paths
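For example (the bucket, table, and field names below are hypothetical), a Hudi table partitioned by a nested timestamp field at day granularity and by a type column could declare its partition spec like this:
datasets:
  -
    tableBasePath: s3://my-bucket/hudi/events
    tableName: events
    partitionSpec: event_info.created_at:DAY:yyyy-MM-dd,event_info.type:VALUE
Here event_info.created_at is the dot-separated path to the nested partition field, DAY with yyyy-MM-dd describes how the date string appears in the file paths, and VALUE marks an identity partition.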
  3. The default implementations of table format converters can be replaced with custom implementations by specifying a converter configs yaml file in the format below:
# conversionSourceProviderClass: The class name of a table format's converter factory, where the converter is
#     used for reading from a table of this format. All user configurations, including hadoop config
#     and converter specific configuration, will be available to the factory for instantiation of the
#     converter.
# conversionTargetProviderClass: The class name of a table format's converter factory, where the converter is
#     used for writing to a table of this format.
# configuration: A map of configuration values specific to this converter.
tableFormatConverters:
    HUDI:
      conversionSourceProviderClass: org.apache.xtable.hudi.HudiConversionSourceProvider
    DELTA:
      conversionTargetProviderClass: org.apache.xtable.delta.DeltaConversionTarget
      configuration:
        spark.master: local[2]
        spark.app.name: xtable
  4. A catalog can be used when reading and updating Iceberg tables. The catalog can be specified in a yaml file and passed in with the --icebergCatalogConfig option. The format of the catalog config file is:
catalogImpl: io.my.CatalogImpl
catalogName: name
catalogOptions: # all other options are passed through in a map
  key1: value1
  key2: value2
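As an illustration only (the class and option names come from Apache Iceberg's AWS module, and the paths are placeholders rather than values taken from this project), a catalog config for a Glue-backed Iceberg catalog could look like:
catalogImpl: org.apache.iceberg.aws.glue.GlueCatalog
catalogName: glue
catalogOptions:
  warehouse: s3://my-bucket/iceberg-warehouse
  io-impl: org.apache.iceberg.aws.s3.S3FileIO
Any options under catalogOptions are passed straight through to the catalog implementation.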
  5. Run with java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml [--hadoopConfig hdfs-site.xml] [--convertersConfig converters.yaml] [--icebergCatalogConfig catalog.yaml]. The bundled jar includes Hadoop dependencies for AWS, Azure, and GCP. Sample Hadoop configurations for configuring the converters can be found in the xtable-hadoop-defaults.xml file. Custom Hadoop configurations can be passed in with the --hadoopConfig [custom-hadoop-config-file] option and will override the default Hadoop configurations. For an example of a custom Hadoop config file, see hadoop.xml.

Contributing

Setup

For setting up the repo in IntelliJ, open the project and change the Java version to Java 11 in File -> Project Structure.

Have you found a bug, or do you have a cool idea you want to contribute to the project? Please file a GitHub issue here.

Adding a new target format

Adding a new target format requires a developer to implement the ConversionTarget interface. Once you have implemented that interface, you can integrate it into the ConversionController. If you think others may find that target useful, please raise a Pull Request to add it to the project.

Overview of the sync process

[Diagram: overview of the sync process]

incubator-xtable's People

Contributors

alberttwong, arifazmidd, ashvina, danielscamacho, daragu, dependabot[bot], dipankarmazumdar, hussein-awala, iemejia, jackwener, jcamachor, kywe665, lmccay, lordozb, poojanilangekar, sagarlakshmipathy, the-other-tim-brown, vamshigv, wuchunfu, yihua, zabetak


incubator-xtable's Issues

Reduce cases in ITOnetableClient

Relates to #20 and #21

Once the above are completed, we can reduce the number of cases we're covering in the integration tests and focus on just a handful of key flows as sanity checks that components are working together as expected.

Iterative execution of OneTable failed in deserializing OneSchema

I am reporting this error for awareness. At the moment I am unsure if the test setup is causing this failure. In my test I copied an existing Hudi table from one Az Storage account to another. Then, I executed OneTable conversion twice. The first execution completed successfully. The second attempt failed in deserializing the schema.
Below is the stack trace for reference. As I mentioned earlier, this may be a setup issue.

2023-08-29 09:52:41 INFO  io.onetable.client.OneTableClient:128 - OneTable Sync is successful for the following formats [DELTA, ICEBERG]

$ java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --configFilePath my_config.yaml  -h onetable-hadoop.xml
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-08-29 09:52:52 INFO  io.onetable.utilities.RunSync:96 - Running sync for basePath abfs://[email protected]/call_center for following table formats [DELTA, ICEBERG]
2023-08-29 09:52:52 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:130 - Loading HoodieTableMetaClient from abfs://[email protected]/call_center
2023-08-29 09:52:52 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-08-29 09:52:52 WARN  org.apache.hadoop.fs.azurebfs.utils.SSLSocketFactoryEx:131 - Failed to load OpenSSL. Falling back to the JSSE default.
2023-08-29 09:52:54 INFO  org.apache.hudi.common.table.HoodieTableConfig:268 - Loading table properties from abfs://[email protected]/call_center/.hoodie/hoodie.properties
2023-08-29 09:52:54 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:149 - Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from abfs://[email protected]/call_center
2023-08-29 09:52:54 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:152 - Loading Active commit timeline for abfs://[email protected]/call_center
2023-08-29 09:52:54 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline:171 - Loaded instants upto : Option{val=[20230702230621688__replacecommit__COMPLETED]}
2023-08-29 09:52:54 ERROR io.onetable.utilities.RunSync:116 - Error running sync for abfs://[email protected]/call_center
io.onetable.exception.OneIOException: Failed to get one table state
	at io.onetable.client.OneTableClient.getSyncState(OneTableClient.java:302) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
	at io.onetable.client.OneTableClient.sync(OneTableClient.java:103) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
	at io.onetable.utilities.RunSync.main(RunSync.java:114) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
Caused by: com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Invalid type definition for type `io.onetable.model.schema.OneSchema`: Argument #0 of constructor [constructor for `io.onetable.model.schema.OneSchema` (6 args), annotations: {interface com.fasterxml.jackson.annotation.JsonCreator=@com.fasterxml.jackson.annotation.JsonCreator(mode=DEFAULT)} has no property name (and is not Injectable): can not use as property-based Creator
 at [Source: (StringReader); line: 1, column: 1]
	at com.fasterxml.jackson.databind.exc.InvalidDefinitionException.from(InvalidDefinitionException.java:62) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
...
	at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
	at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:642) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
...
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3597) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
	at io.onetable.persistence.OneTableStateStorage.readOneTableState(OneTableStateStorage.java:142) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]
	at io.onetable.persistence.OneTableStateStorage.read(OneTableStateStorage.java:107) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:?]

BigLake Metastore Testing

Validate that the outputs of onetable can still be synced with BigLake metastore.

Write up a guide for how to sync the outputs to BigLake referencing existing documentation where appropriate.

Perform testing with large number of commits and partitions

We need to run testing for each source and target with a large number of partitions and commits in a single table (on the order of 10k commits and partitions for starters). This will help us understand any bottlenecks or improvements that can be made in our initial sync flow.

Fabric Query Testing

Validate that all formats can be read by the Fabric query engine after being written with Onetable.

Produce a user guide for how to integrate with this engine referencing existing documentation when appropriate.

Redshift spectrum query testing

Validate that all formats can be read by the Redshift query engine after being written with Onetable.

Produce a user guide for how to integrate with this engine referencing existing documentation when appropriate.

Enable CI for macOS

5d60018 set up CI for Linux (build and run tests) by relying on Azure Pipelines. I'm creating this issue to 1) validate that it is working as intended, and 2) enable CI for macOS as well.

Add Hudi Target Client

With the 0.14.0 release, there is support for handling arbitrary file names in Hudi. This was previously a blocker for making Hudi a target for tables written in the other formats.

Properly reflect rollbacks/restores in target tables

Right now when we see a rollback or restore in the source table, we just treat it as files being removed from the table. We should update this to instead issue a rollback command in the target tables so that the histories are more consistent between the source and target.

Mandatory tableName not set in main method and missing in sample config

tableName is mandatory in PerTableConfig for launching the OneTable application. However, it is not set within the main method. Additionally, it is missing from the sample config file provided, so the example as presented will fail.

Exception in thread "main" java.lang.NullPointerException: tableName is marked non-null but is null
	at io.onetable.client.PerTableConfig.<init>(PerTableConfig.java:33)
	at io.onetable.client.PerTableConfig$PerTableConfigBuilder.build(PerTableConfig.java:33)
	at io.onetable.utilities.RunSync.main(RunSync.java:119)

Generalize handling of concurrent writes during sync

OneTableClient has logic to handle the case where archival has already kicked in on the source, which requires a snapshot sync instead of an incremental sync to catch up to the active timeline. However, the logic is coupled to the Hudi API and had to be deactivated. This issue tracks re-enabling it so that it works for any format.

Originally posted by @the-other-tim-brown in #9 (comment)

AWS Glue testing

Validate that the outputs of onetable can still be synced with AWS Glue.

Write up a guide for how to sync the outputs to Glue referencing existing documentation where appropriate.

Improve Iceberg catalog support

In the IcebergClient we use HadoopTables for creating and loading existing tables. We can provide better support for Iceberg catalogs by using Iceberg's CatalogUtil#loadCatalog to create a catalog instance from a user-provided configuration and then using that catalog for our create and load operations.

Source client specific testing

We should create a suite of tests that we can run for all 3 sources that validate that we can read commits of all different types into the common model.

Currently we are doing this as part of the ITOneTableClient but these tests are long and will only get longer as more sources are added.

Target client specific testing

Relates to #20

Similar to sources, we should also test that the targets are properly persisting the data provided from the common in-memory model.

All conversion paths tested

We should have basic test flows for:

  1. Hudi -> Iceberg and Delta (already exists)
  2. Iceberg -> Delta and Hudi
  3. Delta -> Hudi and Iceberg

Round trip conversions

For each pair of formats, we should have a test that can write in format A, sync to B, write an update in B, sync to A, and read the results in A.

Snowflake query testing

Validate that all formats can be read by the Snowflake query engine after being written with Onetable.

Produce a user guide for how to integrate with this engine referencing existing documentation when appropriate.

[Enhancement] Add ability to extend Hadoop FS configs and ADLS Support

Currently, a subset of Hadoop FS config keys, supporting only S3 and GCS, is hardcoded in OneTable. This approach limits the ability to add new HDFS-compliant file systems or to customize configs for the currently supported storage clients. I propose decoupling the configuration settings from the main application logic in RunSync. This involves:

  • Extracting the current configuration into an XML file with default keys and values.
  • Taking a custom Hadoop config file as an input which could add new keys and/or override existing defaults.

By decoupling configurations, we can quickly test with new FS clients.

Presto Query Testing

Validate that all formats can be read by the Presto query engine after being written with Onetable.

Produce a user guide for how to integrate with this engine referencing existing documentation when appropriate.

Concurrency Testing

We should set up a test suite of cases that simulate a multi-writer scenario in a given source format and make sure that we can properly sync to the target tables.

Real world example: Hudi can run its clustering asynchronously and perform a clustering for some partitions while doing new updates on others.

Open Telemetry Integration

Instrument the project with OpenTelemetry or something similar to get a better understanding of the performance characteristics of Onetable.

Replace TableFormat enum with extensible config file

The TableFormat enum should be removed from OneTable as it restricts extensibility: currently it forces a rebuild whenever a new format needs to be supported. In practice, there could be multiple versions of the open formats, and users may also have formats of their own.
One option is to externalize the list of clients in a config file, which contains a default list of client names associated with default implementations and configurations.
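As a rough sketch of that option (the keys and custom class names below are illustrative, not an agreed design), such a file could mirror the tableFormatConverters config shown in the README above:
tableFormats:
  HUDI:
    conversionSourceProviderClass: org.apache.xtable.hudi.HudiConversionSourceProvider
  MY_CUSTOM_FORMAT:
    conversionSourceProviderClass: com.example.MyFormatConversionSourceProvider
    conversionTargetProviderClass: com.example.MyFormatConversionTargetProvider
A new format would then be added by dropping in a jar and an entry in this file, with no rebuild required.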

HMS Testing

Validate that the outputs of onetable can still be synced with Hive metastore.

Write up a guide for how to sync the outputs to HMS referencing existing documentation where appropriate.

BigQuery query testing

Validate that all formats can be read by the BigQuery query engine after being written with Onetable.

Produce a user guide for how to integrate with this engine referencing existing documentation when appropriate.

Remove Onetable state

The current implementation relies on Onetable's own state file which causes increased IO. We want to transition the source to only rely on the source format to avoid this extra overhead and improve performance.

Trino Query Testing

Validate that all formats can be read by the Trino query engine after being written with Onetable.

Produce a user guide for how to integrate with this engine referencing existing documentation when appropriate.

Unity Catalog Testing

Validate that the outputs of onetable can still be synced with Unity catalog.

Write up a guide for how to sync the outputs to Unity referencing existing documentation where appropriate.

Inconsistency in Incremental Sync when processing Clean commits

There will be an inconsistency when doing incremental sync of commits from a Hudi source in the following scenario. We should recognize this case and fall back to a snapshot sync to avoid inconsistencies.

MOR table

  1. commit1
  2. synced to target instant (so last synced instant = commit1)
  3. commit2 (upsert records in commit1)
  4. commit3 (compaction) replaces the base file in the file slice.
  5. commit4 (inserts)
  6. commit5 (clean, which removes the base file generated in commit1).
  7. Incremental sync will never replace the older base file from commit1 in the target, so we should avoid incremental sync here. More specifically, if the cleaner acts on a commit newer than the last synced instant in the target client, we should fall back to a snapshot sync.
