
hyperdrive's Introduction

Copyright 2018 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Hyperdrive - An extensible streaming ingestion pipeline on top of Apache Spark


What is Hyperdrive?

Hyperdrive is a configurable and scalable ingestion platform that allows data movement and transformation from streaming sources with exactly-once fault-tolerance semantics by using Apache Spark Structured Streaming.

In Hyperdrive, each ingestion is defined by three components: reader, transformer and writer. This separation allows adapting to different streaming sources and sinks, while reusing transformations that are common across multiple ingestion pipelines.

Motivation

Similar to batch processing, data ingestion pipelines are needed to process streaming data sources. While solutions for data pipelines exist, exactly-once fault-tolerance in streaming processing is an intricate problem and cannot be solved with the same strategies that exist for batch processing.

This is the gap Hyperdrive aims to fill, by leveraging the exactly-once guarantee of Spark's Structured Streaming and by providing a flexible data pipeline.

Architecture

The data ingestion pipeline of Hyperdrive consists of three components: readers, transformers and writers.

  • Readers define how to connect to sources, e.g. how to connect to Kafka in a secure cluster by providing security directives, which topic and brokers to connect to.
  • Transformers define transformations to be applied to the decoded DataFrame, e.g. dropping columns.
  • Writers define where DataFrames should be sent after the transformations, e.g. into HDFS as Parquet files.

Built-in components

  • KafkaStreamReader - reads from a Kafka topic.
  • ParquetStreamReader - reads Parquet files from a source directory.
  • ConfluentAvroDecodingTransformer - decodes the payload as Confluent Avro (through ABRiS), retrieving the schema from the specified Schema Registry. This transformer is capable of seamlessly handling whatever schemas the payload messages are using.
  • ConfluentAvroEncodingTransformer - encodes the payload as Confluent Avro (through ABRiS), updating the schema to the specified Schema Registry. This transformer is capable of seamlessly handling whatever schema the dataframe is using.
  • ColumnSelectorStreamTransformer - selects all columns from the decoded DataFrame.
  • AddDateVersionTransformer - adds columns for the ingestion date and an auto-incremented version number, to be used for partitioning.
  • ParquetStreamWriter - writes the DataFrame as Parquet, in append mode.
  • KafkaStreamWriter - writes to a Kafka topic.
  • DeltaCDCToSnapshotWriter - writes the DataFrame in Delta format. It expects CDC events, performs merge logic and creates the latest snapshot table.
  • DeltaCDCToSCD2Writer - writes the DataFrame in Delta format. It expects CDC events, performs merge logic and creates an SCD2 table.
  • HudiCDCToSCD2Writer - writes the DataFrame in Hudi format. It expects CDC events, performs merge logic and creates an SCD2 table.

Custom components

Custom components can be implemented using the Component Archetype, following the API defined in the package za.co.absa.hyperdrive.ingestor.api.

  • A custom component has to be a class which extends one of the abstract classes StreamReader, StreamTransformer or StreamWriter
  • The class needs to have a companion object which implements the corresponding trait StreamReaderFactory, StreamTransformerFactory or StreamWriterFactory
  • The implemented components have to be packaged to a jar file, which can then be added to the classpath of the driver. To use a component, it has to be configured as described under Usage

After that, the new component can be seamlessly invoked from the driver.
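
As an illustration, a custom transformer might look roughly like the sketch below. The subpackage (transformer), method names and the factory signature are assumptions based on the description above; in current versions the factory also has to implement HasComponentAttributes (see the issues further down), which is omitted here for brevity. The Component Archetype should be treated as the authoritative starting point.

import org.apache.commons.configuration2.Configuration
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.upper
import za.co.absa.hyperdrive.ingestor.api.transformer.{StreamTransformer, StreamTransformerFactory}

// Hypothetical transformer that upper-cases a configurable column.
class UppercaseColumnTransformer(val columnName: String) extends StreamTransformer {
  override def transform(dataFrame: DataFrame): DataFrame =
    dataFrame.withColumn(columnName, upper(dataFrame(columnName)))
}

// Companion object acting as the factory, instantiated by the driver from the configuration.
object UppercaseColumnTransformer extends StreamTransformerFactory {
  override def apply(config: Configuration): StreamTransformer =
    new UppercaseColumnTransformer(config.getString("transformer.uppercase.column"))
}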

Usage

Hyperdrive has to be executed with Spark. Due to Spark-Kafka integration issues, it will only work with Spark 2.3 and higher.

How to run

git clone git@github.com:AbsaOSS/hyperdrive.git
mvn clean package

Given that a configuration file has already been created, Hyperdrive can be executed as follows:

spark-submit --class za.co.absa.hyperdrive.driver.drivers.PropertiesIngestionDriver driver/target/driver*.jar config.properties

Alternatively, configuration properties can also be passed as command-line arguments:

spark-submit --class za.co.absa.hyperdrive.driver.drivers.CommandLineIngestionDriver driver/target/driver*.jar \
  component.ingestor=spark \
  component.reader=za.co.absa.hyperdrive.ingestor.implementation.reader.kafka.KafkaStreamReader \
  # more properties ...

Configuration

The configuration file may be created from the template located at driver/src/resources/Ingestion.properties.template.

CommandLineIngestionDriverDockerTest may be consulted for a working pipeline configuration.
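
For orientation, a minimal Kafka-to-Parquet pipeline configuration could look like the following sketch; topic, broker and path values are placeholders, and in practice a decoding transformer would usually precede the column selector.

component.ingestor=spark
component.reader=za.co.absa.hyperdrive.ingestor.implementation.reader.kafka.KafkaStreamReader
component.transformer.id.1=column.selector
component.transformer.class.column.selector=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer
component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetStreamWriter
reader.kafka.topic=my-topic
reader.kafka.brokers=broker1:9092,broker2:9092
transformer.column.selector.columns.to.select=*
writer.parquet.destination.directory=/tmp/destination
writer.common.checkpoint.location=/tmp/checkpoint/my-topic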

General settings

Pipeline settings
Property Name Required Description
component.ingestor Yes Defines the ingestion pipeline. Only spark is currently supported.
component.reader Yes Fully qualified name of reader component, e.g.za.co.absa.hyperdrive.ingestor.implementation.reader.kafka.KafkaStreamReader
component.transformer.id.{order} No An arbitrary but unique string, referenced in this documentation as {transformer-id}
component.transformer.class.{transformer-id} No Fully qualified name of transformer component, e.g. za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer
component.writer Yes Fully qualified name of writer component, e.g. za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetStreamWriter

Multiple transformers can be configured in the pipeline, including multiple instances of the same transformer. For each transformer instance, component.transformer.id.{order} and component.transformer.class.{transformer-id} have to be specified, where {order} and {transformer-id} need to be unique. In the above table, {order} must be an integer and may be negative. {transformer-id} is only used within the configuration to identify which configuration options belong to a certain transformer instance.
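
For example, two transformer instances could be configured like this (the ids and order values are arbitrary, and the column names are placeholders):

component.transformer.id.1=column.copy
component.transformer.class.column.copy=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.copy.ColumnCopyStreamTransformer
component.transformer.id.2=column.selector
component.transformer.class.column.selector=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer
transformer.column.copy.columns.copy.from=column1.fieldA
transformer.column.copy.columns.copy.to=newColumn.col1_fieldA
transformer.column.selector.columns.to.select=*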

Spark settings
Property Name Required Description
ingestor.spark.termination.method No Either processAllAvailable (stop query when no more messages are incoming) or awaitTermination (stop query on signal, e.g. Ctrl-C). Default: awaitTermination. See also Combination of trigger and termination method
ingestor.spark.await.termination.timeout No Timeout in milliseconds. Stops query when timeout is reached. This option is only valid with termination method awaitTermination

Settings for built-in components

KafkaStreamReader
Property Name Required Description
reader.kafka.topic Yes The name of the kafka topic to ingest data from. Equivalent to Spark property subscribe
reader.kafka.brokers Yes List of kafka broker URLs. Equivalent to Spark property kafka.bootstrap.servers

Any additional properties for kafka can be added with the prefix reader.option.. E.g. the property kafka.security.protocol can be added as reader.option.kafka.security.protocol

See e.g. the Structured Streaming + Kafka Integration Guide for optional kafka properties.
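
A reader configuration might look as follows; topic and broker values are placeholders, and the last two lines merely illustrate the reader.option. mechanism:

reader.kafka.topic=my-topic
reader.kafka.brokers=broker1:9092,broker2:9092
reader.option.kafka.security.protocol=SASL_SSL
reader.option.maxOffsetsPerTrigger=100000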

ParquetStreamReader

The parquet stream reader infers the schema from parquet files that already exist in the source directory. If no file exists, the reader will fail.

Property Name Required Description
reader.parquet.source.directory Yes Source path for the parquet files. Equivalent to Spark property path for the DataStreamReader

Any additional properties can be added with the prefix reader.parquet.options.. See Spark Structured Streaming Documentation
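
For example (the source path is a placeholder, and maxFilesPerTrigger is just one of Spark's file-source options):

reader.parquet.source.directory=/path/to/source
reader.parquet.options.maxFilesPerTrigger=10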

ConfluentAvroDecodingTransformer

The ConfluentAvroDecodingTransformer is built on ABRiS. More details about the configuration properties can be found there. Caution: The ConfluentAvroDecodingTransformer requires the property reader.kafka.topic to be set.

Property Name Required Description
transformer.{transformer-id}.schema.registry.url Yes URL of Schema Registry, e.g. http://localhost:8081. Equivalent to ABRiS property SchemaManager.PARAM_SCHEMA_REGISTRY_URL
transformer.{transformer-id}.value.schema.id Yes The schema id. Use latest or explicitly provide a number. Equivalent to ABRiS property SchemaManager.PARAM_VALUE_SCHEMA_ID
transformer.{transformer-id}.value.schema.naming.strategy Yes Subject name strategy of Schema Registry. Possible values are topic.name, record.name or topic.record.name. Equivalent to ABRiS property SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY
transformer.{transformer-id}.value.schema.record.name Yes for naming strategies record.name and topic.record.name Name of the record. Equivalent to ABRiS property SchemaManager.PARAM_SCHEMA_NAME_FOR_RECORD_STRATEGY
transformer.{transformer-id}.value.schema.record.namespace Yes for naming strategies record.name and topic.record.name Namespace of the record. Equivalent to ABRiS property SchemaManager.PARAM_SCHEMA_NAMESPACE_FOR_RECORD_STRATEGY
transformer.{transformer-id}.consume.keys No If set to true, keys will be consumed and added as columns to the dataframe. Key columns will be prefixed with key__
transformer.{transformer-id}.key.schema.id Yes if consume.keys is true The schema id for the key.
transformer.{transformer-id}.key.schema.naming.strategy Yes if consume.keys is true Subject name strategy for key
transformer.{transformer-id}.key.schema.record.name Yes for key naming strategies record.name and topic.record.name Name of the record.
transformer.{transformer-id}.key.schema.record.namespace Yes for key naming strategies record.name and topic.record.name Namespace of the record.
transformer.{transformer-id}.keep.columns No Comma-separated list of columns to keep (e.g. offset, partition)
transformer.{transformer-id}.disable.nullability.preservation No Set to true to ignore fix #137 and to keep the same behaviour as for versions prior to and including v3.2.2. Default value: false
transformer.{transformer-id}.schema.registry.basic.auth.user.info.file No A path to a text file that contains one line in the form <username>:<password>. It will be passed as basic.auth.user.info to the schema registry config
transformer.{transformer-id}.use.advanced.schema.conversion No Set to true to convert the avro schema using AdvancedAvroToSparkConverter, which puts default value and underlying avro type to struct field metadata. Default false

For detailed information on the subject name strategy, please take a look at the Schema Registry Documentation.

Any additional properties for the schema registry config can be added with the prefix transformer.{transformer-id}.schema.registry.options.

Note: use.advanced.schema.conversion only works with a patched version of Spark, due to bug SPARK-34805. For the latest version of Spark, the patch is available in apache/spark#35270. For other versions of Spark, the changes need to be cherry-picked and built locally.
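
For a transformer registered under the id avro.decoder (see Pipeline settings for the class registration), a decoder configuration might look like this; the Schema Registry URL is a placeholder:

transformer.avro.decoder.schema.registry.url=http://localhost:8081
transformer.avro.decoder.value.schema.id=latest
transformer.avro.decoder.value.schema.naming.strategy=topic.name
transformer.avro.decoder.consume.keys=true
transformer.avro.decoder.key.schema.id=latest
transformer.avro.decoder.key.schema.naming.strategy=topic.name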

ConfluentAvroEncodingTransformer

The ConfluentAvroEncodingTransformer is built on ABRiS. More details about the configuration properties can be found there. Caution: The ConfluentAvroEncodingTransformer requires the property writer.kafka.topic to be set.

Property Name Required Description
transformer.{transformer-id}.schema.registry.url Yes URL of Schema Registry, e.g. http://localhost:8081. Equivalent to ABRiS property SchemaManager.PARAM_SCHEMA_REGISTRY_URL
transformer.{transformer-id}.value.schema.naming.strategy Yes Subject name strategy of Schema Registry. Possible values are topic.name, record.name or topic.record.name. Equivalent to ABRiS property SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY
transformer.{transformer-id}.value.schema.record.name Yes for naming strategies record.name and topic.record.name Name of the record. Equivalent to ABRiS property SchemaManager.PARAM_SCHEMA_NAME_FOR_RECORD_STRATEGY
transformer.{transformer-id}.value.schema.record.namespace Yes for naming strategies record.name and topic.record.name Namespace of the record. Equivalent to ABRiS property SchemaManager.PARAM_SCHEMA_NAMESPACE_FOR_RECORD_STRATEGY
transformer.{transformer-id}.value.optional.fields No Comma-separated list of nullable value columns that should get default value null in the avro schema. Nested columns' names should be concatenated with the dot (.)
transformer.{transformer-id}.produce.keys No If set to true, keys will be produced according to the properties key.column.prefix and key.column.names of the Hyperdrive Context
transformer.{transformer-id}.key.schema.naming.strategy Yes if produce.keys is true Subject name strategy for key
transformer.{transformer-id}.key.schema.record.name Yes for key naming strategies record.name and topic.record.name Name of the record.
transformer.{transformer-id}.key.schema.record.namespace Yes for key naming strategies record.name and topic.record.name Namespace of the record.
transformer.{transformer-id}.key.optional.fields No Comma-separated list of nullable key columns that should get default value null in the avro schema. Nested columns' names should be concatenated with the dot (.)
transformer.{transformer-id}.schema.registry.basic.auth.user.info.file No A path to a text file that contains one line in the form <username>:<password>. It will be passed as basic.auth.user.info to the schema registry config
transformer.{transformer-id}.use.advanced.schema.conversion No Set to true to convert the avro schema using AdvancedSparkToAvroConverter, which reads default value and underlying avro type from struct field metadata. Default false

Any additional properties for the schema registry config can be added with the prefix transformer.{transformer-id}.schema.registry.options.
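
Similarly, an encoder registered under the id avro.encoder might be configured as follows; the values are illustrative only:

transformer.avro.encoder.schema.registry.url=http://localhost:8081
transformer.avro.encoder.value.schema.naming.strategy=topic.name
transformer.avro.encoder.produce.keys=true
transformer.avro.encoder.key.schema.naming.strategy=topic.name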

ColumnSelectorStreamTransformer
Property Name Required Description
transformer.{transformer-id}.columns.to.select Yes Comma-separated list of columns to select. * can be used to select all columns. Only existing columns may be selected by name (i.e. expressions cannot be constructed)
AddDateVersionTransformer

The AddDateVersionTransformer adds the columns hyperdrive_date and hyperdrive_version. hyperdrive_date is the ingestion date (or a user-defined date), while hyperdrive_version is a number automatically incremented with every ingestion, starting at 1. For the auto-increment to work, hyperdrive_date and hyperdrive_version need to be defined as partition columns. Caution: This transformer requires a writer which defines writer.parquet.destination.directory.

Property Name Required Description
transformer.{transformer-id}.report.date No User-defined date for hyperdrive_date in format yyyy-MM-dd. Default date is the date of the ingestion
ColumnRenamingStreamTransformer

ColumnRenamingStreamTransformer allows renaming of columns specified in the configuration.

To add the transformer to the pipeline use this class name:

component.transformer.class.{transformer-id} = za.co.absa.hyperdrive.ingestor.implementation.transformer.column.renaming.ColumnRenamingStreamTransformer
Property Name Required Description
transformer.{transformer-id}.columns.rename.from Yes A comma-separated list of columns to rename. For example, column1, column2.
transformer.{transformer-id}.columns.rename.to Yes A comma-separated list of new column names. For example, column1_new, column2_new.
ColumnCopyStreamTransformer

ColumnCopyStreamTransformer allows copying of columns specified in the configuration. Dots in column names are interpreted as nested structs, unless they are surrounded by backticks (same as Spark convention)

Note that usage of the star-operator * within column names is not supported and may lead to unexpected behaviour.

To add the transformer to the pipeline use this class name:

component.transformer.class.{transformer-id} = za.co.absa.hyperdrive.ingestor.implementation.transformer.column.copy.ColumnCopyStreamTransformer
Property Name Required Description
transformer.{transformer-id}.columns.copy.from Yes A comma-separated list of columns to copy from. For example, column1.fieldA, column2.fieldA.
transformer.{transformer-id}.columns.copy.to Yes A comma-separated list of new column names. For example, newColumn.col1_fieldA, newColumn.col2_fieldA.

Example

Given a dataframe with the following schema

 |-- column1
 |    |-- fieldA
 |    |-- fieldB
 |-- column2
 |    |-- fieldA
 |-- column3

Then, the following column parameters

  • transformer.{transformer-id}.columns.copy.from=column1.fieldA, column2.fieldA
  • transformer.{transformer-id}.columns.copy.to=newColumn.col1_fieldA, newColumn.col2_fieldA

will produce the following schema

 |-- column1
 |    |-- fieldA
 |    |-- fieldB
 |-- column2
 |    |-- fieldA
 |-- column3
 |-- newColumn
 |    |-- col1_fieldA
 |    |-- col2_fieldA

DeduplicateKafkaSinkTransformer

DeduplicateKafkaSinkTransformer deduplicates records in a query from a Kafka source to a Kafka destination in a rerun after a failure. Records are identified across the source and destination topics by a user-defined id, which may be a composite id and may include consumer record properties such as offset and partition, but also fields from the key or value schema. Deduplication is needed because the Kafka destination provides only an at-least-once guarantee. Deduplication works by getting the ids of the last partial run from the destination topic and excluding them in the query.

Note that there must be only one source and one destination topic, only one writer may write to the destination topic, and no records may have been written to the destination topic after the partial run. Otherwise, records may still be duplicated.

To use this transformer, KafkaStreamReader, ConfluentAvroDecodingTransformer, ConfluentAvroEncodingTransformer and KafkaStreamWriter must be configured as well.

Note that usage of the star-operator * within column names is not supported and may lead to unexpected behaviour.

To add the transformer to the pipeline use this class name:

component.transformer.class.{transformer-id} = za.co.absa.hyperdrive.ingestor.implementation.transformer.deduplicate.kafka.DeduplicateKafkaSinkTransformer
Property Name Required Description
transformer.{transformer-id}.source.id.columns Yes A comma-separated list of consumer record properties that define the composite id. For example, offset, partition or key.some_user_id.
transformer.{transformer-id}.destination.id.columns Yes A comma-separated list of consumer record properties that define the composite id. For example, value.src_offset, value.src_partition or key.some_user_id.
transformer.{transformer-id}.kafka.consumer.timeout No Kafka consumer timeout in seconds. The default value is 120s.

The following fields can be selected on the consumer record

  • topic
  • offset
  • partition
  • timestamp
  • timestampType
  • serializedKeySize
  • serializedValueSize
  • key
  • value

In case of key and value, the fields of their schemas can be specified by adding a dot, e.g. key.some_nested_record.some_id or likewise value.some_nested_record.some_id

See Pipeline settings for details about {transformer-id}.
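
A possible configuration, assuming the transformer is registered under the id kafka.deduplicator and the id columns are placeholders:

component.transformer.id.3=kafka.deduplicator
component.transformer.class.kafka.deduplicator=za.co.absa.hyperdrive.ingestor.implementation.transformer.deduplicate.kafka.DeduplicateKafkaSinkTransformer
transformer.kafka.deduplicator.source.id.columns=offset, partition
transformer.kafka.deduplicator.destination.id.columns=value.src_offset, value.src_partition
transformer.kafka.deduplicator.kafka.consumer.timeout=180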

ParquetStreamWriter
Property Name Required Description
writer.parquet.destination.directory Yes Destination path of the sink. Equivalent to Spark property path for the DataStreamWriter
writer.parquet.partition.columns No Comma-separated list of columns to partition by.
writer.parquet.metadata.check No Set this option to true if the consistency of the metadata log should be checked prior to the query. For very large tables, the check may be very expensive
writer.common.trigger.type No See Combination writer properties
writer.common.trigger.processing.time No See Combination writer properties

Any additional properties for the DataStreamWriter can be added with the prefix writer.parquet.options, e.g. writer.parquet.options.key=value
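
An example writer configuration; the destination path is a placeholder, and the compression option merely illustrates the writer.parquet.options prefix:

writer.parquet.destination.directory=/tmp/destination
writer.parquet.partition.columns=hyperdrive_date, hyperdrive_version
writer.parquet.metadata.check=true
writer.parquet.options.compression=snappy
writer.common.checkpoint.location=/tmp/checkpoint/my-topic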

KafkaStreamWriter
Property Name Required Description
writer.kafka.topic Yes The name of the kafka topic to write data to. Equivalent to Spark property topic
writer.kafka.brokers Yes List of kafka broker URLs. Equivalent to Spark property kafka.bootstrap.servers
writer.common.trigger.type No See Combination writer properties
writer.common.trigger.processing.time No See Combination writer properties
MongoDbStreamWriter
Property Name Required Description
writer.mongodb.uri Yes Output MongoDB URI, e.g. mongodb://host:port/database.collection.
writer.mongodb.database No Database name (if not specified as the part of URI).
writer.mongodb.collection No Collection name (if not specified as the part of URI).
writer.common.trigger.type No See Combination writer properties
writer.common.trigger.processing.time No See Combination writer properties

Any additional properties for the DataStreamWriter can be added with the prefix writer.mongodb.options, e.g. writer.mongodb.options.key=value

Common MongoDB additional options

Property Name Default Description
writer.mongodb.option.spark.mongodb.output.ordered true When set to false inserts are done in parallel, increasing performance, but the order of documents is not preserved.
writer.mongodb.option.spark.mongodb.output.forceInsert false Forces saves to use inserts, even if a Dataset contains _id.
More on these options: https://docs.mongodb.com/spark-connector/current/configuration
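
An example configuration; the URI and checkpoint path are placeholders:

writer.mongodb.uri=mongodb://localhost:27017/mydatabase.mycollection
writer.mongodb.option.spark.mongodb.output.ordered=false
writer.common.checkpoint.location=/tmp/checkpoint/mongodb-sink
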
DeltaCDCToSnapshotWriter
Property Name Required Description
writer.deltacdctosnapshot.destination.directory Yes Destination path of the sink. Equivalent to Spark property path for the DataStreamWriter
writer.deltacdctosnapshot.partition.columns No Comma-separated list of columns to partition by.
writer.deltacdctosnapshot.key.column Yes A column with a unique entity identifier.
writer.deltacdctosnapshot.operation.column Yes A column containing the value that marks a record's operation.
writer.deltacdctosnapshot.operation.deleted.values Yes Values marking a record for deletion in the operation column.
writer.deltacdctosnapshot.precombineColumns Yes When two records have the same key, the one with the largest values in the precombine columns is kept. Evaluated in the provided order.
writer.deltacdctosnapshot.precombineColumns.customOrder No A precombine column's custom order, in ascending order.
writer.common.trigger.type No See Combination writer properties
writer.common.trigger.processing.time No See Combination writer properties

Any additional properties for the DataStreamWriter can be added with the prefix writer.deltacdctosnapshot.options, e.g. writer.deltacdctosnapshot.options.key=value

Example

  • component.writer=za.co.absa.hyperdrive.compatibility.impl.writer.cdc.delta.snapshot.DeltaCDCToSnapshotWriter
  • writer.deltacdctosnapshot.destination.directory=/tmp/destination
  • writer.deltacdctosnapshot.key.column=key
  • writer.deltacdctosnapshot.operation.column=ENTTYP
  • writer.deltacdctosnapshot.operation.deleted.values=DL,FD
  • writer.deltacdctosnapshot.precombineColumns=TIMSTAMP, ENTTYP
  • writer.deltacdctosnapshot.precombineColumns.customOrder.ENTTYP=PT,FI,RR,UB,UP,DL,FD
DeltaCDCToSCD2Writer
Property Name Required Description
writer.deltacdctoscd2.destination.directory Yes Destination path of the sink. Equivalent to Spark property path for the DataStreamWriter
writer.deltacdctoscd2.partition.columns No Comma-separated list of columns to partition by.
writer.deltacdctoscd2.key.column Yes A column with a unique entity identifier.
writer.deltacdctoscd2.timestamp.column Yes A column containing the timestamp.
writer.deltacdctoscd2.operation.column Yes A column containing the value that marks a record's operation.
writer.deltacdctoscd2.operation.deleted.values Yes Values marking a record for deletion in the operation column.
writer.deltacdctoscd2.precombineColumns Yes When two records have the same key and timestamp, the one with the largest values in the precombine columns is kept. Evaluated in the provided order.
writer.deltacdctoscd2.precombineColumns.customOrder No A precombine column's custom order, in ascending order.
writer.common.trigger.type No See Combination writer properties
writer.common.trigger.processing.time No See Combination writer properties

Any additional properties for the DataStreamWriter can be added with the prefix writer.deltacdctoscd2.options, e.g. writer.deltacdctoscd2.options.key=value

Example

  • component.writer=za.co.absa.hyperdrive.compatibility.impl.writer.cdc.delta.scd2.DeltaCDCToSCD2Writer
  • writer.deltacdctoscd2.destination.directory=/tmp/destination
  • writer.deltacdctoscd2.key.column=key
  • writer.deltacdctoscd2.timestamp.column=TIMSTAMP
  • writer.deltacdctoscd2.operation.column=ENTTYP
  • writer.deltacdctoscd2.operation.deleted.values=DL,FD
  • writer.deltacdctoscd2.precombineColumns=ENTTYP
  • writer.deltacdctoscd2.precombineColumns.customOrder.ENTTYP=PT,FI,RR,UB,UP,DL,FD
HudiCDCToSCD2Writer
Property Name Required Description
writer.hudicdctoscd2.destination.directory Yes Destination path of the sink. Equivalent to Spark property path for the DataStreamWriter
writer.hudicdctoscd2.partition.columns No Comma-separated list of columns to partition by.
writer.hudicdctoscd2.key.column Yes A column with a unique entity identifier.
writer.hudicdctoscd2.timestamp.column Yes A column containing the timestamp.
writer.hudicdctoscd2.operation.column Yes A column containing the value that marks a record's operation.
writer.hudicdctoscd2.operation.deleted.values Yes Values marking a record for deletion in the operation column.
writer.hudicdctoscd2.precombineColumns Yes When two records have the same key and timestamp, the one with the largest values in the precombine columns is kept. Evaluated in the provided order.
writer.hudicdctoscd2.precombineColumns.customOrder No A precombine column's custom order, in ascending order.
writer.common.trigger.type No See Combination writer properties
writer.common.trigger.processing.time No See Combination writer properties

Any additional properties for the DataStreamWriter can be added with the prefix writer.hudicdctoscd2.options, e.g. writer.hudicdctoscd2.options.key=value

Example

  • component.writer=za.co.absa.hyperdrive.compatibility.impl.writer.cdc.hudi.scd2.HudiCDCToSCD2Writer
  • writer.hudicdctoscd2.destination.directory=/tmp/destination
  • writer.hudicdctoscd2.key.column=key
  • writer.hudicdctoscd2.timestamp.column=TIMSTAMP
  • writer.hudicdctoscd2.operation.column=ENTTYP
  • writer.hudicdctoscd2.operation.deleted.values=DL,FD
  • writer.hudicdctoscd2.precombineColumns=ENTTYP
  • writer.hudicdctoscd2.precombineColumns.customOrder.ENTTYP=PT,FI,RR,UB,UP,DL,FD

Common writer properties

Property Name Required Description
writer.common.checkpoint.location Yes Used for Spark property checkpointLocation. The checkpoint location has to be unique among different workflows.
writer.common.trigger.type No Either Once for one-time execution or ProcessingTime for micro-batch execution. Default: Once.
writer.common.trigger.processing.time No Interval in ms for micro-batch execution (using ProcessingTime). Default: 0ms, i.e. execution as fast as possible.
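
For example, a long-running micro-batch query could be configured like this; the interval and checkpoint path are placeholders:

writer.common.checkpoint.location=/tmp/checkpoint/my-workflow
writer.common.trigger.type=ProcessingTime
writer.common.trigger.processing.time=60000
ingestor.spark.termination.method=awaitTermination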

Behavior of Triggers

Trigger (writer.common.trigger.type) Termination method (ingestor.spark.termination.method) Runtime Details
Once AwaitTermination or ProcessAllAvailable Limited Consumes all data that is available at the beginning of the micro-batch. The query processes exactly one micro-batch and stops then, even if more data would be available at the end of the micro-batch.
Once AwaitTermination with timeout Limited Same as above, but terminates at the timeout. If the timeout is reached before the micro-batch is processed, it won't be completed and no data will be committed.
ProcessingTime ProcessAllAvailable Only long-running if topic continuously produces messages, otherwise limited Consumes all available data in micro-batches and only stops when no new data arrives, i.e. when the available offsets are the same as in the previous micro-batch. Thus, whether and when the query terminates depends entirely on the topic.
ProcessingTime AwaitTermination with timeout Limited Consumes data in micro-batches and only stops when the timeout is reached or the query is killed.
ProcessingTime AwaitTermination Long-running Consumes data in micro-batches and only stops when the query is killed.
  • Note 1: The first micro-batch of the query will contain all available messages to consume and can therefore be quite large, even if the trigger ProcessingTime is configured, and regardless of what micro-batch interval is configured. To limit the size of a micro-batch, the property reader.option.maxOffsetsPerTrigger should be used. See also http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
  • Note 2: It's possible to define a timeout for trigger Once. If the timeout is reached before the micro-batch is processed, it won't be completed and no data will be committed. Such a behavior seems quite unpredictable and therefore we don't recommend it.

See the Spark Documentation for more information about triggers.

Hyperdrive Context

HyperdriveContext is an object intended to be used by the components to share data. It is a key-value store, where the key is a string and the value can be of any type. The following context variables are currently used by the default implementation.

Name Type Description
key.column.prefix String If ConfluentAvroDecodingTransformer is configured to consume keys, it prefixes the key columns with key__ such that they can be distinguished in the dataframe. If key__ happens to be a prefix of a value column, a random alphanumeric string is used instead.
key.column.names Seq[String] If ConfluentAvroDecodingTransformer is configured to consume keys, it contains the original column names (without prefix) in the key schema.

Secrets Providers

AWS SecretsManager
Property Name Required Description
secretsprovider.config.providers.<provider-id>.class Yes The fully qualified class name of the secrets provider. <provider-id> is an arbitrary string. Multiple secrets providers can be configured by supplying multiple <provider-id>s
secretsprovider.config.defaultprovider No The <provider-id> of the secrets provider to be used by default
secretsprovider.secrets.<secret-id>.options.secretname Yes The secret name of the secret in AWS Secrets Manager. <secret-id> is an arbitrary string. Multiple secrets can be configured by supplying multiple <secret-id>s
secretsprovider.secrets.<secret-id>.options.provider No The <provider-id> of the secrets provider to be used for this specific secret.
secretsprovider.secrets.<secret-id>.options.readasmap No Set to true if the secret should be interpreted as a json map, set to false if the value should be read as is. Default: true
secretsprovider.secrets.<secret-id>.options.key No If the secret should be read as a map, specify the key whose value should be extracted as the secret
secretsprovider.secrets.<secret-id>.options.encoding No Decodes the secret. Valid values: base64

The Secrets Provider will fill the configuration property secretsprovider.secrets.<secret-id>.secretvalue with the secret value. This configuration key will be available for string interpolation to be used by other configuration properties.

Example

  • secretsprovider.config.providers.awssecretsmanager.class=za.co.absa.hyperdrive.driver.secrets.implementation.aws.AwsSecretsManagerSecretsProvider
  • secretsprovider.secrets.truststorepassword.provider=awssecretsmanager
  • secretsprovider.secrets.truststorepassword.options.secretname=<the-secret-name>
  • secretsprovider.secrets.truststorepassword.options.key=<the-secret-key>
  • secretsprovider.secrets.truststorepassword.options.encoding=base64
  • reader.option.kafka.ssl.truststore.password=${secretsprovider.secrets.truststorepassword.secretvalue}

Other

Hyperdrive uses Apache Commons Configuration 2. This allows properties to be referenced, e.g. like so

transformer.[avro.decoder].schema.registry.url=http://localhost:8081
writer.kafka.schema.registry.url=${transformer.[avro.decoder].schema.registry.url}

Workflow Manager

Hyperdrive ingestions may be triggered using the Workflow Manager, which is developed in a separate repository: https://github.com/AbsaOSS/hyperdrive-trigger

A key feature of the Workflow Manager is triggers, which define when an ingestion should be executed and how it should be requested. The Workflow Manager supports cron-based triggers as well as triggers that listen to a notification topic.

How to build

  • Scala 2.12, Spark 2.4 (default)
mvn clean install
  • Scala 2.12, Spark 3.0
mvn clean install -Pscala-2.12,spark-3
  • Scala 2.11, Spark 2.4
mvn scala-cross-build:change-version -Pscala-2.11,spark-2
mvn clean install -Pscala-2.11,spark-2
mvn scala-cross-build:restore-version

E2E tests with Docker

E2E tests require a running Docker instance on the executing machine and are not executed by default. To execute them, build using the profile all-tests

mvn clean test -Pall-tests

How to measure code coverage

mvn clean install -Pscala-2.ZY,spark-Z,code-coverage

If the module contains measurable data, the code coverage report will be generated at the following path:

{project-path}\hyperdrive\{module}\target\site\jacoco

hyperdrive's People

Contributors

anilpinarozdemir, ashishtiwary, benedeki, dependabot[bot], felipemmelo, jozefbakus, kevinwallimann, miroslavpojer, senelesithole, yruslan, zejnilovic


hyperdrive's Issues

Refactor SparkIngestor to a configurable component

Currently, the SparkIngestor doesn't accept any configuration. This will be needed for long-running jobs to choose either StreamingQuery.awaitTermination or StreamingQuery.processAllAvailable.

Additionally, it's currently not possible to pass configuration options to the SparkSession via the config file or the command line

Tasks

  • Make SparkIngestor a class that accepts a SparkSession and Configuration which is instantiated by a companion object, much like the other components e.g. KafkaStreamReader etc.
  • Pass all options with prefix ingestor.spark.options. on to the SparkSession

Update Readme

Update readme with descriptions for spark summit, and add documentation for components

Express ConfluentAvroKafkaDecoder as transformer

KafkaStreamReader.read should return DataFrame
ConfluentAvroKafkaDecoder.decode could then accept DataFrame and return DataFrame, making it a Transformer, e.g. za.co.absa.hyperdrive.ingestor.implementation.transformer.avro.ConfluentAvroDecoderTransformer

Then, the Decoder component is not needed anymore.

How to migrate Hyperdrive-Trigger

  1. Replace
"component.decoder=za.co.absa.hyperdrive.ingestor.implementation.decoder.avro.confluent.ConfluentAvroKafkaStreamDecoder"

with

"component.transformer.id.0=confluent.avro.decoder", "component.transformer.class.confluent.avro.decoder=za.co.absa.hyperdrive.ingestor.implementation.transformer.avro.ConfluentAvroDecoderTransformer"
  2. Replace
decoder.avro

with

transformer.confluent.avro.decoder

Refactor ParquetPartitioningStreamWriter to transformer

ParquetPartitioningStreamWriter does two things: It adds two columns (i.e. transformation) and writes the dataframe partitioned (special write). With #116 the two responsibilities can be separated: ParquetStreamWriter is enhanced to write partitioned. Thus, only the transformation is left for ParquetPartitioningStreamWriter.

Tasks

  • Refactor ParquetPartitioningStreamWriter to a transformer and rename
  • Merge AbstractParquetStreamWriter with ParquetStreamWriter

How to migrate Hyperdrive-Trigger

  1. Replace
"component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetPartitioningStreamWriter"

with

"component.transformer.id.2=add.date.version", "component.transformer.class.add.date.version=za.co.absa.hyperdrive.ingestor.implementation.transformer.add.dateversion.AddDateVersionTransformer",
"component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetStreamWriter"
  2. Replace
writer.parquet.partitioning.report.date

with

transformer.add.date.version.report.date
  3. Replace
"writer.parquet.destination

with

"transformer.add.date.version.destination=${writer.parquet.destination}", "writer.parquet.partition.columns=hyperdrive_date, hyperdrive_version", "writer.parquet.destination

Make sure there is no workflow using ParquetPartitioning and partition columns at the same time

Clean up shared module

Currently, not all classes in module shared are really used in multiple modules. With #99 the module needs to be published to maven. To keep its footprint small, only classes whose responsibilities span across multiple modules should be kept in shared. All other classes should be scoped package private on za.co.absa.hyperdrive.

Tasks

  • ConfigurationKeys: Move to ingestor-default package, only move object IngestorKeys to driver module.
  • IngestionException and IngestionStartException: Move to driver module
  • FileUtils, ConfigUtils should be moved to ingestor-default package. Also merge it with the SchemaRegistrySettingsUtil from #108 . They should be made package private on za.co.absa.hyperdrive
  • SparkUtils should be removed since it has no usages. With it, also TestSparkUtils should be removed. Then, the dependency on testutils can be removed. testutils can be removed altogether.

Remove unreachable code in KafkaStreamReader

Currently, the else-block in the following snippet in KafkaStreamReader is unreachable.

    val optionalKeys = getKeysFromPrefix(configuration, rootFactoryOptionalConfKey)

    val extraConfs = optionalKeys.foldLeft(Map[String,String]()) {
      case (map,securityKey) =>
        getOrNone(securityKey, configuration) match {
          case Some(value) => map + (tweakOptionKeyName(securityKey) -> value)
          case None => map
        }
    }

    if (extraConfs.isEmpty || extraConfs.size == optionalKeys.size) {
      extraConfs
    }
    else {
      logger.warn(s"Assuming no security settings, since some appear to be missing: {${findMissingKeys(optionalKeys, extraConfs)}}")
      Map[String,String]()
    }

optionalKeys gets keys from the configuration and extraConfs gets the optionalKeys again from configuration. configuration does not change between the two calls. Therefore, extraConfs will always have the same size as optionalKeys

Moreover, even if some settings were missing, they should not be removed, as this could lead to unexpected error messages downstream.
Therefore, the if-else-block should be removed.
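
For illustration, after dropping the if-else block the snippet could be reduced to returning extraConfs directly:

    val optionalKeys = getKeysFromPrefix(configuration, rootFactoryOptionalConfKey)

    // Keep every optional key that has a value; missing keys are simply omitted.
    optionalKeys.foldLeft(Map[String, String]()) {
      case (map, securityKey) =>
        getOrNone(securityKey, configuration) match {
          case Some(value) => map + (tweakOptionKeyName(securityKey) -> value)
          case None => map
        }
    }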

Don't use multiple = characters to encode extra configuration

Currently, extra configuration (with keys only known at runtime) is provided as follows:
writer.parquet.extra.conf.1=key1=value1

This format is not intuitive. Also the .1 in the key does not convey any information. Therefore, the extra configuration should be changed to
writer.parquet.option.key=value

To that end, Configuration.subset and Configuration.keys might be used

Unfortunately, this is inconsistent with reader.option.key=value, but it's consistent with all other configuration properties which include an identifier for the component. Arguably, reader.option.key=value should be changed to reader.kafka.option.key=value even though this results in properties like reader.kafka.option.kafka.security.protocol

Breaking changes
Configuration property naming pattern for extra configuration changes from writer.parquet.extra.conf.1=key1=value1 to writer.parquet.options.key1=value1

Fix inconsistencies between CommandLineIngestorDriver and PropertiesIngestorDriver

Fix inconsistencies:

  1. In CommandLineIngestorDriver whitespace in key and value is not trimmed, while it is trimmed in PropertiesIngestorDriver. E.g. key1 = value1 will be stored as key1 (with whitespace at end) with CommandLineIngestorDriver, but as key1 in PropertiesIngestorDriver. The behavior of PropertiesIngestorDriver is the expected one.

  2. An empty property, i.e. some.key= causes an exception in CommandLineIngestorDriver, but is allowed in PropertiesIngestorDriver. This is inconsistent. Either it should always cause an exception, or it should be allowed. Probably the behavior of PropertiesIngestorDriver should be favoured since it directly uses the behaviour of Configuration2. Users familiar with Configuration2 would find a differing behaviour unexpected.

Add Jenkinsfile to the project

Add basic Jenkinsfile to the project so we can do automatic builds on a creation of a PR
Consult @hamc17

def hyperdriveSlaveLabel = getHyperdriveSlaveLabel()
def toolVersionJava = getToolVersionJava()
def toolVersionMaven = getToolVersionMaven()
def toolVersionGit = getToolVersionGit()
def mavenSettingsId = getMavenSettingsId()

pipeline {
    agent {
        label "${hyperdriveSlaveLabel}"
    }
    tools {
        jdk "${toolVersionJava}"
        maven "${toolVersionMaven}"
        git "${toolVersionGit}"
    }
    options {
        buildDiscarder(logRotator(numToKeepStr: '20'))
        timestamps()
    }
    stages {
        stage ('Build') {
            steps {
                configFileProvider([configFile(fileId: "${mavenSettingsId}", variable: 'MAVEN_SETTINGS_XML')]) {
                    sh "mvn -s $MAVEN_SETTINGS_XML clean package"
                }
            }
        }
    }
}

Define default error message

The methods getOrThrow and getSeqOrThrow in ConfigUtils might throw an IllegalArgumentException without an error message. This may lead to errors that are hard to understand.

Sensible default messages should be provided.
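
A possible sketch of such a default, assuming a signature along the lines of the existing getOrThrow(key, configuration, errorMessage) usage; the exact signature in ConfigUtils may differ:

import org.apache.commons.configuration2.Configuration

def getOrThrow(key: String, configuration: Configuration, errorMessage: String = ""): String =
  Option(configuration.getString(key)).filter(_.nonEmpty).getOrElse {
    val message =
      if (errorMessage.nonEmpty) errorMessage
      else s"No value found for mandatory configuration property '$key'" // sensible default
    throw new IllegalArgumentException(message)
  }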

Refactor ConfigurationKeys

Remove ConfigurationKeys. Move all objects to the newly created ...Attributes objects to break the coupling induced by ConfigurationKeys

Configuration property keys of components should be accessible via reflection

Currently, the configuration property keys are hard-coded in the ingestor (or any custom component jar). This makes it impossible to infer the required configuration properties, given just the jars. It should be possible that an external program (e.g. the hyperdrive-trigger) can read the available configuration properties of a component (through reflection) if the jars are given.

To that end, every ComponentFactory should implement a method that returns a list of configuration properties. Furthermore, this list should also contain information whether a configuration property is required or optional, and some validation rules.
Some components may depend on each other, e.g. the CheckpointOffsetManager requires the KafkaStreamReader to be configured (or at least the property reader.kafka.topic must be defined). This should be taken into account as well.

Possible breakdown of this issue:

  1. List of config properties with required/optional info.
  2. Add validation rules to list
  3. Provide e.g. another method to define dependencies on other components (multiple components, either component A or component B,...)

Migration from older versions
Classes that implemented StreamReaderFactory, OffsetManagerFactory, StreamDecoderFactory, StreamTransformerFactory or StreamWriterFactory must now implement the trait HasComponentAttributes

Remove retention policy

With ABRiS 3.1.0, the retention policy is removed from the library. SchemaRetentionPolicies are not needed for version 3 since you can easily select what you need using standard Spark functions.

As it is, the retention policy is never used, even though it is a mandatory configuration property. Therefore, any retention policy configuration should be removed.

  1. Remove getSchemaRetentionPolicy from ConfluentAvroKafkaStreamDecoder
  2. Remove KEY_SCHEMA_RETENTION_POLICY from ConfigurationKeys
  3. Remove decoder.avro.schema.retention.policy from Ingestion.properties.template

Driver - if no properties file found, use the template

I suggest that if the build does not find any properties file in the resources of the driver module, the template is copied over, which, if I understand it correctly, should be usable for the basic use case of Hyperdrive and maybe to demonstrate how it works.

This is done for modules within Enceladus as well, from a pom file; the plugin used is maven-antrun-plugin.

Move CheckpointOffsetManager logic to Reader and Writer

The StreamManager has two methods with signatures

  def configure(streamReader: DataStreamReader, configuration: Configuration): DataStreamReader

  def configure(streamWriter: DataStreamWriter[Row], configuration: Configuration): DataStreamWriter[Row]

It's hard to imagine any other use case than the checkpoint location which would justify having the concept of a stream manager which configures solely reader and writer (but not the transformer for example). In fact, the implementation of the CheckpointOffsetManager can hardly be reused for a different data source since the concept of starting offsets is tightly coupled to kafka. Moreover, for the reader config, the checkpoint location is only needed to determine the starting offsets, which is an implementation detail. Arguably, it may be surprising to the developer that the starting offsets are not set in the KafkaStreamReader but in the CheckpointOffsetManager
For all these reasons, I believe the extra indirection from having a StreamManager concept is not justified.

manager.checkpoint.base.location can be replaced by reader.kafka.checkpoint.base.location, writer.kafka.base.location and writer.parquet.base.location. Another possibility is to replace it by a "top-level" property spark.ingestor.checkpoint.base.location, since the checkpoint location is mandatory for every structured streaming query (see https://github.com/apache/spark/blob/695cb617d42507eded9c7e50bc7cd5333bbe6f83/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L259)

Decision
Replace manager.checkpoint.base.location by writer.common.checkpoint.base.location because the checkpoint location is a config property for the DataStreamWriter per the Spark documentation. It's least surprising to add it as a writer property here as well. Unfortunately, the reader will have to depend on this property as well.
Also, rename checkpoint.base.location to checkpoint.location, since the concept of a base location has been given up in #85

Migration for Trigger
Replace

manager.checkpoint.base.location

with

writer.common.checkpoint.location

Replace

"component.manager=za.co.absa.hyperdrive.ingestor.implementation.manager.checkpoint.CheckpointOffsetManager",

with
(empty string)

Write E2E test running in cluster mode

Some issues only occur in cluster mode and not in client mode, e.g. serialization issues. To catch these errors, a e2e test (probably using Docker) should be created that sets up a cluster and runs the ingestor in cluster mode.

Add Trigger.ProcessingTime to Writers

Currently, all jobs are ingested with Trigger.Once, i.e. all data is ingested into one parquet file (per kafka partition). Certain jobs may produce very large output files, leading to out of memory errors.

To prevent this, Trigger.ProcessingTime should be used.

New configuration property: writer.parquet.trigger
The expected value is the number of milliseconds.
If the value is not a number or the property is not present, all data should be ingested at once, as is the case now.

The change should be available for both ParquetStreamWriter and ParquetPartitioningStreamWriter
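
A sketch of how the writer could pick the trigger from an optional interval; the method and parameter names are hypothetical, only the Spark Trigger API is real:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

// If an interval is configured, run micro-batches at that interval; otherwise ingest everything at once.
def startQuery(df: DataFrame, triggerMillis: Option[Long], destination: String, checkpoint: String): StreamingQuery =
  df.writeStream
    .trigger(triggerMillis.map(Trigger.ProcessingTime(_)).getOrElse(Trigger.Once()))
    .format("parquet")
    .option("path", destination)
    .option("checkpointLocation", checkpoint)
    .start()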

Preserve nullability from Avro to Catalyst Schema

Currently, in a Kafka-to-Kafka (i.e. Avro -> Catalyst -> Avro) workflow (with columnselectortransformer), all fields are always nullable in the destination topic.
Example:
Source schema

{
  "type": "record",
  "name": "pageviews",
  "namespace": "ksql",
  "fields": [
    {
      "name": "viewtime",
      "type": "long"
    }
  ]
}

is written as

{
  "type": "record",
  "name": "pageviews",
  "namespace": "ksql",
  "fields": [
    {
      "name": "viewtime",
      "type": ["long", "null"]
    }
  ]
}

Expected: Non-nullable fields in the source Avro schema should also be non-nullable in the destination.
Nullable fields should stay nullable obviously.

Migration note
Making an existing nullable field non-nullable is a forward-compatible change (it's almost like adding a field)

Extract encoding part of KafkaStreamWriter to transformer component

KafkaStreamWriter should not be dependent on Abris (confluent and Avro)
Thus, the encoding part should be extracted to a preceding transformer component

A new transformer should be created, e.g. za.co.absa.hyperdrive.ingestor.implementation.transformer.avro.ConfluentAvroEncoderTransformer

Breaking changes
Configuration properties will need to be adjusted

How to migrate Hyperdrive-Trigger

  1. Replace
"component.writer

with

"component.transformer.id.2=confluent.avro.encoder", "component.transformer.class.confluent.avro.encoder=za.co.absa.hyperdrive.ingestor.implementation.transformer.avro.ConfluentAvroEncoderTransformer",
"component.writer
  2. Replace
writer.kafka.schema

with

transformer.confluent.avro.encoder.schema
  3. Replace
writer.kafka.value

with

transformer.confluent.avro.encoder.value
  4. Replace
writer.kafka.produce.keys

with

transformer.confluent.avro.encoder.produce.keys
  5. Replace
writer.kafka.key

with

transformer.confluent.avro.encoder.key
  6. Replace
writer.kafka.option

with

transformer.confluent.avro.encoder.option

NoClassDefFoundError AbstractKafkaAvroSerDeConfig

Running the ingestor fails as of 01a1ecd with the following error message

13:26:56.095 [main] INFO  za.co.absa.hyperdrive.ingestor.implementation.manager.factories.OffsetManagerAbstractFactory$ - Going to load factory for configuration 'component.manager'.
13:26:56.100 [main] INFO  za.co.absa.hyperdrive.ingestor.implementation.manager.checkpoint.CheckpointOffsetManager$ - Going to create CheckpointOffsetManager instance using: topic='clickstream', checkpoint base location='/tmp/checkpoint-location'
13:26:56.102 [main] INFO  za.co.absa.hyperdrive.ingestor.implementation.decoder.factories.StreamDecoderAbstractFactory$ - Going to load factory for configuration 'component.decoder'.
Exception in thread "main" java.lang.NoClassDefFoundError: io/confluent/kafka/serializers/AbstractKafkaAvroSerDeConfig
	at za.co.absa.hyperdrive.ingestor.implementation.decoder.avro.confluent.ConfluentAvroKafkaStreamDecoder$.apply(ConfluentAvroKafkaStreamDecoder.scala:72)
	at za.co.absa.hyperdrive.ingestor.implementation.decoder.factories.StreamDecoderAbstractFactory$.build(StreamDecoderAbstractFactory.scala:39)
	at za.co.absa.hyperdrive.driver.IngestionDriver.getStreamDecoder(IngestionDriver.scala:63)
	at za.co.absa.hyperdrive.driver.IngestionDriver.ingest(IngestionDriver.scala:48)
	at za.co.absa.hyperdrive.driver.drivers.PropertiesIngestionDriver$.main(PropertiesIngestionDriver.scala:49)
	at za.co.absa.hyperdrive.driver.drivers.PropertiesIngestionDriver.main(PropertiesIngestionDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 18 more
19/11/28 13:26:56 INFO SparkContext: Invoking stop() from shutdown hook

io/confluent/kafka/serializers/AbstractKafkaAvroSerDeConfig is part of kafka-avro-serializer and was added as a test dependency for PR #59

The line in which it fails is
PARAM_SCHEMA_REGISTRY_URL -> getOrThrow(KEY_SCHEMA_REGISTRY_URL, configuration, errorMessage = s"Schema Registry URL not specified. Is '$KEY_SCHEMA_REGISTRY_URL' configured?")
Oddly enough, there is no reference to AbstractKafkaAvroSerDeConfig in the whole project.

Publish key from kafka source as key to kafka sink / HyperdriveContext

Currently, the key from a kafka source is not ingested in ConfluentAvroKafkaStreamDecoder. Also, no key is produced to the kafka sink in KafkaStreamWriter. It should be possible to publish the ingested keys along with the value.

Consuming keys is supported by Abris like so

val result: DataFrame  = dataFrame.select(
    from_confluent_avro(col("key"), keyRegistryConfig) as 'key,
    from_confluent_avro(col("value"), valueRegistryConfig) as 'value)

Tasks

  • Add config properties for decoder

    • decoder.avro.consume.keys (true / false)
    • decoder.avro.key.schema.naming.strategy
    • decoder.avro.key.schema.id
    • decoder.avro.key.schema.record.name
    • decoder.avro.key.schema.record.namespace
  • Add config properties for writer

    • writer.kafka.produce.keys (true / false)
    • writer.kafka.key.schema.naming.strategy
    • writer.kafka.key.schema.record.name
    • writer.kafka.key.schema.record.namespace
  • Add context object (HyperdriveContext) to which a value can be put and retrieved by key

    • To avoid key collisions, used keys should be documented
    • ConfluentAvroKafkaStreamDecoder should put the key column names, KafkaStreamWriter should retrieve them

Preserve avro schema from Avro to Catalyst and from Catalyst to Avro

Background

Currently, the org.apache.spark.sql.avro.SchemaConverters class is used to derive an avro type from a catalyst type. However, in an Avro -> Catalyst -> Avro query, this conversion is lossy. For example, information such as the default value, documentation or any custom json field in the avro schema is lost when converting to catalyst, and e.g. the avro types BYTES and FIXED are both converted to the same catalyst type DecimalType (or BinaryType if no avro logical type is present).

For example, a BYTES type with logical type decimal

    "type" : {
      "type" : "bytes",
      "logicalType" : "decimal",
      "precision" : 8,
      "scale" : 3
    }

is converted to the Spark type DecimalType, which is in turn converted to the Avro type

    "type" : {
      "type" : "fixed",
      "name" : "fixed",
      "namespace" : "topLevelRecord.fieldName",
      "size" : 10,
      "logicalType" : "decimal",
      "precision" : 8,
      "scale" : 3
    }

Furthermore, default values in the source Avro schema are discarded in the Avro -> Catalyst conversion. This has two consequences: 1) Obviously, there's no way to generate a target Avro schema with that default value, and 2) a nullable type in avro is expressed as a union with null, where null is the first type in the union if and only if the default value is null. However, if the default value is unknown, there's no way to determine whether null should be the first or second type in the union.

For example, when a record schema with a default value

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "stringCol",
    "type" : [ "string", "null" ],
    "default" : "abcd"
  } ]
}

is converted to a StructType, the default value is lost

StructType(Seq(StructField("stringCol", StringType, nullable = true)))

and when the type is converted to an Avro type, the null type is in front of the string type in the union.

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "stringCol",
    "type" : [ "null", "string" ]
  } ]
}

Feature

There are two approaches.

  1. The source schema could be used as an input to generate the target schema. However, since the Spark schema can change due to any number of Spark transformations (renamings, column additions, etc.), there is no generic, straightforward way to map a field in the source schema to a field in the target schema. A heuristic approach would be needed to decide which fields of the input Avro schema correspond to which fields of the output Avro schema.

  2. The metadata object of Spark's StructField can be used to transport the information from the source Avro schema into the Spark schema and from there to the target Avro schema. This only works for Avro schemas whose root type is a record type; it doesn't work if the Avro schema is just a map or an array, for example, because these are mapped directly to a MapType or ArrayType, which have no metadata field, instead of being wrapped in a StructField. A minimal sketch of this approach follows below.
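
A minimal sketch of approach 2: the metadata key "avro.default" used here is an assumption for illustration, not an existing Spark or Hyperdrive convention.

import org.apache.spark.sql.types._

// Carry the Avro default value through StructField metadata.
val metadata = new MetadataBuilder()
  .putString("avro.default", "abcd") // assumed metadata key
  .build()

val sparkSchema = StructType(Seq(
  StructField("stringCol", StringType, nullable = true, metadata)))

// When deriving the target Avro schema, the default could be read back:
val default = sparkSchema("stringCol").metadata.getString("avro.default")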

Other

This issue only deals with default values and logical types. It may be extended to also support custom fields in the Avro schema in a separate issue.
See also #137
https://issues.apache.org/jira/browse/SPARK-28008

Create profile to run integration tests separately

Currently, no distinction is made among tests and all tests run with every build.

Now, a profile for integration tests should be created so that integration tests don't run with mvn clean test. The goal is to decrease the build time while still discovering errors early thanks to the unit tests.

For now, all tests extending SparkTestBase should be marked as integration tests. It's absolutely essential that integration tests still run on Jenkins builds!
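
One possible way to mark such tests, sketched here under the assumption that ScalaTest tags are used and that a Maven profile drives the scalatest-maven-plugin's include/exclude settings; the tag name and spec are hypothetical.

import org.scalatest.{FlatSpec, Tag}

// Hypothetical tag; a profile could exclude it from mvn clean test
// and include it in a dedicated integration-test run.
object IntegrationTestTag extends Tag("za.co.absa.hyperdrive.IntegrationTest")

class ExampleIntegrationSpec extends FlatSpec {
  "an ingestion" should "run end to end" taggedAs IntegrationTestTag in {
    assert(1 + 1 == 2) // placeholder body; real tests would extend SparkTestBase
  }
}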

Remove nulls

There are many null checks for arguments in the codebase, e.g.

  override def write(dataFrame: DataFrame, streamManager: StreamManager): StreamingQuery = {
    if (dataFrame == null) {
      throw new IllegalArgumentException("Null DataFrame.")
    }

Since null should not be used in Scala, an occurrence of null indicates a serious problem that cannot be meaningfully handled. Throwing an IllegalArgumentException instead of letting the program fail with a NullPointerException adds little information.

Task

  • Remove null checks
  • Wrap null values coming from Java code in Option (see the sketch below)
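
A minimal sketch of the second point, assuming Commons Configuration 2, whose getString returns null for missing keys.

import org.apache.commons.configuration2.BaseConfiguration

// Wrap a possibly-null value from a Java API in Option instead of null-checking.
val configuration = new BaseConfiguration()
configuration.setProperty("schema.registry.url", "http://localhost:8081") // example value

val schemaRegistryUrl: Option[String] =
  Option(configuration.getString("schema.registry.url"))

val url = schemaRegistryUrl.getOrElse(
  throw new IllegalArgumentException("Schema Registry URL not specified"))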

Identify component factories by identifier string

Description
Currently, component factories are loaded in ClassLoaderUtils given their fully qualified class names. The class name is passed via the configuration (e.g. component.writer).

That means components cannot be refactored (renamed, moved to a different package) without introducing a breaking change that would require updating any existing configuration that uses the component.

Tasks

  • Add a method getIdentifier: String to the interface ComponentFactory. getClass.getName may be used as the default value, so this feature won't be a breaking change.
  • Implementing components are responsible for providing a unique identifier. It's advisable to prefix the identifier with a human-readable name, because it will be referenced in the configuration, logged, etc.
  • Use getIdentifier to load the factory in ClassLoaderUtils. Currently, the class is loaded directly given its class name, which doesn't lend itself to loading a factory by identifier. With #83, component factories can be loaded using the Service Provider Interface (SPI), i.e. with ServiceLoader; since all factories expose the getIdentifier method, the matching factory can be found that way (see the sketch below).
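
A rough sketch of the identifier-based lookup via SPI; the trait shape is simplified here and not the actual Hyperdrive API.

import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Simplified factory trait with the proposed identifier method.
// Defaulting to the class name keeps the change non-breaking.
trait ComponentFactory {
  def getIdentifier: String = getClass.getName
}

// Resolve a factory by identifier using the Service Provider Interface.
def loadFactory(identifier: String): ComponentFactory =
  ServiceLoader.load(classOf[ComponentFactory]).asScala
    .find(_.getIdentifier == identifier)
    .getOrElse(throw new IllegalArgumentException(s"No factory found for identifier '$identifier'"))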

Other

  • The same identifier might be used by each component to prefix its configuration properties to avoid name clashes.

Support for multiple transformers

Currently, only one transformer can be specified.

It's likely that a future use-case will require multiple chainable transformers.

The configuration parameter component.transformer could take a comma-separated list instead of just one value, where the order of the list specifies the order of execution. However, that wouldn't allow the same transformer to be used multiple times.

Components should be configured like this:

component.transformer.id.1=[id]
component.transformer.class.[id]=za.co.absa.hyperdrive.ingestor.my.transformer
transformer.[id].property.a = "value a"
transformer.[id].property.b = "value b"

component.transformer.id.2=csst
component.transformer.class.csst=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer
transformer.csst.columns.to.select=*

component.transformer.id.3=csst2
component.transformer.class.csst2=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer
transformer.csst2.columns.to.select="special_column"

Why the prefixes component.transformer.class and transformer? This prevents name conflicts.

At runtime, transformers would only receive their specific config subset, i.e. in the above example, the transformer with id [id] gets property.a => "value a", property.b => "value b" in its transform method instead of the full configuration as it does now. If cross-component configuration is necessary, the HyperdriveContext may be utilized.
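
A minimal sketch of the subsetting, assuming Commons Configuration 2's subset method is what delivers the per-transformer view.

import org.apache.commons.configuration2.BaseConfiguration

// Each transformer would only see its own keys, stripped of the id prefix.
val configuration = new BaseConfiguration()
configuration.setProperty("transformer.csst.columns.to.select", "*")
configuration.setProperty("transformer.csst2.columns.to.select", "special_column")

val csstConfig = configuration.subset("transformer.csst")
// csstConfig.getString("columns.to.select") == "*"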

The order of the transformers is determined by the number after component.transformer.id. An error is thrown if it's not an integer. The number may be negative, and order numbers need not be consecutive, i.e. no error is thrown if one transformer has component.transformer.id.2 and the other component.transformer.id.-1.

Tasks

  • StreamTransformerAbstractFactory.build should return a list of transformers.
  • build should call the apply method of the companion object only with a configuration subset using the id (e.g. csst in the above example)
  • SparkIngestor.ingest should accept a list of transformers and loop through them (fold; see the sketch after this list)
  • Update tests
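
A minimal sketch of the fold over transformers; the transform signature is assumed here for illustration.

import org.apache.spark.sql.DataFrame

// Assumed transformer interface for the sketch.
trait StreamTransformer {
  def transform(dataFrame: DataFrame): DataFrame
}

// Chain the transformers in order; an empty list returns the dataframe unchanged.
def applyTransformers(initial: DataFrame, transformers: Seq[StreamTransformer]): DataFrame =
  transformers.foldLeft(initial)((df, transformer) => transformer.transform(df))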

Note

  • As a by-product, transformers will be optional. If no transformer is specified, the list of transformers will be empty, thus the dataframe will directly be passed from the decoder (reader from v4.0.0) to the writer.

How to migrate
ColumnSelectorStreamTransformer

  • ConfigurationsKeys.ColumnSelectorStreamTransformerKeys.KEY_COLUMNS_TO_SELECT should be "columns.to.select"
  • Existing configuration in the Trigger DB
  1. Replace
"component.transformer=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer"

with

"component.transformer.id.1=column.selector", "component.transformer.class.column.selector=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer"
  2. Replace
transformer.columns.to.select=

with

transformer.column.selector.columns.to.select=

Alternatively, the whole column selector transformation config can be removed if all of the jobs only use select all.

HyperConformance

  • In za.co.absa.enceladus.conformance.HyperConformanceAttributes, search and replace s"$keysPrefix. with "
  • Existing configuration in the Trigger DB
    Replace
"component.transformer=za.co.absa.enceladus.conformance.HyperConformance"

with

"component.transformer.id.1=hyperconformance","component.transformer.class.hyperconformance=za.co.absa.enceladus.conformance.HyperConformance"

Transformer-specific configuration already happens to be correct and does not need to be changed.

Remove Finalizer from API

The Finalizer component was added to the API in 3e442c0. Its intended use-case was to copy the ingested data to another folder (raw / publish). However, this has been solved differently using a separate jar.
Moreover, the Finalizer may break the exactly-once fault-tolerance of the ingestor as a whole.
Since the Finalizer was not part of the API in version 1.0.0, removing the finalizer does not break backward-compatibility.

Make component constructors private

Currently, the components have public constructors, but they are only used from the companion objects. The constructors should be made private; the tests for the class and the companion object can then also be merged.
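
A minimal sketch of the intended pattern; the names here are illustrative, not the actual components.

// A private constructor forces instantiation through the companion object.
class ExampleStreamWriter private (destination: String) {
  def describe(): String = s"writes to $destination"
}

object ExampleStreamWriter {
  def apply(configuration: Map[String, String]): ExampleStreamWriter =
    new ExampleStreamWriter(configuration("writer.parquet.destination.directory"))
}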

Files

  • ConfluentAvroKafkaStreamDecoder
  • TestConfluentAvroKafkaStreamDecoder
  • TestConfluentAvroKafkaStreamDecoderObject
  • CheckpointOffsetManager
  • etc.

List delimiter does not work for CommandLineIngestor

Using the CommandLineIngestor, a comma-delimited configuration value cannot be extracted as an array using getStringArray, because no list delimiter handler is set for the configuration (as opposed to the PropertiesIngestionDriver).

Fix it.
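
A minimal sketch of the missing piece, assuming Commons Configuration 2 is the configuration library in question.

import org.apache.commons.configuration2.BaseConfiguration
import org.apache.commons.configuration2.convert.DefaultListDelimiterHandler

// Setting a list delimiter handler lets getStringArray split comma-separated values.
val configuration = new BaseConfiguration()
configuration.setListDelimiterHandler(new DefaultListDelimiterHandler(','))
configuration.setProperty("writer.parquet.partitionby", "year,month,day")

val columns: Array[String] = configuration.getStringArray("writer.parquet.partitionby")
// columns == Array("year", "month", "day")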

Refactor TempDir

za.co.absa.hyperdrive.testutils.TempDir may be refactored easily with Files.createTempDir.

Consider using commons.io.TempDirectory from absa-commons instead of TempDir and Files.createTempDir.

Move dependencies to child poms

Currently, many dependencies are defined in the parent pom even though they are only used in one module. That unnecessarily bloats the jars of the other modules. Moreover, it's hard to track which module really needs a dependency.

Task

  • Dependencies should be declared in the child poms, unless the dependency is used throughout all submodules
  • Dependency management (version and scope) should still be done at the parent level.

ParquetStreamWriter shouldn't write if metadata is inconsistent

Problem description
When reading, Spark does not consider the metadata log if you read with a globbed path (e.g. /root-dir/*) or from a partitioned sub-directory (e.g. /root-dir/partition1=value1). Downstream applications are therefore at risk of reading duplicated values in case of application failures and restarts.
Two cases for inconsistent metadata logs can be distinguished

  1. Metadata log contains files that are not on the filesystem: Most likely, parquet files have been deleted / moved manually.
  2. Parquet files are present which are not in metadata log: Most likely, this is due to a previous partial write. The parquet files should be removed.

In case 1), Spark will throw a FileNotFoundException in the next write. However, in case 2), Spark does not throw any exception, because this case is not an error from Spark's perspective.

Proposed solution

  • In the ParquetStreamWriter, the metadata log should be inspected and compared with the filesystem before writing. If it is inconsistent, a warning should be logged listing the files to be deleted (a rough sketch of the check follows below).
  • No automatic cleanup is considered, because partial writes are assumed to occur only rarely (even more rarely with https://issues.apache.org/jira/browse/SPARK-27210) and more importantly, automatic cleanups could result in inadvertent deletions if the metadata log has been tampered with.
  • An option to skip the check should be added, in case the user knows what they are doing.

This solution guarantees deduplicated reads for globbed paths and partitioned subdirectory reads, but doesn't guarantee atomicity, i.e. partial writes will be visible to downstream applications, but they will not be duplicated by subsequent writes.
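
A rough sketch of the check for case 2; the real _spark_metadata log is JSON-based and would need proper parsing, so the committed file set is taken as an input here.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Case 2 from above: parquet files on the filesystem that are not in the metadata log.
def findUncommittedFiles(destinationDir: String, filesInMetadataLog: Set[String]): Set[String] = {
  val fs = FileSystem.get(new Configuration())
  val filesOnFilesystem = fs.listStatus(new Path(destinationDir))
    .map(_.getPath.toString)
    .filter(_.endsWith(".parquet"))
    .toSet
  filesOnFilesystem.diff(filesInMetadataLog)
}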

Other

  • After #82 has been implemented, the functionality could be extracted to a transformer instead of having a switch on/off flag for the ParquetStreamWriter

OffsetManager should not expect topic as argument

Description
Currently, the base class OffsetManager expects a topic as a constructor argument. This is too specific to the use case of the CheckpointOffsetManager, which currently requires the kafka topic (to be resolved in #85). In fact, this component does not need to manage offsets, but may manage anything that concerns both DataStreamReader and DataStreamWriter.

Tasks

  • OffsetManager should not require topic as a constructor argument.
  • OffsetManager should be renamed to StreamManager
  • The method OffsetManager.configureOffsets should be renamed to configure
  • Update archetype
  • Update readmes

How to migrate

  • Components implementing OffsetManager need to import StreamManager, rename configureOffsets to configure and remove the constructor argument.
  • No configuration properties need to be changed.

Property manager.checkpoint.base.location should contain complete path to checkpoint-location

Description
Currently, the checkpoint directory for a workflow is created as ${manager.checkpoint.base.location}/${reader.kafka.topic}. This makes no sense if the reader is not a kafka reader, but is e.g. reading from a JDBC source.

Now, the ingestor should assume that the complete path is stored in ${manager.checkpoint.base.location}. A complete path is also used in ${writer.parquet.destination.directory}.

The user has to make sure that the checkpoint path is unique among workflows.

Consequences
Currently, an exception is thrown if manager.checkpoint.base.location does not exist. It's impossible to keep this behavior, because startingOffsets is set to earliest if the resolved checkpoint location (base dir + topic) does not exist. After this PR, if manager.checkpoint.base.location is empty, no exception should be thrown; instead, startingOffsets should be set to earliest.
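
A minimal sketch of the proposed behaviour; the helper name is assumed.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.streaming.DataStreamReader

// Only fall back to earliest when the checkpoint location does not exist yet.
def configureStartingOffsets(reader: DataStreamReader, checkpointLocation: String): DataStreamReader = {
  val fs = FileSystem.get(new Configuration())
  if (fs.exists(new Path(checkpointLocation))) reader
  else reader.option("startingOffsets", "earliest")
}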

How to migrate
For all existing workflows, the property manager.checkpoint.base.location has to be appended with /${reader.kafka.topic}

Rename prefix for additional properties for KafkaStreamReader

Currently, additional properties for the KafkaStreamReader have to be specified with the prefix reader.options, e.g. reader.options.kafka.security.protocol or reader.options.kafka.ssl.key.password.

This prefix is inconsistent with all other properties which start with decoder.avro., writer.parquet., manager.checkpoint or transformer.columns.. The properties for the KafkaStreamReader should start with reader.kafka., i.e. reader.kafka.options.kafka.security.protocol.

For additional properties like reader.options.failOnDataLoss or reader.options.minPartitions, it's hard to tell which reader implementation they belong to.

Tasks

  • Change prefix for additional properties for KafkaStreamReader to reader.kafka.options.

How to migrate

  • All property keys starting with reader.options have to be replaced by reader.kafka.options

StreamWriter should not require destination as constructor argument

Description
Currently, the StreamWriter requires a destination as a constructor argument. However, writing to Kafka, for example, does not require a destination directory, but rather a topic.

Tasks

  • Remove the constructor argument destination from StreamWriter
  • Remove the default implementation of StreamWriter.getDestination and implement it in the derived classes.
  • Update archetype

Consequences

  • Currently, the destination directory is removed if the ingestion fails and the directory was empty before the ingestion. This functionality assumes that the getDestination method of the StreamWriter returns a path on HDFS. However, this cannot be guaranteed: getDestination may return any string, and it can't be assumed to signify a path.
    The functionality is not very useful in practice, since the destination folder is empty only at the very first ingestion. Therefore, this functionality will be removed.

How to migrate

  • External components implementing StreamWriter need to remove the destination argument from the call to the superclass constructor, i.e. replace extends StreamWriter(destination) with extends StreamWriter
  • No configuration properties need to be changed

Use prefix writer.parquet.option for extra configuration to the parquet writers

Currently, extra configuration to the parquet writers needs to be passed like this:

writer.parquet.extra.conf.1=key=value

key=value is split at the = sign, which is very confusing and unexpected.
Extra configuration should instead be added with a prefix:
writer.parquet.option.key=value

Unfortunately, this is inconsistent with reader.option.key=value, but it's consistent with all other configuration properties, which include an identifier for the component. Arguably, reader.option.key=value should be changed to reader.kafka.option.key=value, even though this results in properties like reader.kafka.option.kafka.security.protocol.
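
A minimal sketch of reading options under the proposed prefix, assuming Commons Configuration 2; the example option is illustrative.

import org.apache.commons.configuration2.BaseConfiguration
import scala.collection.JavaConverters._

// Collect all keys under writer.parquet.option. and strip the prefix.
val prefix = "writer.parquet.option"
val configuration = new BaseConfiguration()
configuration.setProperty(s"$prefix.compression", "snappy")

val extraOptions: Map[String, String] = configuration.getKeys(prefix).asScala
  .map(key => key.stripPrefix(s"$prefix.") -> configuration.getString(key))
  .toMap
// extraOptions == Map("compression" -> "snappy")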

Pass an instance of HyperdriveContext instead of accessing singleton object

In #114, HyperdriveContext was introduced to maintain state across components. To avoid a breaking change, it was introduced as a singleton object. However, this requires the state to be stored in a var or a mutable map.

Now, HyperdriveContext should be a class that is passed to the components (in the read, transform and write methods, etc.). Then, the state map can be an immutable val.
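
A minimal sketch of what the class could look like; the shape and the context key are assumed, not the actual implementation.

// An immutable context instance that components receive and return.
case class HyperdriveContext(values: Map[String, Any] = Map.empty) {
  def put(key: String, value: Any): HyperdriveContext = copy(values = values + (key -> value))
  def get[T](key: String): Option[T] = values.get(key).map(_.asInstanceOf[T])
}

// Usage: e.g. the decoder stores the key column names, the writer retrieves them.
val context = HyperdriveContext().put("kafka.key.columns", Seq("id"))
val keyColumns = context.get[Seq[String]]("kafka.key.columns")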

AllNullableParquetStreamWriter. AnalysisException: Queries with streaming sources must be executed with writeStream.start()

Hi @felipemmelo

I got this error when trying to ingest data with the AllNullableParquetStreamWriter

Exception in thread "main" za.co.absa.hyperdrive.shared.exceptions.IngestionStartException: NOT STARTED ingestion b4d8fd3d-43ec-4802-86b5-112d1aa62fb7. This exception was thrown during the starting of the ingestion job. Check the logs for details.
	at za.co.absa.hyperdrive.driver.SparkIngestor$.ingest(SparkIngestor.scala:104)
	at za.co.absa.hyperdrive.driver.IngestionDriver.ingest(IngestionDriver.scala:51)
	at za.co.absa.hyperdrive.driver.drivers.PropertiesIngestionDriver$.main(PropertiesIngestionDriver.scala:50)
	at za.co.absa.hyperdrive.driver.drivers.PropertiesIngestionDriver.main(PropertiesIngestionDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
	at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:389)
	at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:38)
	at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:36)
	at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:51)
	at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:62)
	at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:60)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3037)
	at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3035)
	at za.co.absa.hyperdrive.shared.utils.SparkUtils$.setAllColumnsNullable(SparkUtils.scala:26)
	at za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.AllNullableParquetStreamWriter.write(AllNullableParquetStreamWriter.scala:37)
	at za.co.absa.hyperdrive.driver.SparkIngestor$.ingest(SparkIngestor.scala:101)
	... 15 more
19/09/13 16:11:35 INFO SparkContext: Invoking stop() from shutdown hook

Ingestion.properties.template.txt

stacktrace.txt

I think the problem is that a new DataFrame is created in SparkUtils.setAllColumnsNullable instead of reusing the old one: the stack trace shows that setAllColumnsNullable calls Dataset.rdd, which is not allowed on a streaming DataFrame and triggers the batch-query check. We then have two dataframes, one associated with the reader and the other with the writer.

Publish all modules

Currently, the modules driver, ingestor-default, shared and testutils are not published to Maven, because their API should not be public and backwards-compatibility within a major version is not guaranteed.
The disadvantage is that the main executable jar is not published on Maven either. However, the main jar should be available for download without users having to build it themselves.

Both goals can be achieved by publishing all modules to Maven Central, but making all classes in driver, ingestor-default, shared and testutils package-private to za.co.absa.hyperdrive, which effectively marks them as a private API. Users may download the main jar and execute it as is, but will still be prevented from using the code (unless they create a package with the same name).

Tasks

  • From the mentioned modules, remove the following from the pom files
            <plugin>
                <groupId>org.sonatype.plugins</groupId>
                <artifactId>nexus-staging-maven-plugin</artifactId>
                <version>${nexus.staging.plugin.version}</version>
                <configuration>
                    <skipNexusStagingDeployMojo>${skip.internal.modules.deployment}</skipNexusStagingDeployMojo>
                </configuration>
            </plugin>
  • Make all classes in the mentioned modules package private to za.co.absa.hyperdrive

Add generic partitioning option to ParquetStreamWriter

Currently, it's not possible to write a dataframe partitioned by arbitrary columns. This feature should be added in this PR.

Tasks

  • Add a configuration option writer.parquet.partitionby. It should accept a comma-separated list of column names
  • If present, the AbstractParquetStreamWriter should call .partitionBy on the DataStreamWriter (see the sketch after this list)
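
A minimal sketch of the second task; the helper name is illustrative.

import org.apache.spark.sql.Row
import org.apache.spark.sql.streaming.DataStreamWriter

// Apply .partitionBy only when partition columns are configured.
def applyPartitioning(writer: DataStreamWriter[Row], partitionColumns: Seq[String]): DataStreamWriter[Row] =
  if (partitionColumns.nonEmpty) writer.partitionBy(partitionColumns: _*) else writer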

Related info
The ParquetPartitioningStreamWriter writes a dataframe partitioned by the current date and an auto-incrementing version number. However, this is very specialized logic that deserves a dedicated component with a less general name (maybe rename in a separate PR). In fact, with this PR, the ParquetPartitioningStreamWriter could be rewritten as a transformer, since it mainly adds two columns.

Create end to end test

Currently, there is no end-to-end test that covers the whole pipeline from the KafkaStreamReader to the parquet stream writers.

A test should be written that covers the whole pipeline.
