
hyperdrive's Introduction

Copyright 2018 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Hyperdrive - An extensible streaming ingestion pipeline on top of Apache Spark


What is Hyperdrive?

Hyperdrive is a configurable and scalable ingestion platform that allows data movement and transformation from streaming sources with exactly-once fault-tolerance semantics by using Apache Spark Structured Streaming.

In Hyperdrive, each ingestion is defined by three components: reader, transformer and writer. This separation allows adapting to different streaming sources and sinks, while reusing transformations that are common across multiple ingestion pipelines.

Motivation

Similar to batch processing, data ingestion pipelines are needed to process streaming data sources. While solutions for data pipelines exist, exactly-once fault-tolerance in streaming processing is an intricate problem and cannot be solved with the same strategies that exist for batch processing.

This is the gap Hyperdrive aims to fill, by leveraging the exactly-once guarantee of Spark's Structured Streaming and by providing a flexible data pipeline.

Architecture

The data ingestion pipeline of Hyperdrive consists of three components: readers, transformers and writers.

  • Readers define how to connect to sources, e.g. how to connect to Kafka in a secure cluster by providing security directives, which topic and brokers to connect to.
  • Transformers define transformations to be applied to the decoded DataFrame, e.g. dropping columns.
  • Writers define where DataFrames should be sent after the transformations, e.g. into HDFS as Parquet files.

Built-in components

  • KafkaStreamReader - reads from a Kafka topic.
  • ParquetStreamReader - reads Parquet files from a source directory.
  • ConfluentAvroDecodingTransformer - decodes the payload as Confluent Avro (through ABRiS), retrieving the schema from the specified Schema Registry. This transformer is capable of seamlessly handling whatever schemas the payload messages are using.
  • ConfluentAvroEncodingTransformer - encodes the payload as Confluent Avro (through ABRiS), updating the schema to the specified Schema Registry. This transformer is capable of seamlessly handling whatever schema the dataframe is using.
  • ColumnSelectorStreamTransformer - selects all columns from the decoded DataFrame.
  • AddDateVersionTransformer - adds columns for the ingestion date and an auto-incremented version number, to be used for partitioning.
  • ParquetStreamWriter - writes the DataFrame as Parquet, in append mode.
  • KafkaStreamWriter - writes to a Kafka topic.
  • DeltaCDCToSnapshotWriter - writes the DataFrame in Delta format. It expects CDC events, performs merge logic and creates the latest snapshot table.
  • DeltaCDCToSCD2Writer - writes the DataFrame in Delta format. It expects CDC events, performs merge logic and creates an SCD2 table.
  • HudiCDCToSCD2Writer - writes the DataFrame in Hudi format. It expects CDC events, performs merge logic and creates an SCD2 table.

Custom components

Custom components can be implemented using the Component Archetype, following the API defined in the package za.co.absa.hyperdrive.ingestor.api.

  • A custom component has to be a class which extends one of the abstract classes StreamReader, StreamTransformer or StreamWriter
  • The class needs to have a companion object which implements the corresponding trait StreamReaderFactory, StreamTransformerFactory or StreamWriterFactory
  • The implemented components have to be packaged to a jar file, which can then be added to the classpath of the driver. To use a component, it has to be configured as described under Usage

After that, the new component can be seamlessly invoked from the driver.
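
As an illustration, a custom transformer might look roughly like the sketch below. The subpackage (transformer), method names and the factory signature are assumptions based on the description above; in current versions the factory also has to implement HasComponentAttributes (see the issues further down), which is omitted here for brevity. The Component Archetype should be treated as the authoritative starting point.

import org.apache.commons.configuration2.Configuration
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.upper
import za.co.absa.hyperdrive.ingestor.api.transformer.{StreamTransformer, StreamTransformerFactory}

// Hypothetical transformer that upper-cases a configurable column.
class UppercaseColumnTransformer(val columnName: String) extends StreamTransformer {
  override def transform(dataFrame: DataFrame): DataFrame =
    dataFrame.withColumn(columnName, upper(dataFrame(columnName)))
}

// Companion object acting as the factory, instantiated by the driver from the configuration.
object UppercaseColumnTransformer extends StreamTransformerFactory {
  override def apply(config: Configuration): StreamTransformer =
    new UppercaseColumnTransformer(config.getString("transformer.uppercase.column"))
}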

Usage

Hyperdrive has to be executed with Spark. Due to Spark-Kafka integration issues, it will only work with Spark 2.3 and higher.

How to run

git clone git@github.com:AbsaOSS/hyperdrive.git
mvn clean package

Given that a configuration file has already been created, Hyperdrive can be executed as follows:

spark-submit --class za.co.absa.hyperdrive.driver.drivers.PropertiesIngestionDriver driver/target/driver*.jar config.properties

Alternatively, configuration properties can also be passed as command-line arguments:

spark-submit --class za.co.absa.hyperdrive.driver.drivers.CommandLineIngestionDriver driver/target/driver*.jar \
  component.ingestor=spark \
  component.reader=za.co.absa.hyperdrive.ingestor.implementation.reader.kafka.KafkaStreamReader \
  # more properties ...

Configuration

The configuration file may be created from the template located at driver/src/resources/Ingestion.properties.template.

CommandLineIngestionDriverDockerTest may be consulted for a working pipeline configuration.
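
For orientation, a minimal Kafka-to-Parquet pipeline configuration could look like the following sketch; topic, broker and path values are placeholders, and in practice a decoding transformer would usually precede the column selector.

component.ingestor=spark
component.reader=za.co.absa.hyperdrive.ingestor.implementation.reader.kafka.KafkaStreamReader
component.transformer.id.1=column.selector
component.transformer.class.column.selector=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer
component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetStreamWriter
reader.kafka.topic=my-topic
reader.kafka.brokers=broker1:9092,broker2:9092
transformer.column.selector.columns.to.select=*
writer.parquet.destination.directory=/tmp/destination
writer.common.checkpoint.location=/tmp/checkpoint/my-topic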

General settings

Pipeline settings
Property Name Required Description
component.ingestor Yes Defines the ingestion pipeline. Only spark is currently supported.
component.reader Yes Fully qualified name of reader component, e.g.za.co.absa.hyperdrive.ingestor.implementation.reader.kafka.KafkaStreamReader
component.transformer.id.{order} No An arbitrary but unique string, referenced in this documentation as {transformer-id}
component.transformer.class.{transformer-id} No Fully qualified name of transformer component, e.g. za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer
component.writer Yes Fully qualified name of writer component, e.g. za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetStreamWriter

Multiple transformers can be configured in the pipeline, including multiple instances of the same transformer. For each transformer instance, component.transformer.id.{order} and component.transformer.class.{transformer-id} have to be specified, where {order} and {transformer-id} need to be unique. In the above table, {order} must be an integer and may be negative. {transformer-id} is only used within the configuration to identify which configuration options belong to a certain transformer instance.
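
For example, two transformer instances could be configured like this (the ids and order values are arbitrary, and the column names are placeholders):

component.transformer.id.1=column.copy
component.transformer.class.column.copy=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.copy.ColumnCopyStreamTransformer
component.transformer.id.2=column.selector
component.transformer.class.column.selector=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer
transformer.column.copy.columns.copy.from=column1.fieldA
transformer.column.copy.columns.copy.to=newColumn.col1_fieldA
transformer.column.selector.columns.to.select=*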

Spark settings
Property Name Required Description
ingestor.spark.termination.method No Either processAllAvailable (stop query when no more messages are incoming) or awaitTermination (stop query on signal, e.g. Ctrl-C). Default: awaitTermination. See also Combination of trigger and termination method
ingestor.spark.await.termination.timeout No Timeout in milliseconds. Stops query when timeout is reached. This option is only valid with termination method awaitTermination

Settings for built-in components

KafkaStreamReader
Property Name Required Description
reader.kafka.topic Yes The name of the kafka topic to ingest data from. Equivalent to Spark property subscribe
reader.kafka.brokers Yes List of kafka broker URLs. Equivalent to Spark property kafka.bootstrap.servers

Any additional properties for kafka can be added with the prefix reader.option.. E.g. the property kafka.security.protocol can be added as reader.option.kafka.security.protocol

See e.g. the Structured Streaming + Kafka Integration Guide for optional kafka properties.
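
A reader configuration might look as follows; topic and broker values are placeholders, and the last two lines merely illustrate the reader.option. mechanism:

reader.kafka.topic=my-topic
reader.kafka.brokers=broker1:9092,broker2:9092
reader.option.kafka.security.protocol=SASL_SSL
reader.option.maxOffsetsPerTrigger=100000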

ParquetStreamReader

The parquet stream reader infers the schema from parquet files that already exist in the source directory. If no file exists, the reader will fail.

Property Name Required Description
reader.parquet.source.directory Yes Source path for the parquet files. Equivalent to Spark property path for the DataStreamReader

Any additional properties can be added with the prefix reader.parquet.options.. See Spark Structured Streaming Documentation
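
For example (the source path is a placeholder, and maxFilesPerTrigger is just one of Spark's file-source options):

reader.parquet.source.directory=/path/to/source
reader.parquet.options.maxFilesPerTrigger=10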

ConfluentAvroDecodingTransformer

The ConfluentAvroDecodingTransformer is built on ABRiS. More details about the configuration properties can be found there. Caution: The ConfluentAvroDecodingTransformer requires the property reader.kafka.topic to be set.

Property Name Required Description
transformer.{transformer-id}.schema.registry.url Yes URL of Schema Registry, e.g. http://localhost:8081. Equivalent to ABRiS property SchemaManager.PARAM_SCHEMA_REGISTRY_URL
transformer.{transformer-id}.value.schema.id Yes The schema id. Use latest or explicitly provide a number. Equivalent to ABRiS property SchemaManager.PARAM_VALUE_SCHEMA_ID
transformer.{transformer-id}.value.schema.naming.strategy Yes Subject name strategy of Schema Registry. Possible values are topic.name, record.name or topic.record.name. Equivalent to ABRiS property SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY
transformer.{transformer-id}.value.schema.record.name Yes for naming strategies record.name and topic.record.name Name of the record. Equivalent to ABRiS property SchemaManager.PARAM_SCHEMA_NAME_FOR_RECORD_STRATEGY
transformer.{transformer-id}.value.schema.record.namespace Yes for naming strategies record.name and topic.record.name Namespace of the record. Equivalent to ABRiS property SchemaManager.PARAM_SCHEMA_NAMESPACE_FOR_RECORD_STRATEGY
transformer.{transformer-id}.consume.keys No If set to true, keys will be consumed and added as columns to the dataframe. Key columns will be prefixed with key__
transformer.{transformer-id}.key.schema.id Yes if consume.keys is true The schema id for the key.
transformer.{transformer-id}.key.schema.naming.strategy Yes if consume.keys is true Subject name strategy for key
transformer.{transformer-id}.key.schema.record.name Yes for key naming strategies record.name and topic.record.name Name of the record.
transformer.{transformer-id}.key.schema.record.namespace Yes for key naming strategies record.name and topic.record.name Namespace of the record.
transformer.{transformer-id}.keep.columns No Comma-separated list of columns to keep (e.g. offset, partition)
transformer.{transformer-id}.disable.nullability.preservation No Set to true to ignore fix #137 and to keep the same behaviour as for versions prior to and including v3.2.2. Default value: false
transformer.{transformer-id}.schema.registry.basic.auth.user.info.file No A path to a text file that contains one line in the form <username>:<password>. It will be passed as basic.auth.user.info to the schema registry config
transformer.{transformer-id}.use.advanced.schema.conversion No Set to true to convert the avro schema using AdvancedAvroToSparkConverter, which puts default value and underlying avro type to struct field metadata. Default false

For detailed information on the subject name strategy, please take a look at the Schema Registry Documentation.

Any additional properties for the schema registry config can be added with the prefix transformer.{transformer-id}.schema.registry.options.

Note: use.advanced.schema.conversion only works with a patched version of Spark, due to bug SPARK-34805. For the latest version of Spark, the patch is available in apache/spark#35270. For other versions of Spark, the changes need to be cherry-picked and built locally.
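
For a transformer registered under the id avro.decoder (see Pipeline settings for the class registration), a decoder configuration might look like this; the Schema Registry URL is a placeholder:

transformer.avro.decoder.schema.registry.url=http://localhost:8081
transformer.avro.decoder.value.schema.id=latest
transformer.avro.decoder.value.schema.naming.strategy=topic.name
transformer.avro.decoder.consume.keys=true
transformer.avro.decoder.key.schema.id=latest
transformer.avro.decoder.key.schema.naming.strategy=topic.name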

ConfluentAvroEncodingTransformer

The ConfluentAvroEncodingTransformer is built on ABRiS. More details about the configuration properties can be found there. Caution: The ConfluentAvroEncodingTransformer requires the property writer.kafka.topic to be set.

Property Name Required Description
transformer.{transformer-id}.schema.registry.url Yes URL of Schema Registry, e.g. http://localhost:8081. Equivalent to ABRiS property SchemaManager.PARAM_SCHEMA_REGISTRY_URL
transformer.{transformer-id}.value.schema.naming.strategy Yes Subject name strategy of Schema Registry. Possible values are topic.name, record.name or topic.record.name. Equivalent to ABRiS property SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY
transformer.{transformer-id}.value.schema.record.name Yes for naming strategies record.name and topic.record.name Name of the record. Equivalent to ABRiS property SchemaManager.PARAM_SCHEMA_NAME_FOR_RECORD_STRATEGY
transformer.{transformer-id}.value.schema.record.namespace Yes for naming strategies record.name and topic.record.name Namespace of the record. Equivalent to ABRiS property SchemaManager.PARAM_SCHEMA_NAMESPACE_FOR_RECORD_STRATEGY
transformer.{transformer-id}.value.optional.fields No Comma-separated list of nullable value columns that should get default value null in the avro schema. Nested columns' names should be concatenated with the dot (.)
transformer.{transformer-id}.produce.keys No If set to true, keys will be produced according to the properties key.column.prefix and key.column.names of the Hyperdrive Context
transformer.{transformer-id}.key.schema.naming.strategy Yes if produce.keys is true Subject name strategy for key
transformer.{transformer-id}.key.schema.record.name Yes for key naming strategies record.name and topic.record.name Name of the record.
transformer.{transformer-id}.key.schema.record.namespace Yes for key naming strategies record.name and topic.record.name Namespace of the record.
transformer.{transformer-id}.key.optional.fields No Comma-separated list of nullable key columns that should get default value null in the avro schema. Nested columns' names should be concatenated with the dot (.)
transformer.{transformer-id}.schema.registry.basic.auth.user.info.file No A path to a text file that contains one line in the form <username>:<password>. It will be passed as basic.auth.user.info to the schema registry config
transformer.{transformer-id}.use.advanced.schema.conversion No Set to true to convert the avro schema using AdvancedSparkToAvroConverter, which reads default value and underlying avro type from struct field metadata. Default false

Any additional properties for the schema registry config can be added with the prefix transformer.{transformer-id}.schema.registry.options.
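
Similarly, an encoder registered under the id avro.encoder might be configured as follows; the values are illustrative only:

transformer.avro.encoder.schema.registry.url=http://localhost:8081
transformer.avro.encoder.value.schema.naming.strategy=topic.name
transformer.avro.encoder.produce.keys=true
transformer.avro.encoder.key.schema.naming.strategy=topic.name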

ColumnSelectorStreamTransformer
Property Name Required Description
transformer.{transformer-id}.columns.to.select Yes Comma-separated list of columns to select. * can be used to select all columns. Only existing columns may be selected by name (i.e. expressions cannot be constructed)
AddDateVersionTransformer

The AddDateVersionTransformer adds the columns hyperdrive_date and hyperdrive_version. hyperdrive_date is the ingestion date (or a user-defined date), while hyperdrive_version is a number automatically incremented with every ingestion, starting at 1. For the auto-increment to work, hyperdrive_date and hyperdrive_version need to be defined as partition columns. Caution: This transformer requires a writer which defines writer.parquet.destination.directory.

Property Name Required Description
transformer.{transformer-id}.report.date No User-defined date for hyperdrive_date in format yyyy-MM-dd. Default date is the date of the ingestion
ColumnRenamingStreamTransformer

ColumnRenamingStreamTransformer allows renaming of columns specified in the configuration.

To add the transformer to the pipeline use this class name:

component.transformer.class.{transformer-id} = za.co.absa.hyperdrive.ingestor.implementation.transformer.column.renaming.ColumnRenamingStreamTransformer
Property Name Required Description
transformer.{transformer-id}.columns.rename.from Yes A comma-separated list of columns to rename. For example, column1, column2.
transformer.{transformer-id}.columns.rename.to Yes A comma-separated list of new column names. For example, column1_new, column2_new.
ColumnCopyStreamTransformer

ColumnCopyStreamTransformer allows copying of columns specified in the configuration. Dots in column names are interpreted as nested structs, unless they are surrounded by backticks (same as Spark convention)

Note that usage of the star-operator * within column names is not supported and may lead to unexpected behaviour.

To add the transformer to the pipeline use this class name:

component.transformer.class.{transformer-id} = za.co.absa.hyperdrive.ingestor.implementation.transformer.column.copy.ColumnCopyStreamTransformer
Property Name Required Description
transformer.{transformer-id}.columns.copy.from Yes A comma-separated list of columns to copy from. For example, column1.fieldA, column2.fieldA.
transformer.{transformer-id}.columns.copy.to Yes A comma-separated list of new column names. For example, newColumn.col1_fieldA, newColumn.col2_fieldA.

Example

Given a dataframe with the following schema

 |-- column1
 |    |-- fieldA
 |    |-- fieldB
 |-- column2
 |    |-- fieldA
 |-- column3

Then, the following column parameters

  • transformer.{transformer-id}.columns.copy.from=column1.fieldA, column2.fieldA
  • transformer.{transformer-id}.columns.copy.to=newColumn.col1_fieldA, newColumn.col2_fieldA

will produce the following schema

 |-- column1
 |    |-- fieldA
 |    |-- fieldB
 |-- column2
 |    |-- fieldA
 |-- column3
 |-- newColumn
 |    |-- col1_fieldA
 |    |-- col2_fieldA

DeduplicateKafkaSinkTransformer

DeduplicateKafkaSinkTransformer deduplicates records in a query from a Kafka source to a Kafka destination in a rerun after a failure. Records are identified across the source and destination topics by a user-defined id, which may be a composite id and may include consumer record properties such as offset and partition, but also fields from the key or value schema. Deduplication is needed because the Kafka destination provides only an at-least-once guarantee. Deduplication works by getting the ids of the last partial run from the destination topic and excluding them in the query.

Note that there must be only one source and one destination topic, only one writer may write to the destination topic, and no records may have been written to the destination topic after the partial run. Otherwise, records may still be duplicated.

To use this transformer, KafkaStreamReader, ConfluentAvroDecodingTransformer, ConfluentAvroEncodingTransformer and KafkaStreamWriter must be configured as well.

Note that usage of the star-operator * within column names is not supported and may lead to unexpected behaviour.

To add the transformer to the pipeline use this class name:

component.transformer.class.{transformer-id} = za.co.absa.hyperdrive.ingestor.implementation.transformer.deduplicate.kafka.DeduplicateKafkaSinkTransformer
Property Name Required Description
transformer.{transformer-id}.source.id.columns Yes A comma-separated list of consumer record properties that define the composite id. For example, offset, partition or key.some_user_id.
transformer.{transformer-id}.destination.id.columns Yes A comma-separated list of consumer record properties that define the composite id. For example, value.src_offset, value.src_partition or key.some_user_id.
transformer.{transformer-id}.kafka.consumer.timeout No Kafka consumer timeout in seconds. The default value is 120s.

The following fields can be selected on the consumer record

  • topic
  • offset
  • partition
  • timestamp
  • timestampType
  • serializedKeySize
  • serializedValueSize
  • key
  • value

In case of key and value, the fields of their schemas can be specified by adding a dot, e.g. key.some_nested_record.some_id or likewise value.some_nested_record.some_id

See Pipeline settings for details about {transformer-id}.
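
A possible configuration, assuming the transformer is registered under the id kafka.deduplicator and the id columns are placeholders:

component.transformer.id.3=kafka.deduplicator
component.transformer.class.kafka.deduplicator=za.co.absa.hyperdrive.ingestor.implementation.transformer.deduplicate.kafka.DeduplicateKafkaSinkTransformer
transformer.kafka.deduplicator.source.id.columns=offset, partition
transformer.kafka.deduplicator.destination.id.columns=value.src_offset, value.src_partition
transformer.kafka.deduplicator.kafka.consumer.timeout=180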

ParquetStreamWriter
Property Name Required Description
writer.parquet.destination.directory Yes Destination path of the sink. Equivalent to Spark property path for the DataStreamWriter
writer.parquet.partition.columns No Comma-separated list of columns to partition by.
writer.parquet.metadata.check No Set this option to true if the consistency of the metadata log should be checked prior to the query. For very large tables, the check may be very expensive
writer.common.trigger.type No See Combination writer properties
writer.common.trigger.processing.time No See Combination writer properties

Any additional properties for the DataStreamWriter can be added with the prefix writer.parquet.options, e.g. writer.parquet.options.key=value
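
An example writer configuration; the destination path is a placeholder, and the compression option merely illustrates the writer.parquet.options prefix:

writer.parquet.destination.directory=/tmp/destination
writer.parquet.partition.columns=hyperdrive_date, hyperdrive_version
writer.parquet.metadata.check=true
writer.parquet.options.compression=snappy
writer.common.checkpoint.location=/tmp/checkpoint/my-topic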

KafkaStreamWriter
Property Name Required Description
writer.kafka.topic Yes The name of the kafka topic to write data to. Equivalent to Spark property topic
writer.kafka.brokers Yes List of kafka broker URLs. Equivalent to Spark property kafka.bootstrap.servers
writer.common.trigger.type No See Combination writer properties
writer.common.trigger.processing.time No See Combination writer properties
MongoDbStreamWriter
Property Name Required Description
writer.mongodb.uri Yes Output MongoDB URI, e.g. mongodb://host:port/database.collection.
writer.mongodb.database No Database name (if not specified as the part of URI).
writer.mongodb.collection No Collection name (if not specified as the part of URI).
writer.common.trigger.type No See Combination writer properties
writer.common.trigger.processing.time No See Combination writer properties

Any additional properties for the DataStreamWriter can be added with the prefix writer.mongodb.options, e.g. writer.mongodb.options.key=value

Common MongoDB additional options

Property Name Default Description
writer.mongodb.option.spark.mongodb.output.ordered true When set to false inserts are done in parallel, increasing performance, but the order of documents is not preserved.
writer.mongodb.option.spark.mongodb.output.forceInsert false Forces saves to use inserts, even if a Dataset contains _id.
More on these options: https://docs.mongodb.com/spark-connector/current/configuration
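
An example configuration; the URI and checkpoint path are placeholders:

writer.mongodb.uri=mongodb://localhost:27017/mydatabase.mycollection
writer.mongodb.option.spark.mongodb.output.ordered=false
writer.common.checkpoint.location=/tmp/checkpoint/mongodb-sink
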
DeltaCDCToSnapshotWriter
Property Name Required Description
writer.deltacdctosnapshot.destination.directory Yes Destination path of the sink. Equivalent to Spark property path for the DataStreamWriter
writer.deltacdctosnapshot.partition.columns No Comma-separated list of columns to partition by.
writer.deltacdctosnapshot.key.column Yes A column with a unique entity identifier.
writer.deltacdctosnapshot.operation.column Yes A column containing the value that marks a record's operation.
writer.deltacdctosnapshot.operation.deleted.values Yes Values marking a record for deletion in the operation column.
writer.deltacdctosnapshot.precombineColumns Yes When two records have the same key, the one with the largest values in the precombine columns is kept. Evaluated in the provided order.
writer.deltacdctosnapshot.precombineColumns.customOrder No A precombine column's custom order, in ascending order.
writer.common.trigger.type No See Combination writer properties
writer.common.trigger.processing.time No See Combination writer properties

Any additional properties for the DataStreamWriter can be added with the prefix writer.deltacdctosnapshot.options, e.g. writer.deltacdctosnapshot.options.key=value

Example

  • component.writer=za.co.absa.hyperdrive.compatibility.impl.writer.cdc.delta.snapshot.DeltaCDCToSnapshotWriter
  • writer.deltacdctosnapshot.destination.directory=/tmp/destination
  • writer.deltacdctosnapshot.key.column=key
  • writer.deltacdctosnapshot.operation.column=ENTTYP
  • writer.deltacdctosnapshot.operation.deleted.values=DL,FD
  • writer.deltacdctosnapshot.precombineColumns=TIMSTAMP, ENTTYP
  • writer.deltacdctosnapshot.precombineColumns.customOrder.ENTTYP=PT,FI,RR,UB,UP,DL,FD
DeltaCDCToSCD2Writer
Property Name Required Description
writer.deltacdctoscd2.destination.directory Yes Destination path of the sink. Equivalent to Spark property path for the DataStreamWriter
writer.deltacdctoscd2.partition.columns No Comma-separated list of columns to partition by.
writer.deltacdctoscd2.key.column Yes A column with a unique entity identifier.
writer.deltacdctoscd2.timestamp.column Yes A column containing the timestamp.
writer.deltacdctoscd2.operation.column Yes A column containing the value that marks a record's operation.
writer.deltacdctoscd2.operation.deleted.values Yes Values marking a record for deletion in the operation column.
writer.deltacdctoscd2.precombineColumns Yes When two records have the same key and timestamp, the one with the largest values in the precombine columns is kept. Evaluated in the provided order.
writer.deltacdctoscd2.precombineColumns.customOrder No A precombine column's custom order, in ascending order.
writer.common.trigger.type No See Combination writer properties
writer.common.trigger.processing.time No See Combination writer properties

Any additional properties for the DataStreamWriter can be added with the prefix writer.deltacdctoscd2.options, e.g. writer.deltacdctoscd2.options.key=value

Example

  • component.writer=za.co.absa.hyperdrive.compatibility.impl.writer.cdc.delta.scd2.DeltaCDCToSCD2Writer
  • writer.deltacdctoscd2.destination.directory=/tmp/destination
  • writer.deltacdctoscd2.key.column=key
  • writer.deltacdctoscd2.timestamp.column=TIMSTAMP
  • writer.deltacdctoscd2.operation.column=ENTTYP
  • writer.deltacdctoscd2.operation.deleted.values=DL,FD
  • writer.deltacdctoscd2.precombineColumns=ENTTYP
  • writer.deltacdctoscd2.precombineColumns.customOrder.ENTTYP=PT,FI,RR,UB,UP,DL,FD
HudiCDCToSCD2Writer
Property Name Required Description
writer.hudicdctoscd2.destination.directory Yes Destination path of the sink. Equivalent to Spark property path for the DataStreamWriter
writer.hudicdctoscd2.partition.columns No Comma-separated list of columns to partition by.
writer.hudicdctoscd2.key.column Yes A column with a unique entity identifier.
writer.hudicdctoscd2.timestamp.column Yes A column containing the timestamp.
writer.hudicdctoscd2.operation.column Yes A column containing the value that marks a record's operation.
writer.hudicdctoscd2.operation.deleted.values Yes Values marking a record for deletion in the operation column.
writer.hudicdctoscd2.precombineColumns Yes When two records have the same key and timestamp, the one with the largest values in the precombine columns is kept. Evaluated in the provided order.
writer.hudicdctoscd2.precombineColumns.customOrder No A precombine column's custom order, in ascending order.
writer.common.trigger.type No See Combination writer properties
writer.common.trigger.processing.time No See Combination writer properties

Any additional properties for the DataStreamWriter can be added with the prefix writer.hudicdctoscd2.options, e.g. writer.hudicdctoscd2.options.key=value

Example

  • component.writer=za.co.absa.hyperdrive.compatibility.impl.writer.cdc.hudi.scd2.HudiCDCToSCD2Writer
  • writer.hudicdctoscd2.destination.directory=/tmp/destination
  • writer.hudicdctoscd2.key.column=key
  • writer.hudicdctoscd2.timestamp.column=TIMSTAMP
  • writer.hudicdctoscd2.operation.column=ENTTYP
  • writer.hudicdctoscd2.operation.deleted.values=DL,FD
  • writer.hudicdctoscd2.precombineColumns=ENTTYP
  • writer.hudicdctoscd2.precombineColumns.customOrder.ENTTYP=PT,FI,RR,UB,UP,DL,FD

Common writer properties

Property Name Required Description
writer.common.checkpoint.location Yes Used for Spark property checkpointLocation. The checkpoint location has to be unique among different workflows.
writer.common.trigger.type No Either Once for one-time execution or ProcessingTime for micro-batch execution. Default: Once.
writer.common.trigger.processing.time No Interval in ms for micro-batch execution (using ProcessingTime). Default: 0ms, i.e. execution as fast as possible.
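
For example, a long-running micro-batch query could be configured like this; the interval and checkpoint path are placeholders:

writer.common.checkpoint.location=/tmp/checkpoint/my-workflow
writer.common.trigger.type=ProcessingTime
writer.common.trigger.processing.time=60000
ingestor.spark.termination.method=awaitTermination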

Behavior of Triggers

Trigger (writer.common.trigger.type) Termination method (ingestor.spark.termination.method) Runtime Details
Once AwaitTermination or ProcessAllAvailable Limited Consumes all data that is available at the beginning of the micro-batch. The query processes exactly one micro-batch and stops then, even if more data would be available at the end of the micro-batch.
Once AwaitTermination with timeout Limited Same as above, but terminates at the timeout. If the timeout is reached before the micro-batch is processed, it won't be completed and no data will be committed.
ProcessingTime ProcessAllAvailable Only long-running if topic continuously produces messages, otherwise limited Consumes all available data in micro-batches and only stops when no new data arrives, i.e. when the available offsets are the same as in the previous micro-batch. Thus, whether and when the query terminates depends entirely on the topic.
ProcessingTime AwaitTermination with timeout Limited Consumes data in micro-batches and only stops when the timeout is reached or the query is killed.
ProcessingTime AwaitTermination Long-running Consumes data in micro-batches and only stops when the query is killed.
  • Note 1: The first micro-batch of the query will contain all available messages to consume and can therefore be quite large, even if the trigger ProcessingTime is configured, and regardless of what micro-batch interval is configured. To limit the size of a micro-batch, the property reader.option.maxOffsetsPerTrigger should be used. See also http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
  • Note 2: It's possible to define a timeout for trigger Once. If the timeout is reached before the micro-batch is processed, it won't be completed and no data will be committed. Such a behavior seems quite unpredictable and therefore we don't recommend it.

See the Spark Documentation for more information about triggers.

Hyperdrive Context

HyperdriveContext is an object intended to be used by the components to share data. It is a key-value store, where the key is a string and the value can be of any type. The following context variables are currently used by the default implementation.

Name Type Description
key.column.prefix String If ConfluentAvroDecodingTransformer is configured to consume keys, it prefixes the key columns with key__ such that they can be distinguished in the dataframe. If key__ happens to be a prefix of a value column, a random alphanumeric string is used instead.
key.column.names Seq[String] If ConfluentAvroDecodingTransformer is configured to consume keys, it contains the original column names (without prefix) in the key schema.

Secrets Providers

AWS SecretsManager
Property Name Required Description
secretsprovider.config.providers.<provider-id>.class Yes The fully qualified class name of the secrets provider. <provider-id> is an arbitrary string. Multiple secrets providers can be configured by supplying multiple <provider-id>s
secretsprovider.config.defaultprovider No The <provider-id> of the secrets provider to be used by default
secretsprovider.secrets.<secret-id>.options.secretname Yes The secret name of the secret in AWS Secrets Manager. <secret-id> is an arbitrary string. Multiple secrets can be configured by supplying multiple <secret-id>s
secretsprovider.secrets.<secret-id>.options.provider No The <provider-id> of the secrets provider to be used for this specific secret.
secretsprovider.secrets.<secret-id>.options.readasmap No Set to true if the secret should be interpreted as a json map, set to false if the value should be read as is. Default: true
secretsprovider.secrets.<secret-id>.options.key No If the secret should be read as a map, specify the key whose value should be extracted as the secret
secretsprovider.secrets.<secret-id>.options.encoding No Decodes the secret. Valid values: base64

The Secrets Provider will fill the configuration property secretsprovider.secrets.<secret-id>.secretvalue with the secret value. This configuration key will be available for string interpolation to be used by other configuration properties.

Example

  • secretsprovider.config.providers.awssecretsmanager.class=za.co.absa.hyperdrive.driver.secrets.implementation.aws.AwsSecretsManagerSecretsProvider
  • secretsprovider.secrets.truststorepassword.provider=awssecretsmanager
  • secretsprovider.secrets.truststorepassword.options.secretname=<the-secret-name>
  • secretsprovider.secrets.truststorepassword.options.key=<the-secret-key>
  • secretsprovider.secrets.truststorepassword.options.encoding=base64
  • reader.option.kafka.ssl.truststore.password=${secretsprovider.secrets.truststorepassword.secretvalue}

Other

Hyperdrive uses Apache Commons Configuration 2. This allows properties to be referenced, e.g. like so

transformer.[avro.decoder].schema.registry.url=http://localhost:8081
writer.kafka.schema.registry.url=${transformer.[avro.decoder].schema.registry.url}

Workflow Manager

Hyperdrive ingestions may be triggered using the Workflow Manager, which is developed in a separate repository: https://github.com/AbsaOSS/hyperdrive-trigger

A key feature of the Workflow Manager is triggers, which define when an ingestion should be executed and how it should be requested. The Workflow Manager supports cron-based triggers as well as triggers that listen to a notification topic.

How to build

  • Scala 2.12, Spark 2.4 (default)
mvn clean install
  • Scala 2.12, Spark 3.0
mvn clean install -Pscala-2.12,spark-3
  • Scala 2.11, Spark 2.4
mvn scala-cross-build:change-version -Pscala-2.11,spark-2
mvn clean install -Pscala-2.11,spark-2
mvn scala-cross-build:restore-version

E2E tests with Docker

E2E tests require a running Docker instance on the executing machine and are not executed by default. To execute them, build using the profile all-tests

mvn clean test -Pall-tests

How to measure code coverage

mvn clean install -Pscala-2.ZY,spark-Z,code-coverage

If the module contains measurable data, the code coverage report will be generated at the following path:

{project-path}\hyperdrive\{module}\target\site\jacoco

hyperdrive's People

Contributors

anilpinarozdemir, ashishtiwary, benedeki, dependabot[bot], felipemmelo, jozefbakus, kevinwallimann, miroslavpojer, senelesithole, yruslan, zejnilovic


hyperdrive's Issues

Refactor SparkIngestor to a configurable component

Currently, the SparkIngestor doesn't accept any configuration. This will be needed for long-running jobs to choose either StreamingQuery.awaitTermination or StreamingQuery.processAllAvailable.

Additionally, it's currently not possible to pass configuration options to the SparkSession via the config file or the command line

Tasks

  • Make SparkIngestor a class that accepts a SparkSession and Configuration which is instantiated by a companion object, much like the other components e.g. KafkaStreamReader etc.
  • Pass all options with prefix ingestor.spark.options. on to the SparkSession

Update Readme

Update readme with descriptions for spark summit, and add documentation for components

Express ConfluentAvroKafkaDecoder as transformer

KafkaStreamReader.read should return DataFrame
ConfluentAvroKafkaDecoder.decode could then accept DataFrame and return DataFrame, making it a Transformer, e.g. za.co.absa.hyperdrive.ingestor.implementation.transformer.avro.ConfluentAvroDecoderTransformer

Then, the Decoder component is not needed anymore.

How to migrate Hyperdrive-Trigger

  1. Replace
"component.decoder=za.co.absa.hyperdrive.ingestor.implementation.decoder.avro.confluent.ConfluentAvroKafkaStreamDecoder"

with

"component.transformer.id.0=confluent.avro.decoder", "component.transformer.class.confluent.avro.decoder=za.co.absa.hyperdrive.ingestor.implementation.transformer.avro.ConfluentAvroDecoderTransformer"
  2. Replace
decoder.avro

with

transformer.confluent.avro.decoder

Refactor ParquetPartitioningStreamWriter to transformer

ParquetPartitioningStreamWriter does two things: It adds two columns (i.e. transformation) and writes the dataframe partitioned (special write). With #116 the two responsibilities can be separated: ParquetStreamWriter is enhanced to write partitioned. Thus, only the transformation is left for ParquetPartitioningStreamWriter.

Tasks

  • Refactor ParquetPartitioningStreamWriter to a transformer and rename
  • Merge AbstractParquetStreamWriter with ParquetStreamWriter

How to migrate Hyperdrive-Trigger

  1. Replace
"component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetPartitioningStreamWriter"

with

"component.transformer.id.2=add.date.version", "component.transformer.class.add.date.version=za.co.absa.hyperdrive.ingestor.implementation.transformer.add.dateversion.AddDateVersionTransformer",
"component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetStreamWriter"
  2. Replace
writer.parquet.partitioning.report.date

with

transformer.add.date.version.report.date
  3. Replace
"writer.parquet.destination

with

"transformer.add.date.version.destination=${writer.parquet.destination}", "writer.parquet.partition.columns=hyperdrive_date, hyperdrive_version", "writer.parquet.destination

Make sure there is no workflow using ParquetPartitioning and partition columns at the same time

Clean up shared module

Currently, not all classes in module shared are really used in multiple modules. With #99 the module needs to be published to maven. To keep its footprint small, only classes whose responsibilities span across multiple modules should be kept in shared. All other classes should be scoped package private on za.co.absa.hyperdrive.

Tasks

  • ConfigurationKeys: Move to ingestor-default package, only move object IngestorKeys to driver module.
  • IngestionException and IngestionStartException: Move to driver module
  • FileUtils, ConfigUtils should be moved to ingestor-default package. Also merge it with the SchemaRegistrySettingsUtil from #108 . They should be made package private on za.co.absa.hyperdrive
  • SparkUtils should be removed since it has no usages. With it, also TestSparkUtils should be removed. Then, the dependency on testutils can be removed. testutils can be removed altogether.

Remove unreachable code in KafkaStreamReader

Currently, the else-block in the following snippet in KafkaStreamReader is unreachable.

    val optionalKeys = getKeysFromPrefix(configuration, rootFactoryOptionalConfKey)

    val extraConfs = optionalKeys.foldLeft(Map[String,String]()) {
      case (map,securityKey) =>
        getOrNone(securityKey, configuration) match {
          case Some(value) => map + (tweakOptionKeyName(securityKey) -> value)
          case None => map
        }
    }

    if (extraConfs.isEmpty || extraConfs.size == optionalKeys.size) {
      extraConfs
    }
    else {
      logger.warn(s"Assuming no security settings, since some appear to be missing: {${findMissingKeys(optionalKeys, extraConfs)}}")
      Map[String,String]()
    }

optionalKeys gets keys from the configuration and extraConfs gets the optionalKeys again from configuration. configuration does not change between the two calls. Therefore, extraConfs will always have the same size as optionalKeys

Moreover, even if some settings were missing, they should not be removed, as this could lead to unexpected error messages downstream.
Therefore, the if-else-block should be removed.
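
For illustration, after dropping the if-else block the snippet could be reduced to returning extraConfs directly:

    val optionalKeys = getKeysFromPrefix(configuration, rootFactoryOptionalConfKey)

    // Keep every optional key that has a value; missing keys are simply omitted.
    optionalKeys.foldLeft(Map[String, String]()) {
      case (map, securityKey) =>
        getOrNone(securityKey, configuration) match {
          case Some(value) => map + (tweakOptionKeyName(securityKey) -> value)
          case None => map
        }
    }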

Don't use multiple = characters to encode extra configuration

Currently, extra configuration (with keys only known at runtime) is provided as follows:
writer.parquet.extra.conf.1=key1=value1

This format is not intuitive. Also the .1 in the key does not convey any information. Therefore, the extra configuration should be changed to
writer.parquet.option.key=value

To that end, Configuration.subset and Configuration.keys might be used

Unfortunately, this is inconsistent with reader.option.key=value, but it's consistent with all other configuration properties which include an identifier for the component. Arguably, reader.option.key=value should be changed to reader.kafka.option.key=value even though this results in properties like reader.kafka.option.kafka.security.protocol

Breaking changes
Configuration property naming pattern for extra configuration changes from writer.parquet.extra.conf.1=key1=value1 to writer.parquet.options.key1=value1

Fix inconsistencies between CommandLineIngestorDriver and PropertiesIngestorDriver

Fix inconsistencies:

  1. In CommandLineIngestorDriver whitespace in key and value is not trimmed, while it is trimmed in PropertiesIngestorDriver. E.g. key1 = value1 will be stored as key1 (with whitespace at end) with CommandLineIngestorDriver, but as key1 in PropertiesIngestorDriver. The behavior of PropertiesIngestorDriver is the expected one.

  2. An empty property, i.e. some.key= causes an exception in CommandLineIngestorDriver, but is allowed in PropertiesIngestorDriver. This is inconsistent. Either it should always cause an exception, or it should be allowed. Probably the behavior of PropertiesIngestorDriver should be favoured since it directly uses the behaviour of Configuration2. Users familiar with Configuration2 would find a differing behaviour unexpected.

Add Jenkinsfile to the project

Add basic Jenkinsfile to the project so we can do automatic builds on a creation of a PR
Consult @hamc17

def hyperdriveSlaveLabel = getHyperdriveSlaveLabel()
def toolVersionJava = getToolVersionJava()
def toolVersionMaven = getToolVersionMaven()
def toolVersionGit = getToolVersionGit()
def mavenSettingsId = getMavenSettingsId()

pipeline {
    agent {
        label "${hyperdriveSlaveLabel}"
    }
    tools {
        jdk "${toolVersionJava}"
        maven "${toolVersionMaven}"
        git "${toolVersionGit}"
    }
    options {
        buildDiscarder(logRotator(numToKeepStr: '20'))
        timestamps()
    }
    stages {
        stage ('Build') {
            steps {
                configFileProvider([configFile(fileId: "${mavenSettingsId}", variable: 'MAVEN_SETTINGS_XML')]) {
                    sh "mvn -s $MAVEN_SETTINGS_XML clean package"
                }
            }
        }
    }
}

Define default error message

The methods getOrThrow and getSeqOrThrow in ConfigUtils might throw an IllegalArgumentException without an error message. This may lead to errors that are hard to understand.

Sensible default messages should be provided.
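
A possible sketch of such a default, assuming a signature along the lines of the existing getOrThrow(key, configuration, errorMessage) usage; the exact signature in ConfigUtils may differ:

import org.apache.commons.configuration2.Configuration

def getOrThrow(key: String, configuration: Configuration, errorMessage: String = ""): String =
  Option(configuration.getString(key)).filter(_.nonEmpty).getOrElse {
    val message =
      if (errorMessage.nonEmpty) errorMessage
      else s"No value found for mandatory configuration property '$key'" // sensible default
    throw new IllegalArgumentException(message)
  }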

Refactor ConfigurationKeys

Remove ConfigurationKeys. Move all objects to the newly created ...Attributes objects to break the coupling induced by ConfigurationKeys

Configuration property keys of components should be accessible via reflection

Currently, the configuration property keys are hard-coded in the ingestor (or any custom component jar). This makes it impossible to infer the required configuration properties, given just the jars. It should be possible that an external program (e.g. the hyperdrive-trigger) can read the available configuration properties of a component (through reflection) if the jars are given.

To that end, every ComponentFactory should implement a method that returns a list of configuration properties. Furthermore, this list should also contain information whether a configuration property is required or optional, and some validation rules.
Some components may depend on each other, e.g. the CheckpointOffsetManager requires the KafkaStreamReader to be configured (or at least the property reader.kafka.topic must be defined). This should be taken into account as well.

Possible breakdown of this issue:

  1. List of config properties with required/optional info.
  2. Add validation rules to list
  3. Provide e.g. another method to define dependencies on other components (multiple components, either component A or component B,...)

Migration from older versions
Classes that implemented StreamReaderFactory, OffsetManagerFactory, StreamDecoderFactory, StreamTransformerFactory or StreamWriterFactory must now implement the trait HasComponentAttributes

Remove retention policy

With ABRiS 3.1.0, the retention policy is removed from the library. SchemaRetentionPolicies are not needed for version 3 since you can easily select what you need using standard Spark functions.

As it is, the retention policy is never used, even though it is a mandatory configuration property. Therefore, any retention policy configuration should be removed.

  1. Remove getSchemaRetentionPolicy from ConfluentAvroKafkaStreamDecoder
  2. Remove KEY_SCHEMA_RETENTION_POLICY from ConfigurationKeys
  3. Remove decoder.avro.schema.retention.policy from Ingestion.properties.template

Driver - if no properties file found, use the template

I suggest that if the build does not find any properties file in the resources of the driver module, the template is copied over, which, if I understand it correctly, should be usable for the basic use case of Hyperdrive and maybe to demonstrate how it works.

This is done for modules within Enceladus as well, from a pom file; the plugin used is maven-antrun-plugin.

Move CheckpointOffsetManager logic to Reader and Writer

The StreamManager has two methods with signatures

  def configure(streamReader: DataStreamReader, configuration: Configuration): DataStreamReader

  def configure(streamWriter: DataStreamWriter[Row], configuration: Configuration): DataStreamWriter[Row]

It's hard to imagine any other use case than the checkpoint location which would justify having the concept of a stream manager which configures solely reader and writer (but not the transformer for example). In fact, the implementation of the CheckpointOffsetManager can hardly be reused for a different data source since the concept of starting offsets is tightly coupled to kafka. Moreover, for the reader config, the checkpoint location is only needed to determine the starting offsets, which is an implementation detail. Arguably, it may be surprising to the developer that the starting offsets are not set in the KafkaStreamReader but in the CheckpointOffsetManager
For all these reasons, I believe the extra indirection from having a StreamManager concept is not justified.

manager.checkpoint.base.location can be replaced by reader.kafka.checkpoint.base.location, writer.kafka.base.location and writer.parquet.base.location. Another possibility is to replace it by a "top-level" property spark.ingestor.checkpoint.base.location, since the checkpoint location is mandatory for every structured streaming query (see https://github.com/apache/spark/blob/695cb617d42507eded9c7e50bc7cd5333bbe6f83/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L259)

Decision
Replace manager.checkpoint.base.location by writer.common.checkpoint.base.location because the checkpoint location is a config property for the DataStreamWriter per the Spark documentation. It's least surprising to add it as a writer property here as well. Unfortunately, the reader will have to depend on this property as well.
Also, rename checkpoint.base.location to checkpoint.location, since the concept of a base location has been given up in #85

Migration for Trigger
Replace

manager.checkpoint.base.location

with

writer.common.checkpoint.location

Replace

"component.manager=za.co.absa.hyperdrive.ingestor.implementation.manager.checkpoint.CheckpointOffsetManager",

with
(empty string)

Write E2E test running in cluster mode

Some issues only occur in cluster mode and not in client mode, e.g. serialization issues. To catch these errors, a e2e test (probably using Docker) should be created that sets up a cluster and runs the ingestor in cluster mode.

Add Trigger.ProcessingTime to Writers

Currently, all jobs are ingested with Trigger.Once, i.e. all data is ingested into one parquet file (per kafka partition). Certain jobs may produce very large output files, leading to out of memory errors.

To prevent this, Trigger.ProcessingTime should be used.

New configuration property: writer.parquet.trigger
The expected value is the number of milliseconds.
If the value is not a number or the property is not present, all data should be ingested at once, as is the case now.

The change should be available for both ParquetStreamWriter and ParquetPartitioningStreamWriter
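
A sketch of how the writer could pick the trigger from an optional interval; the method and parameter names are hypothetical, only the Spark Trigger API is real:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

// If an interval is configured, run micro-batches at that interval; otherwise ingest everything at once.
def startQuery(df: DataFrame, triggerMillis: Option[Long], destination: String, checkpoint: String): StreamingQuery =
  df.writeStream
    .trigger(triggerMillis.map(Trigger.ProcessingTime(_)).getOrElse(Trigger.Once()))
    .format("parquet")
    .option("path", destination)
    .option("checkpointLocation", checkpoint)
    .start()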

Preserve nullability from Avro to Catalyst Schema

Currently, in a Kafka-to-Kafka (i.e. Avro -> Catalyst -> Avro) workflow (with columnselectortransformer), all fields are always nullable in the destination topic.
Example:
Source schema

{
  "type": "record",
  "name": "pageviews",
  "namespace": "ksql",
  "fields": [
    {
      "name": "viewtime",
      "type": "long"
    }
  ]
}

is written as

{
  "type": "record",
  "name": "pageviews",
  "namespace": "ksql",
  "fields": [
    {
      "name": "viewtime",
      "type": ["long", "null"]
    }
  ]
}

Expected: Non-nullable fields in the source Avro schema should also be non-nullable in the destination.
Nullable fields should stay nullable obviously.

Migration note
Making an existing nullable field non-nullable is a forward-compatible change (it's almost like adding a field)

Extract encoding part of KafkaStreamWriter to transformer component

KafkaStreamWriter should not be dependent on Abris (confluent and Avro)
Thus, the encoding part should be extracted to a preceding transformer component

A new transformer should be created, e.g. za.co.absa.hyperdrive.ingestor.implementation.transformer.avro.ConfluentAvroEncoderTransformer

Breaking changes
Configuration properties will need to be adjusted

How to migrate Hyperdrive-Trigger

  1. Replace
"component.writer

with

"component.transformer.id.2=confluent.avro.encoder", "component.transformer.class.confluent.avro.encoder=za.co.absa.hyperdrive.ingestor.implementation.transformer.avro.ConfluentAvroEncoderTransformer",
"component.writer
  2. Replace
writer.kafka.schema

with

transformer.confluent.avro.encoder.schema
  3. Replace
writer.kafka.value

with

transformer.confluent.avro.encoder.value
  4. Replace
writer.kafka.produce.keys

with

transformer.confluent.avro.encoder.produce.keys
  5. Replace
writer.kafka.key

with

transformer.confluent.avro.encoder.key
  6. Replace
writer.kafka.option

with

transformer.confluent.avro.encoder.option

NoClassDefFoundError AbstractKafkaAvroSerDeConfig

Running the ingestor fails as of 01a1ecd with the following error message

13:26:56.095 [main] INFO  za.co.absa.hyperdrive.ingestor.implementation.manager.factories.OffsetManagerAbstractFactory$ - Going to load factory for configuration 'component.manager'.
13:26:56.100 [main] INFO  za.co.absa.hyperdrive.ingestor.implementation.manager.checkpoint.CheckpointOffsetManager$ - Going to create CheckpointOffsetManager instance using: topic='clickstream', checkpoint base location='/tmp/checkpoint-location'
13:26:56.102 [main] INFO  za.co.absa.hyperdrive.ingestor.implementation.decoder.factories.StreamDecoderAbstractFactory$ - Going to load factory for configuration 'component.decoder'.
Exception in thread "main" java.lang.NoClassDefFoundError: io/confluent/kafka/serializers/AbstractKafkaAvroSerDeConfig
	at za.co.absa.hyperdrive.ingestor.implementation.decoder.avro.confluent.ConfluentAvroKafkaStreamDecoder$.apply(ConfluentAvroKafkaStreamDecoder.scala:72)
	at za.co.absa.hyperdrive.ingestor.implementation.decoder.factories.StreamDecoderAbstractFactory$.build(StreamDecoderAbstractFactory.scala:39)
	at za.co.absa.hyperdrive.driver.IngestionDriver.getStreamDecoder(IngestionDriver.scala:63)
	at za.co.absa.hyperdrive.driver.IngestionDriver.ingest(IngestionDriver.scala:48)
	at za.co.absa.hyperdrive.driver.drivers.PropertiesIngestionDriver$.main(PropertiesIngestionDriver.scala:49)
	at za.co.absa.hyperdrive.driver.drivers.PropertiesIngestionDriver.main(PropertiesIngestionDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 18 more
19/11/28 13:26:56 INFO SparkContext: Invoking stop() from shutdown hook

io/confluent/kafka/serializers/AbstractKafkaAvroSerDeConfig is part of kafka-avro-serializer and was added as a test dependency for PR #59

The line in which it fails is
PARAM_SCHEMA_REGISTRY_URL -> getOrThrow(KEY_SCHEMA_REGISTRY_URL, configuration, errorMessage = s"Schema Registry URL not specified. Is '$KEY_SCHEMA_REGISTRY_URL' configured?")
Oddly enough, there is no reference to AbstractKafkaAvroSerDeConfig in the whole project.

Publish key from kafka source as key to kafka sink / HyperdriveContext

Currently, the key from a kafka source is not ingested in ConfluentAvroKafkaStreamDecoder. Also, no key is produced to the kafka sink in KafkaStreamWriter. It should be possible to publish the ingested keys along with the value.

Consuming keys is supported by Abris like so

val result: DataFrame  = dataFrame.select(
    from_confluent_avro(col("key"), keyRegistryConfig) as 'key,
    from_confluent_avro(col("value"), valueRegistryConfig) as 'value)

Tasks

  • Add config properties for decoder

    • decoder.avro.consume.keys (true / false)
    • decoder.avro.key.schema.naming.strategy
    • decoder.avro.key.schema.id
    • decoder.avro.key.schema.record.name
    • decoder.avro.key.schema.record.namespace
  • Add config properties for writer

    • writer.kafka.produce.keys (true / false)
    • writer.kafka.key.schema.naming.strategy
    • writer.kafka.key.schema.record.name
    • writer.kafka.key.schema.record.namespace
  • Add context object (HyperdriveContext) to which a value can be put and retrieved by key

    • To avoid key collisions, used keys should be documented
    • ConfluentAvroKafkaStreamDecoder should put the key column names, KafkaStreamWriter should retrieve them

Preserve avro schema from Avro to Catalyst and from Catalyst to Avro

Background

Currently, the org.apache.spark.sql.avro.SchemaConverters class is used to derive an avro type from a catalyst type. However, in an Avro -> Catalyst -> Avro query, this conversion is lossy. For example, information such as the default value, documentation or any custom json field in the avro schema is lost when converting to catalyst, and e.g. the avro types BYTES and FIXED are both converted to the same catalyst type DecimalType (or BinaryType if no avro logical type is present).

For example, a BYTES type with logical type decimal

    "type" : {
      "type" : "bytes",
      "logicalType" : "decimal",
      "precision" : 8,
      "scale" : 3
    }

is converted to the Spark type DecimalType, which is in turn converted to the Avro type

    "type" : {
      "type" : "fixed",
      "name" : "fixed",
      "namespace" : "topLevelRecord.fieldName",
      "size" : 10,
      "logicalType" : "decimal",
      "precision" : 8,
      "scale" : 3
    }

Furthermore, default values in the source Avro schema are discarded in the Avro -> Catalyst conversion. This has two consequences: 1) Obviously, there's no way to generate a target Avro schema with that default value, and 2) a nullable type in avro is expressed as a union with null, where null is the first type in the union if and only if the default value is null. However, if the default value is unknown, there's no way to determine whether null should be the first or second type in the union.

For example, when a record schema with a default value

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "stringCol",
    "type" : [ "string", "null" ],
    "default" : "abcd"
  } ]
}

is converted to a StructType, the default value is lost

StructType(Seq(StructField("stringCol", StringType, nullable = true)))

and when the type is converted to an Avro type, the null type is in front of the string type in the union.

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "stringCol",
    "type" : [ "null", "string" ]
  } ]
}

Feature

There are two approaches.

  1. The source schema could be used as an input to generate the target schema. However, since the Spark schema can change due to any number of Spark transformations (renamings, column additions, etc.), there is no generic, straightforward way to map a field in the source schema to a field in the target schema. A heuristic approach would be needed to decide which fields of the input Avro schema correspond to which fields of the output Avro schema.

  2. The metadata object of Spark's StructField can be used to transport the information from the source Avro schema into the Spark schema and from there to the target Avro schema. This only works for Avro schemas whose root type is a record type; it doesn't work if the Avro schema is just a map or an array, for example, because these are mapped directly to a MapType or ArrayType, which have no metadata field, instead of being wrapped in a StructField. A minimal sketch of this approach follows below.
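
A minimal sketch of approach 2: the metadata key "avro.default" used here is an assumption for illustration, not an existing Spark or Hyperdrive convention.

import org.apache.spark.sql.types._

// Carry the Avro default value through StructField metadata.
val metadata = new MetadataBuilder()
  .putString("avro.default", "abcd") // assumed metadata key
  .build()

val sparkSchema = StructType(Seq(
  StructField("stringCol", StringType, nullable = true, metadata)))

// When deriving the target Avro schema, the default could be read back:
val default = sparkSchema("stringCol").metadata.getString("avro.default")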

Other

This issue only deals with default values and logical types. It may be extended to also support custom fields in the Avro schema in a separate issue.
See also #137
https://issues.apache.org/jira/browse/SPARK-28008

Create profile to run integration tests separately

Currently, no distinction is made among tests and all tests run with every build.

Now, a profile for integration tests should be created so that integration tests don't run with mvn clean test. The goal is to decrease the build time while still discovering errors early thanks to the unit tests.

For now, all tests extending SparkTestBase should be marked as integration tests. It's absolutely essential that integration tests still run on Jenkins builds!
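
One possible way to mark such tests, sketched here under the assumption that ScalaTest tags are used and that a Maven profile drives the scalatest-maven-plugin's include/exclude settings; the tag name and spec are hypothetical.

import org.scalatest.{FlatSpec, Tag}

// Hypothetical tag; a profile could exclude it from mvn clean test
// and include it in a dedicated integration-test run.
object IntegrationTestTag extends Tag("za.co.absa.hyperdrive.IntegrationTest")

class ExampleIntegrationSpec extends FlatSpec {
  "an ingestion" should "run end to end" taggedAs IntegrationTestTag in {
    assert(1 + 1 == 2) // placeholder body; real tests would extend SparkTestBase
  }
}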

Remove nulls

There are many null checks for arguments in the codebase, e.g.

  override def write(dataFrame: DataFrame, streamManager: StreamManager): StreamingQuery = {
    if (dataFrame == null) {
      throw new IllegalArgumentException("Null DataFrame.")
    }

Since null should not be used in Scala, an occurrence of null indicates a serious problem that cannot be meaningfully handled. Throwing an IllegalArgumentException instead of letting the program fail with a NullPointerException adds little information.

Task

  • Remove null checks
  • Wrap null values coming from Java code in Option (see the sketch below)
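
A minimal sketch of the second point, assuming Commons Configuration 2, whose getString returns null for missing keys.

import org.apache.commons.configuration2.BaseConfiguration

// Wrap a possibly-null value from a Java API in Option instead of null-checking.
val configuration = new BaseConfiguration()
configuration.setProperty("schema.registry.url", "http://localhost:8081") // example value

val schemaRegistryUrl: Option[String] =
  Option(configuration.getString("schema.registry.url"))

val url = schemaRegistryUrl.getOrElse(
  throw new IllegalArgumentException("Schema Registry URL not specified"))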

Identify component factories by identifier string

Description
Currently, component factories are loaded in ClassLoaderUtils given their fully qualified class names. The class name is passed via the configuration (e.g. component.writer).

That means components cannot be refactored (renamed, moved to a different package) without introducing a breaking change that would require updating any existing configuration that uses the component.

Tasks

  • Add a method getIdentifier: String to the interface ComponentFactory. getClass.getName may be used as the default value, so this feature won't be a breaking change.
  • Implementing components are responsible for providing a unique identifier. It's advisable to prefix the identifier with a human-readable name, because it will be referenced in the configuration, logged, etc.
  • Use getIdentifier to load the factory in ClassLoaderUtils. Currently, the class is loaded directly given its class name, which doesn't lend itself to loading a factory by identifier. With #83, component factories can be loaded using the Service Provider Interface (SPI), i.e. with ServiceLoader; since all factories expose the getIdentifier method, the matching factory can be found that way (see the sketch below).
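
A rough sketch of the identifier-based lookup via SPI; the trait shape is simplified here and not the actual Hyperdrive API.

import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Simplified factory trait with the proposed identifier method.
// Defaulting to the class name keeps the change non-breaking.
trait ComponentFactory {
  def getIdentifier: String = getClass.getName
}

// Resolve a factory by identifier using the Service Provider Interface.
def loadFactory(identifier: String): ComponentFactory =
  ServiceLoader.load(classOf[ComponentFactory]).asScala
    .find(_.getIdentifier == identifier)
    .getOrElse(throw new IllegalArgumentException(s"No factory found for identifier '$identifier'"))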

Other

  • The same identifier might be used by each component to prefix its configuration properties to avoid name clashes.

Support for multiple transformers

Currently, only one transformer can be specified.

It's likely that a future use-case will require multiple chainable transformers.

The configuration parameter component.transformer could take a comma-separated list instead of just one value, where the order of the list specifies the order of execution. However, that wouldn't allow the same transformer to be used multiple times.

Components should be configured like this:

component.transformer.id.1=[id]
component.transformer.class.[id]=za.co.absa.hyperdrive.ingestor.my.transformer
transformer.[id].property.a = "value a"
transformer.[id].property.b = "value b"

component.transformer.id.2=csst
component.transformer.class.csst=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer
transformer.csst.columns.to.select=*

component.transformer.id.3=csst2
component.transformer.class.csst2=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer
transformer.csst2.columns.to.select="special_column"

Why the prefixes component.transformer.class and transformer? This prevents name conflicts.

At runtime, transformers would only receive their specific config subset, i.e. in the above example, the transformer with id [id] gets property.a => "value a", property.b => "value b" in its transform method instead of the full configuration as it does now. If cross-component configuration is necessary, the HyperdriveContext may be utilized.
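
A minimal sketch of the subsetting, assuming Commons Configuration 2's subset method is what delivers the per-transformer view.

import org.apache.commons.configuration2.BaseConfiguration

// Each transformer would only see its own keys, stripped of the id prefix.
val configuration = new BaseConfiguration()
configuration.setProperty("transformer.csst.columns.to.select", "*")
configuration.setProperty("transformer.csst2.columns.to.select", "special_column")

val csstConfig = configuration.subset("transformer.csst")
// csstConfig.getString("columns.to.select") == "*"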

The order of the transformers is determined by the number after component.transformer.id. An error is thrown if it's not an integer. The number may be negative, and order numbers need not be consecutive, i.e. no error is thrown if one transformer has component.transformer.id.2 and the other component.transformer.id.-1.

Tasks

  • StreamTransformerAbstractFactory.build should return a list of transformers.
  • build should call the apply method of the companion object only with a configuration subset using the id (e.g. csst in the above example)
  • SparkIngestor.ingest should accept a list of transformers and loop through them (fold; see the sketch after this list)
  • Update tests
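
A minimal sketch of the fold over transformers; the transform signature is assumed here for illustration.

import org.apache.spark.sql.DataFrame

// Assumed transformer interface for the sketch.
trait StreamTransformer {
  def transform(dataFrame: DataFrame): DataFrame
}

// Chain the transformers in order; an empty list returns the dataframe unchanged.
def applyTransformers(initial: DataFrame, transformers: Seq[StreamTransformer]): DataFrame =
  transformers.foldLeft(initial)((df, transformer) => transformer.transform(df))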

Note

  • As a by-product, transformers will be optional. If no transformer is specified, the list of transformers will be empty, thus the dataframe will directly be passed from the decoder (reader from v4.0.0) to the writer.

How to migrate
ColumnSelectorStreamTransformer

  • ConfigurationsKeys.ColumnSelectorStreamTransformerKeys.KEY_COLUMNS_TO_SELECT should be "columns.to.select"
  • Existing configuration in the Trigger DB
  1. Replace
"component.transformer=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer"

with

"component.transformer.id.1=column.selector", "component.transformer.class.column.selector=za.co.absa.hyperdrive.ingestor.implementation.transformer.column.selection.ColumnSelectorStreamTransformer"
  2. Replace
transformer.columns.to.select=

with

transformer.column.selector.columns.to.select=

Alternatively, the whole column selector transformation config can be removed if all of the jobs only use select all.

HyperConformance

  • In za.co.absa.enceladus.conformance.HyperConformanceAttributes, search and replace s"$keysPrefix. with "
  • Existing configuration in the Trigger DB
    Replace
"component.transformer=za.co.absa.enceladus.conformance.HyperConformance"

with

"component.transformer.id.1=hyperconformance","component.transformer.class.hyperconformance=za.co.absa.enceladus.conformance.HyperConformance"

Transformer-specific configuration already happens to be correct and does not need to be changed.

Remove Finalizer from API

The Finalizer component was added to the API in 3e442c0. Its intended use-case was to copy the ingested data to another folder (raw / publish). However, this has been solved differently using a separate jar.
Moreover, the Finalizer may break the exactly-once fault-tolerance of the ingestor as a whole.
Since the Finalizer was not part of the API in version 1.0.0, removing the finalizer does not break backward-compatibility.

Make component constructors private

Currently, the components have public constructors, but they are only used from the companion objects. The constructors should be made private; the tests for the class and the companion object can then also be merged.
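
A minimal sketch of the intended pattern; the names here are illustrative, not the actual components.

// A private constructor forces instantiation through the companion object.
class ExampleStreamWriter private (destination: String) {
  def describe(): String = s"writes to $destination"
}

object ExampleStreamWriter {
  def apply(configuration: Map[String, String]): ExampleStreamWriter =
    new ExampleStreamWriter(configuration("writer.parquet.destination.directory"))
}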

Files

  • ConfluentAvroKafkaStreamDecoder
  • TestConfluentAvroKafkaStreamDecoder
  • TestConfluentAvroKafkaStreamDecoderObject
  • CheckpointOffsetManager
  • etc.

List delimiter does not work for CommandLineIngestor

Using the CommandLineIngestor, a comma-delimited configuration value cannot be extracted as an array using getStringArray, because no list delimiter handler is set for the configuration (as opposed to the PropertiesIngestionDriver).

Fix it.
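
A minimal sketch of the missing piece, assuming Commons Configuration 2 is the configuration library in question.

import org.apache.commons.configuration2.BaseConfiguration
import org.apache.commons.configuration2.convert.DefaultListDelimiterHandler

// Setting a list delimiter handler lets getStringArray split comma-separated values.
val configuration = new BaseConfiguration()
configuration.setListDelimiterHandler(new DefaultListDelimiterHandler(','))
configuration.setProperty("writer.parquet.partitionby", "year,month,day")

val columns: Array[String] = configuration.getStringArray("writer.parquet.partitionby")
// columns == Array("year", "month", "day")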

Refactor TempDir

za.co.absa.hyperdrive.testutils.TempDir may be refactored easily with Files.createTempDir.

Consider using commons.io.TempDirectory from absa-commons instead of TempDir and Files.createTempDir.

Move dependencies to child poms

Currently, many dependencies are defined in the parent pom even though they are only used in one module. That unnecessarily bloats the jars of the other modules. Moreover, it's hard to track which module really needs a dependency.

Task

  • Dependencies should be declared in the child poms, unless the dependency is used throughout all submodules
  • Dependency management (version and scope) should still be done at the parent level.

ParquetStreamWriter shouldn't write if metadata is inconsistent

Problem description
When reading, Spark does not consider the metadata log if you read with a globbed path (e.g. /root-dir/*) or from a partitioned sub-directory (e.g. /root-dir/partition1=value1). Downstream applications are therefore at risk of reading duplicated values in case of application failures and restarts.
Two cases for inconsistent metadata logs can be distinguished

  1. Metadata log contains files that are not on the filesystem: Most likely, parquet files have been deleted / moved manually.
  2. Parquet files are present which are not in metadata log: Most likely, this is due to a previous partial write. The parquet files should be removed.

In case 1), Spark will throw a FileNotFoundException in the next write. However, in case 2), Spark does not throw any exception, because this case is not an error from Spark's perspective.

Proposed solution

  • In the ParquetStreamWriter, the metadata log should be inspected and compared with the filesystem before writing. If it is inconsistent, a warning should be logged listing the files to be deleted (a rough sketch of the check follows below).
  • No automatic cleanup is considered, because partial writes are assumed to occur only rarely (even more rarely with https://issues.apache.org/jira/browse/SPARK-27210) and more importantly, automatic cleanups could result in inadvertent deletions if the metadata log has been tampered with.
  • An option to skip the check should be added, in case the user knows what they are doing.

This solution guarantees deduplicated reads for globbed paths and partitioned subdirectory reads, but doesn't guarantee atomicity, i.e. partial writes will be visible to downstream applications, but they will not be duplicated by subsequent writes.
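
A rough sketch of the check for case 2; the real _spark_metadata log is JSON-based and would need proper parsing, so the committed file set is taken as an input here.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Case 2 from above: parquet files on the filesystem that are not in the metadata log.
def findUncommittedFiles(destinationDir: String, filesInMetadataLog: Set[String]): Set[String] = {
  val fs = FileSystem.get(new Configuration())
  val filesOnFilesystem = fs.listStatus(new Path(destinationDir))
    .map(_.getPath.toString)
    .filter(_.endsWith(".parquet"))
    .toSet
  filesOnFilesystem.diff(filesInMetadataLog)
}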

Other

  • After #82 has been implemented, the functionality could be extracted to a transformer instead of having a switch on/off flag for the ParquetStreamWriter

OffsetManager should not expect topic as argument

Description
Currently, the base class OffsetManager expects a topic as a constructor argument. This is too specific to the use case of the CheckpointOffsetManager, which currently requires the kafka topic (to be resolved in #85). In fact, this component does not need to manage offsets, but may manage anything that concerns both DataStreamReader and DataStreamWriter.

Tasks

  • OffsetManager should not require topic as a constructor argument.
  • OffsetManager should be renamed to StreamManager
  • The method OffsetManager.configureOffsets should be renamed to configure
  • Update archetype
  • Update readmes

How to migrate

  • Components implementing OffsetManager need to import StreamManager, rename configureOffsets to configure and remove the constructor argument.
  • No configuration properties need to be changed.

Property manager.checkpoint.base.location should contain complete path to checkpoint-location

Description
Currently, the checkpoint directory for a workflow is created as ${manager.checkpoint.base.location}/${reader.kafka.topic}. This makes no sense if the reader is not a kafka reader, but is e.g. reading from a JDBC source.

Now, the ingestor should assume that the complete path is stored in ${manager.checkpoint.base.location}. A complete path is also used in ${writer.parquet.destination.directory}.

The user has to make sure that the checkpoint path is unique among workflows.

Consequences
Currently, an exception is thrown if manager.checkpoint.base.location does not exist. It's impossible to keep this behavior, because startingOffsets is set to earliest if the resolved checkpoint location (base dir + topic) does not exist. After this PR, if manager.checkpoint.base.location is empty, no exception should be thrown; instead, startingOffsets should be set to earliest.
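
A minimal sketch of the proposed behaviour; the helper name is assumed.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.streaming.DataStreamReader

// Only fall back to earliest when the checkpoint location does not exist yet.
def configureStartingOffsets(reader: DataStreamReader, checkpointLocation: String): DataStreamReader = {
  val fs = FileSystem.get(new Configuration())
  if (fs.exists(new Path(checkpointLocation))) reader
  else reader.option("startingOffsets", "earliest")
}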

How to migrate
For all existing workflows, the property manager.checkpoint.base.location has to be appended with /${reader.kafka.topic}

Rename prefix for additional properties for KafkaStreamReader

Currently, additional properties for the KafkaStreamReader have to be specified with the prefix reader.options, e.g. reader.options.kafka.security.protocol or reader.options.kafka.ssl.key.password.

This prefix is inconsistent with all other properties which start with decoder.avro., writer.parquet., manager.checkpoint or transformer.columns.. The properties for the KafkaStreamReader should start with reader.kafka., i.e. reader.kafka.options.kafka.security.protocol.

For additional properties like reader.options.failOnDataLoss or reader.options.minPartitions, it's hard to tell which reader implementation they belong to.

Tasks

  • Change prefix for additional properties for KafkaStreamReader to reader.kafka.options.

How to migrate

  • All property keys starting with reader.options have to be replaced by reader.kafka.options

StreamWriter should not require destination as constructor argument

Description
Currently, the StreamWriter requires a destination as a constructor argument. However, writing to Kafka, for example, does not require a destination directory, but rather a topic.

Tasks

  • Remove the constructor argument destination from StreamWriter
  • Remove the default implementation of StreamWriter.getDestination and implement it in the derived classes.
  • Update archetype

Consequences

  • Currently, the destination directory is removed if the ingestion fails and the directory was empty before the ingestion. This functionality assumes that the getDestination method of the StreamWriter returns a path on HDFS. However, this cannot be guaranteed: getDestination may return any string, and it can't be assumed to signify a path.
    The functionality is not very useful in practice, since the destination folder is empty only at the very first ingestion. Therefore, this functionality will be removed.

How to migrate

  • External components implementing StreamWriter need to remove the destination argument from the call to the superclass constructor, i.e. replace extends StreamWriter(destination) with extends StreamWriter
  • No configuration properties need to be changed

Use prefix writer.parquet.option for extra configuration to the parquet writers

Currently, extra configuration to the parquet writers needs to be passed like this:

writer.parquet.extra.conf.1=key=value

key=value is split at the = sign, which is very confusing and unexpected.
Extra configuration should instead be added with a prefix:
writer.parquet.option.key=value

Unfortunately, this is inconsistent with reader.option.key=value, but it's consistent with all other configuration properties, which include an identifier for the component. Arguably, reader.option.key=value should be changed to reader.kafka.option.key=value, even though this results in properties like reader.kafka.option.kafka.security.protocol.
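
A minimal sketch of reading options under the proposed prefix, assuming Commons Configuration 2; the example option is illustrative.

import org.apache.commons.configuration2.BaseConfiguration
import scala.collection.JavaConverters._

// Collect all keys under writer.parquet.option. and strip the prefix.
val prefix = "writer.parquet.option"
val configuration = new BaseConfiguration()
configuration.setProperty(s"$prefix.compression", "snappy")

val extraOptions: Map[String, String] = configuration.getKeys(prefix).asScala
  .map(key => key.stripPrefix(s"$prefix.") -> configuration.getString(key))
  .toMap
// extraOptions == Map("compression" -> "snappy")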

Pass an instance of HyperdriveContext instead of accessing singleton object

In #114, HyperdriveContext was introduced to maintain state across components. To avoid a breaking change, it was introduced as a singleton object. However, this requires the state to be stored in a var or a mutable map.

Now, HyperdriveContext should be a class that is passed to the components (in the read, transform and write methods, etc.). Then, the state map can be an immutable val.
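
A minimal sketch of what the class could look like; the shape and the context key are assumed, not the actual implementation.

// An immutable context instance that components receive and return.
case class HyperdriveContext(values: Map[String, Any] = Map.empty) {
  def put(key: String, value: Any): HyperdriveContext = copy(values = values + (key -> value))
  def get[T](key: String): Option[T] = values.get(key).map(_.asInstanceOf[T])
}

// Usage: e.g. the decoder stores the key column names, the writer retrieves them.
val context = HyperdriveContext().put("kafka.key.columns", Seq("id"))
val keyColumns = context.get[Seq[String]]("kafka.key.columns")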

AllNullableParquetStreamWriter. AnalysisException: Queries with streaming sources must be executed with writeStream.start()

Hi @felipemmelo

I got this error when trying to ingest data with the AllNullableParquetStreamWriter

Exception in thread "main" za.co.absa.hyperdrive.shared.exceptions.IngestionStartException: NOT STARTED ingestion b4d8fd3d-43ec-4802-86b5-112d1aa62fb7. This exception was thrown during the starting of the ingestion job. Check the logs for details.
	at za.co.absa.hyperdrive.driver.SparkIngestor$.ingest(SparkIngestor.scala:104)
	at za.co.absa.hyperdrive.driver.IngestionDriver.ingest(IngestionDriver.scala:51)
	at za.co.absa.hyperdrive.driver.drivers.PropertiesIngestionDriver$.main(PropertiesIngestionDriver.scala:50)
	at za.co.absa.hyperdrive.driver.drivers.PropertiesIngestionDriver.main(PropertiesIngestionDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
	at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:389)
	at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:38)
	at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
	at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:36)
	at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:51)
	at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:62)
	at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:60)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3037)
	at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3035)
	at za.co.absa.hyperdrive.shared.utils.SparkUtils$.setAllColumnsNullable(SparkUtils.scala:26)
	at za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.AllNullableParquetStreamWriter.write(AllNullableParquetStreamWriter.scala:37)
	at za.co.absa.hyperdrive.driver.SparkIngestor$.ingest(SparkIngestor.scala:101)
	... 15 more
19/09/13 16:11:35 INFO SparkContext: Invoking stop() from shutdown hook

Ingestion.properties.template.txt

stacktrace.txt

I think the problem is that a new DataFrame is created in SparkUtils.setAllColumnsNullable instead of reusing the old one: the stack trace shows that setAllColumnsNullable calls Dataset.rdd, which is not allowed on a streaming DataFrame and triggers the batch-query check. We then have two dataframes, one associated with the reader and the other with the writer.

Publish all modules

Currently, the modules driver, ingestor-default, shared and testutils are not published to Maven, because their API should not be public and backwards-compatibility within a major version is not guaranteed.
The disadvantage is that the main executable jar is not published on Maven either. However, the main jar should be available for download without users having to build it themselves.

Both goals can be achieved by publishing all modules to Maven Central, but making all classes in driver, ingestor-default, shared and testutils package-private to za.co.absa.hyperdrive, which effectively marks them as a private API. Users may download the main jar and execute it as is, but will still be prevented from using the code (unless they create a package with the same name).

Tasks

  • From the mentioned modules, remove the following from the pom files
            <plugin>
                <groupId>org.sonatype.plugins</groupId>
                <artifactId>nexus-staging-maven-plugin</artifactId>
                <version>${nexus.staging.plugin.version}</version>
                <configuration>
                    <skipNexusStagingDeployMojo>${skip.internal.modules.deployment}</skipNexusStagingDeployMojo>
                </configuration>
            </plugin>
  • Make all classes in the mentioned modules package private to za.co.absa.hyperdrive

Add generic partitioning option to ParquetStreamWriter

Currently, it's not possible to write a dataframe partitioned by arbitrary columns. This feature should be added in this PR.

Tasks

  • Add a configuration option writer.parquet.partitionby. It should accept a comma-separated list of column names
  • If present, the AbstractParquetStreamWriter should call .partitionBy on the DataStreamWriter (see the sketch after this list)
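
A minimal sketch of the second task; the helper name is illustrative.

import org.apache.spark.sql.Row
import org.apache.spark.sql.streaming.DataStreamWriter

// Apply .partitionBy only when partition columns are configured.
def applyPartitioning(writer: DataStreamWriter[Row], partitionColumns: Seq[String]): DataStreamWriter[Row] =
  if (partitionColumns.nonEmpty) writer.partitionBy(partitionColumns: _*) else writer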

Related info
The ParquetPartitioningStreamWriter writes a dataframe partitioned by the current date and an auto-incrementing version number. However, this is very specialized logic that deserves a dedicated component with a less general name (maybe rename in a separate PR). In fact, with this PR, the ParquetPartitioningStreamWriter could be rewritten as a transformer, since it mainly adds two columns.

Create end to end test

Currently, there is no end-to-end test that covers the whole pipeline from the KafkaStreamReader to the parquet stream writers.

A test should be written that covers the whole pipeline.
