
A Kafka Serde that reads and writes records from and to Blob storage (S3, Azure, Google) transparently.

Home Page: https://medium.com/bakdata/processing-large-messages-with-kafka-streams-167a166ca38b

License: MIT License

azure-blob-storage deserialization google-cloud-storage kafka kafka-streams large-data s3 serde serialization simple-storage-service

kafka-large-message-serde's Introduction


kafka-large-message-serde

A Kafka Serde that transparently reads and writes records from and to blob storage such as Amazon S3, Azure Blob Storage, and Google Cloud Storage. Formerly known as kafka-s3-backed-serde.

Getting Started

Serde

You can add kafka-large-message-serde via Maven Central.

Gradle

implementation group: 'com.bakdata.kafka', name: 'large-message-serde', version: '2.0.0'

Maven

<dependency>
    <groupId>com.bakdata.kafka</groupId>
    <artifactId>large-message-serde</artifactId>
    <version>2.0.0</version>
</dependency>

For other build tools or versions, refer to the latest version in MvnRepository.

Make sure to also add Confluent Maven Repository to your build file.
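
In a Groovy Gradle build, this might look as follows (the URL shown is Confluent's public Maven repository; Maven users would add a corresponding <repository> entry to their pom.xml):

repositories {
    mavenCentral()
    maven {
        url 'https://packages.confluent.io/maven/'
    }
}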

Usage

You can use it from your Kafka Streams application like any other Serde:

final Serde<String> serde = new LargeMessageSerde<>();
serde.configure(Map.of(AbstractLargeMessageConfig.BASE_PATH_CONFIG, "s3://my-bucket/",
        LargeMessageSerdeConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class), false);

The following configuration options are available:

large.message.key.serde Key serde class to use. All serde configurations are also delegated to this serde.

  • Type: class
  • Default: org.apache.kafka.common.serialization.Serdes$ByteArraySerde
  • Importance: high

large.message.value.serde Value serde class to use. All serde configurations are also delegated to this serde.

  • Type: class
  • Default: org.apache.kafka.common.serialization.Serdes$ByteArraySerde
  • Importance: high

large.message.base.path Base path to store data. Must include the bucket and any prefix that should be used, e.g., s3://my-bucket/my/prefix/. Available protocols: s3, abs, gs.

  • Type: string
  • Default: ""
  • Importance: high

large.message.max.byte.size Maximum serialized message size in bytes before messages are stored on blob storage.

  • Type: int
  • Default: 1000000
  • Importance: medium

large.message.use.headers Enable if Kafka message headers should be used to distinguish blob storage backed messages. This is disabled by default for backwards compatibility but leads to increased memory usage. It is recommended to enable this option.

  • Type: boolean
  • Default: false
  • Importance: medium

large.message.accept.no.headers Enable if messages read without headers should be treated as non-backed messages. This allows enabling large message behavior for data that was serialized using the wrapped serializer.

  • Type: boolean
  • Default: false
  • Importance: medium

large.message.id.generator Class to use for generating unique object IDs. Available generators are: com.bakdata.kafka.RandomUUIDGenerator, com.bakdata.kafka.Sha256HashIdGenerator, com.bakdata.kafka.MurmurHashIdGenerator.

  • Type: class
  • Default: com.bakdata.kafka.RandomUUIDGenerator
  • Importance: medium

large.message.s3.access.key AWS access key to use for connecting to S3. Leave empty if AWS credential provider chain or STS Assume Role provider should be used.

  • Type: password
  • Default: ""
  • Importance: low

large.message.s3.secret.key AWS secret key to use for connecting to S3. Leave empty if AWS credential provider chain or STS Assume Role provider should be used.

  • Type: password
  • Default: ""
  • Importance: low

large.message.s3.sts.role.arn AWS STS role ARN to use for connecting to S3. Leave empty if AWS Basic provider or AWS credential provider chain should be used.

  • Type: string
  • Default: ""
  • Importance: low

large.message.s3.role.external.id AWS STS role external ID used when retrieving session credentials under an assumed role. Leave empty if AWS Basic provider or AWS credential provider chain should be used.

  • Type: string
  • Default: ""
  • Importance: low

large.message.s3.role.session.name AWS STS role session name to use when starting a session. Leave empty if AWS Basic provider or AWS credential provider chain should be used.

  • Type: string
  • Default: ""
  • Importance: low

large.message.s3.jwt.path Path to an OIDC token file in JSON format (JWT) used to authenticate before AWS STS role authorization, e.g., for EKS: /var/run/secrets/eks.amazonaws.com/serviceaccount/token.

  • Type: string
  • Default: ""
  • Importance: low

large.message.s3.region S3 region to use. Leave empty if default S3 region should be used.

  • Type: string
  • Default: ""
  • Importance: low

large.message.s3.endpoint Endpoint to use for connection to Amazon S3. Leave empty if default S3 endpoint should be used.

  • Type: string
  • Default: ""
  • Importance: low

large.message.abs.connection.string Azure connection string for connection to blob storage. Leave empty if Azure credential provider chain should be used.

  • Type: password
  • Default: ""
  • Importance: low

large.message.gs.key.path Google service account key JSON path. Leave empty if the environment variable GOOGLE_APPLICATION_CREDENTIALS is set or if you want to use the default service account provided by Compute Engine. For more information about authenticating as a service account, please refer to the main documentation.

  • Type: string
  • Default: ""
  • Importance: low

large.message.compression.type The compression type for data stored in blob storage. The default is none (i.e. no compression). Valid values are none, gzip, snappy, lz4 and zstd. Note: this option is only available when large.message.use.headers is enabled.

  • Type: string
  • Default: "none"
  • Importance: low
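
The options above can be combined in a single configuration map, as in the following sketch (bucket, prefix, size limit, and generator choice are placeholders, not taken from this documentation):

final Serde<String> serde = new LargeMessageSerde<>();
serde.configure(Map.of(
        AbstractLargeMessageConfig.BASE_PATH_CONFIG, "s3://my-bucket/my/prefix/",
        "large.message.max.byte.size", 500_000,        // offload serialized payloads above ~500 kB
        "large.message.use.headers", true,             // recommended: header-based detection of backed messages
        "large.message.compression.type", "gzip",      // only effective when use.headers is enabled
        "large.message.id.generator", "com.bakdata.kafka.Sha256HashIdGenerator",
        LargeMessageSerdeConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class), false);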

Kafka Connect

This serde also comes with support for Kafka Connect. You can add kafka-large-message-connect via Maven Central.

Gradle

implementation group: 'com.bakdata.kafka', name: 'large-message-connect', version: '2.0.0'

Maven

<dependency>
    <groupId>com.bakdata.kafka</groupId>
    <artifactId>large-message-connect</artifactId>
    <version>2.0.0</version>
</dependency>

For other build tools or versions, refer to the latest version in MvnRepository.

Usage

To use it with your Kafka Connect jobs, just configure your converter as com.bakdata.kafka.LargeMessageConverter.

In addition to the configurations available for the serde (except large.message.key.serde and large.message.value.serde), you can configure the following:

large.message.converter Converter to use. All converter configurations are also delegated to this converter.

  • Type: class
  • Default: org.apache.kafka.connect.converters.ByteArrayConverter
  • Importance: high

For general guidance on how to configure Kafka Connect converters, please have a look at the official documentation.
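
As a sketch (bucket, prefix, and the wrapped converter are placeholders, not taken from this README), a worker or connector configuration that applies the converter to record values might look like this:

value.converter=com.bakdata.kafka.LargeMessageConverter
value.converter.large.message.base.path=s3://my-bucket/my/prefix/
value.converter.large.message.max.byte.size=1000000
value.converter.large.message.use.headers=true
value.converter.large.message.converter=org.apache.kafka.connect.storage.StringConverter

Kafka Connect strips the value.converter. (or key.converter.) prefix before handing the remaining keys to the converter's configure method, which is how the large.message.* options documented above reach the converter.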

Cleaning up the bucket

We also provide a method for cleaning up all files on the blob storage associated with a topic:

final Map<String, Object> properties = ...;
final AbstractLargeMessageConfig config = new AbstractLargeMessageConfig(properties);
final LargeMessageStoringClient storer = config.getStorer();
storer.deleteAllFiles("topic");

Development

If you want to contribute to this project, you can simply clone the repository and build it via Gradle. All dependencies should be included in the Gradle files; there are no external prerequisites.

> git clone git@github.com:bakdata/kafka-large-message-serde.git
> cd kafka-large-message-serde && ./gradlew build

Please note that we have code styles for Java. They are basically the Google style guide with some minor modifications.

Contributing

We are happy if you want to contribute to this project. If you find any bugs or have suggestions for improvements, please open an issue. We are also happy to accept your PRs. Just open an issue beforehand and let us know what you want to do and why.

License

This project is licensed under the MIT license. Have a look at the LICENSE for more details.


kafka-large-message-serde's Issues

Can't build project

Hello, I'm trying to build this project but am getting the following error.

./gradlew build

> Task :s3-backed-serde:compileTestJava FAILED
/Users/ryan.tomczik/Dev/kafka-s3-backed-serde/s3-backed-serde/src/test/java/com/bakdata/kafka/S3BackedDeserializerTest.java:32: error: cannot access TestTopology
import com.bakdata.fluent_kafka_streams_tests.TestTopology;
                                             ^
  bad class file: /Users/ryan.tomczik/.gradle/caches/modules-2/files-2.1/com.bakdata.fluent-kafka-streams-tests/fluent-kafka-streams-tests/2.0.4/6884c7dff35da93f26a35cf405188cbdb009202b/fluent-kafka-streams-tests-2.0.4.jar(com/bakdata/fluent_kafka_streams_tests/TestTopology.class)
    class file has wrong version 55.0, should be 52.0
    Please remove or make sure it appears in the correct subdirectory of the classpath.

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':s3-backed-serde:compileTestJava'.
> Compilation failed; see the compiler error output for details.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 1s
17 actionable tasks: 1 executed, 16 up-to-date

I'm not too familiar with Gradle so I'm not sure if there is something I'm missing here.

Thanks,
Ryan

Kafka Connect REST API properties and Connect environment variables not being honored

I have created a connector with the following environment variables and config properties:

Environment Variables

  CONNECT_VALUE_CONVERTER: "com.bakdata.kafka.S3BackedConverter"
  CONNECT_KEY_CONVERTER: "com.bakdata.kafka.S3BackedConverter"
  CONNECT_S3BACKED_CONVERTER: "io.confluent.connect.avro.AvroConverter"
  CONNECT_S3BACKED_MAX_BYTE_SIZE: "2097150"

Rest API config

{
   "config": {
     ....
     "s3backed.sts.role.arn": "redacted",
     "s3backed.role.external.id": "s3BackedKafka",
     "s3backed.role.session.name": "default",
     "s3backed.region": "us-east-1",
     "s3backed.base.path": "redacted",
     "s3backed.max.byte.size:": 50,
     "s3backed.converter": "io.confluent.connect.avro.AvroConverter"

   },
   "name": "test-connector"
 }

Logs

[2021-04-27 19:52:01,801] INFO S3BackedConverterConfig values: 
	s3backed.sts.role.session.name = 
	s3backed.converter = class org.apache.kafka.connect.converters.ByteArrayConverter
	s3backed.endpoint = 
	s3backed.base.path = 
	s3backed.access.key = [hidden]
	s3backed.id.generator = class com.bakdata.kafka.RandomUUIDGenerator
	s3backed.sts.role.external.id = 
	s3backed.secret.key = [hidden]
	s3backed.max.byte.size = 1000000
	s3backed.sts.role.arn = 
	s3backed.region = 
	s3backed.path.style.access = false
 (com.bakdata.kafka.S3BackedConverterConfig)

As you can see, all the defaults are being used for the S3BackedConverterConfig.

Invalid value null for configuration key serializer exception

I'm trying out this library with the sample code provided here but I'm getting an exception.

Sample Code:

public class KafkaLargeMessageSerdeProcessor {

    public void processLargeMessageSerde(String filename, String topicName) throws IOException {

        final Properties props = new Properties();
        props.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "s3-backed-serde-app");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, S3BackedSerde.class);
        props.setProperty(AbstractS3BackedConfig.BASE_PATH_CONFIG, "s3://large-blob-data-test/");
        props.put(S3BackedSerdeConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);

        Producer<String, String> producer = new KafkaProducer<>(props);

        FileInputStream inputStream;
        try {
            inputStream = new FileInputStream(filename);
            byte[] buffer = new byte[1024 * 1024];
            int bytesRead;
            while ((bytesRead = inputStream.read(buffer)) != -1) {
                String chunk = new String(buffer, 0, bytesRead, StandardCharsets.UTF_8);
                ProducerRecord<String, String> record = new ProducerRecord<>(topicName, chunk);
                producer.send(record);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        try {
            producer.close();
            inputStream.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

    }
}

Error:

Exception in thread "main" org.apache.kafka.common.config.ConfigException: Invalid value null for configuration key.serializer: must be non-null.
	at org.apache.kafka.clients.producer.ProducerConfig.appendSerializerToConfig(ProducerConfig.java:579)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:290)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:317)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:302)
	at com.example.KafkaLargeMessageSerdeProcessor.processLargeMessageSerde(KafkaLargeMessageSerdeProcessor.java:29)
	at com.example.Main.main(Main.java:14)

I just want to know what I'm missing here.
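
The stack trace points at KafkaProducer rather than Kafka Streams: a plain producer requires key.serializer/value.serializer and does not read the StreamsConfig.DEFAULT_*_SERDE settings, which only KafkaStreams evaluates. A minimal sketch of handing the serde's serializer to the producer directly (using the current LargeMessageSerde class names rather than the older S3Backed* ones; bucket and bootstrap server are placeholders):

final Serde<String> valueSerde = new LargeMessageSerde<>();
valueSerde.configure(Map.of(
        AbstractLargeMessageConfig.BASE_PATH_CONFIG, "s3://large-blob-data-test/",
        LargeMessageSerdeConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class), false);

final Properties producerConfig = new Properties();
producerConfig.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

// Passing serializer instances to the constructor makes the key.serializer/value.serializer configs unnecessary.
final Producer<String, String> producer = new KafkaProducer<>(
        producerConfig, Serdes.String().serializer(), valueSerde.serializer());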

Optimize memory usage

At the moment the data is prefixed on Kafka with a "magic byte" that indicates whether the data is stored on Kafka or not. However, to be able to do so, a byte array needs to be allocated twice: once for the raw data, and once for the raw data plus the magic byte. Since this project is meant to process messages that are potentially over 100 MB, such an extra allocation can be painful.

What I want to suggest (and I'm happy to submit a PR for this):

  • Instead of prepending the message with a magic byte, use a header on Kafka to store this info.
  • Make this a configuration flag that is off by default to not break things.

An added bonus is that messages that are not sent to the large-message store can be read by any app or console consumer with the default serializer you use.
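
A hypothetical sketch of the proposal (the names are illustrative and do not reflect the library's actual implementation): the serializer records the "backed or not" flag in a Kafka record header and returns the payload unchanged, avoiding the second allocation:

// Illustrative only; not the library's real API.
static final String BACKED_HEADER = "large.message.backed";

byte[] markAndPassThrough(final Headers headers, final byte[] payload, final boolean isBacked) {
    // Store the flag in a record header instead of prepending a magic byte,
    // so the payload can be returned without copying it into a larger array.
    headers.add(BACKED_HEADER, new byte[]{isBacked ? (byte) 1 : (byte) 0});
    return payload;
}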
