
A Kafka Serde that reads and writes records from and to Blob storage (S3, Azure, Google) transparently.

Home Page: https://medium.com/bakdata/processing-large-messages-with-kafka-streams-167a166ca38b

License: MIT License

azure-blob-storage deserialization google-cloud-storage kafka kafka-streams large-data s3 serde serialization simple-storage-service

kafka-large-message-serde's Introduction


kafka-large-message-serde

A Kafka Serde that transparently reads and writes records from and to blob storage such as Amazon S3, Azure Blob Storage, and Google Cloud Storage. Formerly known as kafka-s3-backed-serde.

Getting Started

Serde

You can add kafka-large-message-serde via Maven Central.

Gradle

implementation group: 'com.bakdata.kafka', name: 'large-message-serde', version: '2.0.0'

Maven

<dependency>
    <groupId>com.bakdata.kafka</groupId>
    <artifactId>large-message-serde</artifactId>
    <version>2.0.0</version>
</dependency>

For other build tools or versions, refer to the latest version in MvnRepository.

Make sure to also add Confluent Maven Repository to your build file.
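
In a Groovy Gradle build, this might look as follows (the URL shown is Confluent's public Maven repository; Maven users would add a corresponding <repository> entry to their pom.xml):

repositories {
    mavenCentral()
    maven {
        url 'https://packages.confluent.io/maven/'
    }
}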

Usage

You can use it from your Kafka Streams application like any other Serde:

final Serde<String> serde = new LargeMessageSerde<>();
serde.configure(Map.of(AbstractLargeMessageConfig.BASE_PATH_CONFIG, "s3://my-bucket/",
        LargeMessageSerdeConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class), false);

The following configuration options are available:

large.message.key.serde Key serde class to use. All serde configurations are also delegated to this serde.

  • Type: class
  • Default: org.apache.kafka.common.serialization.Serdes$ByteArraySerde
  • Importance: high

large.message.value.serde Value serde class to use. All serde configurations are also delegated to this serde.

  • Type: class
  • Default: org.apache.kafka.common.serialization.Serdes$ByteArraySerde
  • Importance: high

large.message.base.path Base path to store data. Must include the bucket and any prefix that should be used, e.g., s3://my-bucket/my/prefix/. Available protocols: s3, abs, gs.

  • Type: string
  • Default: ""
  • Importance: high

large.message.max.byte.size Maximum serialized message size in bytes before messages are stored on blob storage.

  • Type: int
  • Default: 1000000
  • Importance: medium

large.message.use.headers Enable if Kafka message headers should be used to distinguish blob storage backed messages. This is disabled by default for backwards compatibility but leads to increased memory usage. It is recommended to enable this option.

  • Type: boolean
  • Default: false
  • Importance: medium

large.message.accept.no.headers Enable if messages read without headers should be treated as non-backed messages. This allows enabling large message behavior for data that was serialized using the wrapped serializer.

  • Type: boolean
  • Default: false
  • Importance: medium

large.message.id.generator Class to use for generating unique object IDs. Available generators are: com.bakdata.kafka.RandomUUIDGenerator, com.bakdata.kafka.Sha256HashIdGenerator, com.bakdata.kafka.MurmurHashIdGenerator.

  • Type: class
  • Default: com.bakdata.kafka.RandomUUIDGenerator
  • Importance: medium

large.message.s3.access.key AWS access key to use for connecting to S3. Leave empty if AWS credential provider chain or STS Assume Role provider should be used.

  • Type: password
  • Default: ""
  • Importance: low

large.message.s3.secret.key AWS secret key to use for connecting to S3. Leave empty if AWS credential provider chain or STS Assume Role provider should be used.

  • Type: password
  • Default: ""
  • Importance: low

large.message.s3.sts.role.arn AWS STS role ARN to use for connecting to S3. Leave empty if AWS Basic provider or AWS credential provider chain should be used.

  • Type: string
  • Default: ""
  • Importance: low

large.message.s3.role.external.id AWS STS role external ID used when retrieving session credentials under an assumed role. Leave empty if AWS Basic provider or AWS credential provider chain should be used.

  • Type: string
  • Default: ""
  • Importance: low

large.message.s3.role.session.name AWS STS role session name to use when starting a session. Leave empty if AWS Basic provider or AWS credential provider chain should be used.

  • Type: string
  • Default: ""
  • Importance: low

large.message.s3.jwt.path Path to an OIDC token file in JSON format (JWT) used to authenticate before AWS STS role authorization, e.g., for EKS: /var/run/secrets/eks.amazonaws.com/serviceaccount/token.

  • Type: string
  • Default: ""
  • Importance: low

large.message.s3.region S3 region to use. Leave empty if default S3 region should be used.

  • Type: string
  • Default: ""
  • Importance: low

large.message.s3.endpoint Endpoint to use for connection to Amazon S3. Leave empty if default S3 endpoint should be used.

  • Type: string
  • Default: ""
  • Importance: low

large.message.abs.connection.string Azure connection string for connection to blob storage. Leave empty if Azure credential provider chain should be used.

  • Type: password
  • Default: ""
  • Importance: low

large.message.gs.key.path Google service account key JSON path. Leave empty if the environment variable GOOGLE_APPLICATION_CREDENTIALS is set or if you want to use the default service account provided by Compute Engine. For more information about authenticating as a service account, please refer to the main documentation.

  • Type: string
  • Default: ""
  • Importance: low

large.message.compression.type The compression type for data stored in blob storage. The default is none (i.e. no compression). Valid values are none, gzip, snappy, lz4 and zstd. Note: this option is only available when large.message.use.headers is enabled.

  • Type: string
  • Default: "none"
  • Importance: low
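
The options above can be combined in a single configuration map, as in the following sketch (bucket, prefix, size limit, and generator choice are placeholders, not taken from this documentation):

final Serde<String> serde = new LargeMessageSerde<>();
serde.configure(Map.of(
        AbstractLargeMessageConfig.BASE_PATH_CONFIG, "s3://my-bucket/my/prefix/",
        "large.message.max.byte.size", 500_000,        // offload serialized payloads above ~500 kB
        "large.message.use.headers", true,             // recommended: header-based detection of backed messages
        "large.message.compression.type", "gzip",      // only effective when use.headers is enabled
        "large.message.id.generator", "com.bakdata.kafka.Sha256HashIdGenerator",
        LargeMessageSerdeConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class), false);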

Kafka Connect

This serde also comes with support for Kafka Connect. You can add kafka-large-message-connect via Maven Central.

Gradle

implementation group: 'com.bakdata.kafka', name: 'large-message-connect', version: '2.0.0'

Maven

<dependency>
    <groupId>com.bakdata.kafka</groupId>
    <artifactId>large-message-connect</artifactId>
    <version>2.0.0</version>
</dependency>

For other build tools or versions, refer to the latest version in MvnRepository.

Usage

To use it with your Kafka Connect jobs, just configure your converter as com.bakdata.kafka.LargeMessageConverter.

In addition to the configurations available for the serde (except large.message.key.serde and large.message.value.serde), you can configure the following:

large.message.converter Converter to use. All converter configurations are also delegated to this converter.

  • Type: class
  • Default: org.apache.kafka.connect.converters.ByteArrayConverter
  • Importance: high

For general guidance on how to configure Kafka Connect converters, please have a look at the official documentation.
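
As a sketch (bucket, prefix, and the wrapped converter are placeholders, not taken from this README), a worker or connector configuration that applies the converter to record values might look like this:

value.converter=com.bakdata.kafka.LargeMessageConverter
value.converter.large.message.base.path=s3://my-bucket/my/prefix/
value.converter.large.message.max.byte.size=1000000
value.converter.large.message.use.headers=true
value.converter.large.message.converter=org.apache.kafka.connect.storage.StringConverter

Kafka Connect strips the value.converter. (or key.converter.) prefix before handing the remaining keys to the converter's configure method, which is how the large.message.* options documented above reach the converter.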

Cleaning up the bucket

We also provide a method for cleaning up all files on the blob storage associated with a topic:

final Map<String, Object> properties = ...;
final AbstractLargeMessageConfig config = new AbstractLargeMessageConfig(properties);
final LargeMessageStoringClient storer = config.getStorer();
storer.deleteAllFiles("topic");

Development

If you want to contribute to this project, you can simply clone the repository and build it via Gradle. All dependencies should be included in the Gradle files; there are no external prerequisites.

> git clone git@github.com:bakdata/kafka-large-message-serde.git
> cd kafka-large-message-serde && ./gradlew build

Please note that we have code styles for Java. They are basically the Google style guide with some minor modifications.

Contributing

We are happy if you want to contribute to this project. If you find any bugs or have suggestions for improvements, please open an issue. We are also happy to accept your PRs. Just open an issue beforehand and let us know what you want to do and why.

License

This project is licensed under the MIT license. Have a look at the LICENSE for more details.


kafka-large-message-serde's Issues

Can't build project

Hello, I'm trying to build this project but am getting the following error.

./gradlew build

> Task :s3-backed-serde:compileTestJava FAILED
/Users/ryan.tomczik/Dev/kafka-s3-backed-serde/s3-backed-serde/src/test/java/com/bakdata/kafka/S3BackedDeserializerTest.java:32: error: cannot access TestTopology
import com.bakdata.fluent_kafka_streams_tests.TestTopology;
                                             ^
  bad class file: /Users/ryan.tomczik/.gradle/caches/modules-2/files-2.1/com.bakdata.fluent-kafka-streams-tests/fluent-kafka-streams-tests/2.0.4/6884c7dff35da93f26a35cf405188cbdb009202b/fluent-kafka-streams-tests-2.0.4.jar(com/bakdata/fluent_kafka_streams_tests/TestTopology.class)
    class file has wrong version 55.0, should be 52.0
    Please remove or make sure it appears in the correct subdirectory of the classpath.

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':s3-backed-serde:compileTestJava'.
> Compilation failed; see the compiler error output for details.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 1s
17 actionable tasks: 1 executed, 16 up-to-date

I'm not too familiar with Gradle so I'm not sure if there is something I'm missing here.

Thanks,
Ryan

Kafka Connect REST API properties and Connect environment variables not being honored

I have created a connector with the following environment variables and config properties:

Environment Variables

  CONNECT_VALUE_CONVERTER: "com.bakdata.kafka.S3BackedConverter"
  CONNECT_KEY_CONVERTER: "com.bakdata.kafka.S3BackedConverter"
  CONNECT_S3BACKED_CONVERTER: "io.confluent.connect.avro.AvroConverter"
  CONNECT_S3BACKED_MAX_BYTE_SIZE: "2097150"

Rest API config

{
   "config": {
     ....
     "s3backed.sts.role.arn": "redacted",
     "s3backed.role.external.id": "s3BackedKafka",
     "s3backed.role.session.name": "default",
     "s3backed.region": "us-east-1",
     "s3backed.base.path": "redacted",
     "s3backed.max.byte.size:": 50,
     "s3backed.converter": "io.confluent.connect.avro.AvroConverter"

   },
   "name": "test-connector"
 }

Logs

[2021-04-27 19:52:01,801] INFO S3BackedConverterConfig values: 
	s3backed.sts.role.session.name = 
	s3backed.converter = class org.apache.kafka.connect.converters.ByteArrayConverter
	s3backed.endpoint = 
	s3backed.base.path = 
	s3backed.access.key = [hidden]
	s3backed.id.generator = class com.bakdata.kafka.RandomUUIDGenerator
	s3backed.sts.role.external.id = 
	s3backed.secret.key = [hidden]
	s3backed.max.byte.size = 1000000
	s3backed.sts.role.arn = 
	s3backed.region = 
	s3backed.path.style.access = false
 (com.bakdata.kafka.S3BackedConverterConfig)

As you can see, all the defaults are being used for the S3BackedConverterConfig.

Invalid value null for configuration key serializer exception

I'm trying out this library with the sample code provided here but I'm getting an exception.

Sample Code:

public class KafkaLargeMessageSerdeProcessor {

    public void processLargeMessageSerde(String filename, String topicName) throws IOException {

        final Properties props = new Properties();
        props.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "s3-backed-serde-app");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, S3BackedSerde.class);
        props.setProperty(AbstractS3BackedConfig.BASE_PATH_CONFIG, "s3://large-blob-data-test/");
        props.put(S3BackedSerdeConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);

        Producer<String, String> producer = new KafkaProducer<>(props);

        FileInputStream inputStream;
        try {
            inputStream = new FileInputStream(filename);
            byte[] buffer = new byte[1024 * 1024];
            int bytesRead;
            while ((bytesRead = inputStream.read(buffer)) != -1) {
                String chunk = new String(buffer, 0, bytesRead, StandardCharsets.UTF_8);
                ProducerRecord<String, String> record = new ProducerRecord<>(topicName, chunk);
                producer.send(record);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        try {
            producer.close();
            inputStream.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

    }
}

Error:

Exception in thread "main" org.apache.kafka.common.config.ConfigException: Invalid value null for configuration key.serializer: must be non-null.
	at org.apache.kafka.clients.producer.ProducerConfig.appendSerializerToConfig(ProducerConfig.java:579)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:290)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:317)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:302)
	at com.example.KafkaLargeMessageSerdeProcessor.processLargeMessageSerde(KafkaLargeMessageSerdeProcessor.java:29)
	at com.example.Main.main(Main.java:14)

I just want to know what I'm missing here.
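
The stack trace points at KafkaProducer rather than Kafka Streams: a plain producer requires key.serializer/value.serializer and does not read the StreamsConfig.DEFAULT_*_SERDE settings, which only KafkaStreams evaluates. A minimal sketch of handing the serde's serializer to the producer directly (using the current LargeMessageSerde class names rather than the older S3Backed* ones; bucket and bootstrap server are placeholders):

final Serde<String> valueSerde = new LargeMessageSerde<>();
valueSerde.configure(Map.of(
        AbstractLargeMessageConfig.BASE_PATH_CONFIG, "s3://large-blob-data-test/",
        LargeMessageSerdeConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class), false);

final Properties producerConfig = new Properties();
producerConfig.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

// Passing serializer instances to the constructor makes the key.serializer/value.serializer configs unnecessary.
final Producer<String, String> producer = new KafkaProducer<>(
        producerConfig, Serdes.String().serializer(), valueSerde.serializer());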

Optimize memory usage

At the moment the data is prefixed on Kafka with a "magic byte" that indicates whether the data is stored on Kafka or not. However, to be able to do so, a byte array needs to be allocated twice: once for the raw data, and once for the raw data plus the magic byte. Since this project is meant to process messages that are potentially over 100 MB, such an extra allocation can be painful.

What I want to suggest (and I'm happy to submit a PR for this):

  • Instead of prepending the message with a magic byte, use a header on Kafka to store this info.
  • Make this a configuration flag that is off by default to not break things.

An added bonus is that messages that are not sent to the large-message store can be read by any app or console consumer with the default serializer you use.
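
A hypothetical sketch of the proposal (the names are illustrative and do not reflect the library's actual implementation): the serializer records the "backed or not" flag in a Kafka record header and returns the payload unchanged, avoiding the second allocation:

// Illustrative only; not the library's real API.
static final String BACKED_HEADER = "large.message.backed";

byte[] markAndPassThrough(final Headers headers, final byte[] payload, final boolean isBacked) {
    // Store the flag in a record header instead of prepending a magic byte,
    // so the payload can be returned without copying it into a larger array.
    headers.add(BACKED_HEADER, new byte[]{isBacked ? (byte) 1 : (byte) 0});
    return payload;
}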
