
bakdata / kafka-large-message-serde


A Kafka Serde that reads and writes records from and to Blob storage (S3, Azure, Google) transparently.

Home Page: https://medium.com/bakdata/processing-large-messages-with-kafka-streams-167a166ca38b

License: MIT License

Language: Java 100.00%

Topics: azure-blob-storage, deserialization, google-cloud-storage, kafka, kafka-streams, large-data, s3, serde, serialization, simple-storage-service

kafka-large-message-serde's Issues

Invalid value null for configuration key serializer exception

I'm trying out this library with the sample code provided here, but I'm getting an exception.

Sample Code:

public class KafkaLargeMessageSerdeProcessor {

    public void processLargeMessageSerde(String filename, String topicName) throws IOException {

        final Properties props = new Properties();
        props.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "s3-backed-serde-app");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, S3BackedSerde.class);
        props.setProperty(AbstractS3BackedConfig.BASE_PATH_CONFIG, "s3://large-blob-data-test/");
        props.put(S3BackedSerdeConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);

        Producer<String, String> producer = new KafkaProducer<>(props);

        FileInputStream inputStream;
        try {
            inputStream = new FileInputStream(filename);
            byte[] buffer = new byte[1024 * 1024];
            int bytesRead;
            while ((bytesRead = inputStream.read(buffer)) != -1) {
                String chunk = new String(buffer, 0, bytesRead, StandardCharsets.UTF_8);
                ProducerRecord<String, String> record = new ProducerRecord<>(topicName, chunk);
                producer.send(record);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        try {
            producer.close();
            inputStream.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

    }
}

Error:

Exception in thread "main" org.apache.kafka.common.config.ConfigException: Invalid value null for configuration key.serializer: must be non-null.
	at org.apache.kafka.clients.producer.ProducerConfig.appendSerializerToConfig(ProducerConfig.java:579)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:290)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:317)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:302)
	at com.example.KafkaLargeMessageSerdeProcessor.processLargeMessageSerde(KafkaLargeMessageSerdeProcessor.java:29)
	at com.example.Main.main(Main.java:14)

I just want to know what I'm missing here.
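
For context, the exception comes from ProducerConfig: a plain KafkaProducer does not read StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG / DEFAULT_VALUE_SERDE_CLASS_CONFIG (those only apply inside a Kafka Streams topology), so key.serializer and value.serializer end up null. A minimal sketch of wiring the serde into a plain producer, assuming S3BackedSerde implements the standard Serde interface and accepts the same configuration keys used above, could look like this:

// Sketch only: configure the serde by hand and pass its serializer to the producer,
// since a plain KafkaProducer ignores the StreamsConfig serde settings.
final Map<String, Object> serdeConfig = new HashMap<>();
serdeConfig.put(AbstractS3BackedConfig.BASE_PATH_CONFIG, "s3://large-blob-data-test/");
serdeConfig.put(S3BackedSerdeConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);

final Serde<String> valueSerde = new S3BackedSerde<>(); // assumed to implement Serde<String>
valueSerde.configure(serdeConfig, false); // false = configure as a value serde

final Properties producerProps = new Properties();
producerProps.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

// KafkaProducer accepts explicit serializers, so key.serializer/value.serializer
// no longer need to appear in the Properties at all.
final Producer<String, String> producer = new KafkaProducer<>(
        producerProps, Serdes.String().serializer(), valueSerde.serializer());

The snippet reuses the types from the sample above, plus java.util.Map/HashMap and org.apache.kafka.clients.producer.ProducerConfig.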

Kafka Connect Rest API properties and Connect Env Vars not being honored

I have created a connector with the following environment variables and config properties:

Environment Variables

  CONNECT_VALUE_CONVERTER: "com.bakdata.kafka.S3BackedConverter"
  CONNECT_KEY_CONVERTER: "com.bakdata.kafka.S3BackedConverter"
  CONNECT_S3BACKED_CONVERTER: "io.confluent.connect.avro.AvroConverter"
  CONNECT_S3BACKED_MAX_BYTE_SIZE: "2097150"

Rest API config

{
   "config": {
     ....
     "s3backed.sts.role.arn": "redacted",
     "s3backed.role.external.id": "s3BackedKafka",
     "s3backed.role.session.name": "default",
     "s3backed.region": "us-east-1",
     "s3backed.base.path": "redacted",
     "s3backed.max.byte.size:": 50,
     "s3backed.converter": "io.confluent.connect.avro.AvroConverter"

   },
   "name": "test-connector"
 }

Logs

[2021-04-27 19:52:01,801] INFO S3BackedConverterConfig values: 
	s3backed.sts.role.session.name = 
	s3backed.converter = class org.apache.kafka.connect.converters.ByteArrayConverter
	s3backed.endpoint = 
	s3backed.base.path = 
	s3backed.access.key = [hidden]
	s3backed.id.generator = class com.bakdata.kafka.RandomUUIDGenerator
	s3backed.sts.role.external.id = 
	s3backed.secret.key = [hidden]
	s3backed.max.byte.size = 1000000
	s3backed.sts.role.arn = 
	s3backed.region = 
	s3backed.path.style.access = false
 (com.bakdata.kafka.S3BackedConverterConfig)

As you can see, all the defaults are being used for the S3BackedConverterConfig.
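
This would be consistent with standard Kafka Connect behavior: a converter only receives properties that carry its prefix (key.converter. / value.converter.) in the worker or connector configuration, with the prefix stripped before they reach the converter's configure() call, and the Confluent Docker image derives worker properties from CONNECT_* environment variables the same way. Under that assumption (the prefixed keys below are illustrative, not taken from this project's documentation), the configuration above would look roughly like this:

  CONNECT_VALUE_CONVERTER: "com.bakdata.kafka.S3BackedConverter"
  CONNECT_KEY_CONVERTER: "com.bakdata.kafka.S3BackedConverter"
  CONNECT_VALUE_CONVERTER_S3BACKED_CONVERTER: "io.confluent.connect.avro.AvroConverter"
  CONNECT_VALUE_CONVERTER_S3BACKED_MAX_BYTE_SIZE: "2097150"

{
   "config": {
     "value.converter": "com.bakdata.kafka.S3BackedConverter",
     "value.converter.s3backed.base.path": "redacted",
     "value.converter.s3backed.region": "us-east-1",
     "value.converter.s3backed.max.byte.size": "50",
     "value.converter.s3backed.converter": "io.confluent.connect.avro.AvroConverter"
   },
   "name": "test-connector"
}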

Can't build project

Hello, I'm trying to build this project but am getting the following error.

./gradlew build

> Task :s3-backed-serde:compileTestJava FAILED
/Users/ryan.tomczik/Dev/kafka-s3-backed-serde/s3-backed-serde/src/test/java/com/bakdata/kafka/S3BackedDeserializerTest.java:32: error: cannot access TestTopology
import com.bakdata.fluent_kafka_streams_tests.TestTopology;
                                             ^
  bad class file: /Users/ryan.tomczik/.gradle/caches/modules-2/files-2.1/com.bakdata.fluent-kafka-streams-tests/fluent-kafka-streams-tests/2.0.4/6884c7dff35da93f26a35cf405188cbdb009202b/fluent-kafka-streams-tests-2.0.4.jar(com/bakdata/fluent_kafka_streams_tests/TestTopology.class)
    class file has wrong version 55.0, should be 52.0
    Please remove or make sure it appears in the correct subdirectory of the classpath.

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':s3-backed-serde:compileTestJava'.
> Compilation failed; see the compiler error output for details.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 1s
17 actionable tasks: 1 executed, 16 up-to-date

I'm not too familiar with Gradle, so I'm not sure if there is something I'm missing here.

Thanks,
Ryan
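
As context for this error: class file major version 55.0 corresponds to Java 11 and 52.0 to Java 8, so the fluent-kafka-streams-tests-2.0.4 jar was compiled for Java 11 while the build here is running on a Java 8 JDK. Building with JDK 11 or newer should resolve it, for example by pointing JAVA_HOME at a Java 11 installation or setting the standard (not project-specific) Gradle property in gradle.properties:

org.gradle.java.home=/path/to/jdk-11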

Optimize memory usage

At the moment, the data written to Kafka is prefixed with a "magic byte" that indicates whether the payload is stored inline on Kafka or in blob storage. However, to do so, a byte array has to be allocated twice: once for the raw data and once for the raw data plus the magic byte. Since this project targets messages that are potentially over 100 MB in size, such an extra allocation can be painful.

What I want to suggest (and I'm happy to submit a PR for this):

  • Instead of prepending the message with a magic byte, use a Kafka record header to store this info.
  • Put this behind a configuration flag that is off by default, so existing deployments don't break.

An added bonus is that messages which are not sent to the large-message store can be read by any app or console consumer with the default serializer you use. A rough sketch of the header-based approach follows.
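
Sketch of the proposed header-based flag (the class name, header name, and threshold handling below are made up for illustration and are not part of this library):

import java.util.Map;
import org.apache.kafka.common.header.Headers;
import org.apache.kafka.common.header.internals.RecordHeader;
import org.apache.kafka.common.serialization.Serializer;

public class HeaderFlaggedSerializer implements Serializer<byte[]> {
    // Hypothetical header name; the record payload itself is left untouched either way.
    private static final String BACKED_HEADER = "large-message-backed";
    private int maxByteSize = 1000000;

    @Override
    public void configure(final Map<String, ?> configs, final boolean isKey) {
        final Object size = configs.get("s3backed.max.byte.size");
        if (size != null) {
            this.maxByteSize = Integer.parseInt(size.toString());
        }
    }

    @Override
    public byte[] serialize(final String topic, final byte[] data) {
        // Without access to headers there is nowhere to put the flag,
        // so this path would have to keep the existing magic-byte scheme.
        return data;
    }

    @Override
    public byte[] serialize(final String topic, final Headers headers, final byte[] data) {
        final boolean backed = data != null && data.length > this.maxByteSize;
        headers.add(new RecordHeader(BACKED_HEADER, new byte[]{(byte) (backed ? 1 : 0)}));
        if (backed) {
            return this.uploadToBlobStorage(topic, data); // would return the stored object's URI bytes
        }
        // Inline case: the original array is reused as-is, no copy with a prepended magic byte.
        return data;
    }

    // Placeholder for the actual S3/Azure/GCS upload.
    private byte[] uploadToBlobStorage(final String topic, final byte[] data) {
        throw new UnsupportedOperationException("blob upload omitted in this sketch");
    }
}

The headers-aware serialize(topic, headers, data) overload has been available on org.apache.kafka.common.serialization.Serializer since Kafka 2.1, which is what makes this possible without changing the payload format for records kept on Kafka.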
