
bakdata / kafka-large-message-serde


A Kafka Serde that reads and writes records from and to Blob storage (S3, Azure, Google) transparently.

Home Page: https://medium.com/bakdata/processing-large-messages-with-kafka-streams-167a166ca38b

License: MIT License

Language: Java 100.00%

Topics: azure-blob-storage, deserialization, google-cloud-storage, kafka, kafka-streams, large-data, s3, serde, serialization, simple-storage-service

kafka-large-message-serde's Issues

Invalid value null for configuration key serializer exception

I'm trying out this library with the sample code provided here, but I'm getting an exception.

Sample Code:

public class KafkaLargeMessageSerdeProcessor {

    public void processLargeMessageSerde(String filename, String topicName) throws IOException {

        final Properties props = new Properties();
        props.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "s3-backed-serde-app");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, S3BackedSerde.class);
        props.setProperty(AbstractS3BackedConfig.BASE_PATH_CONFIG, "s3://large-blob-data-test/");
        props.put(S3BackedSerdeConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);

        Producer<String, String> producer = new KafkaProducer<>(props);

        FileInputStream inputStream;
        try {
            inputStream = new FileInputStream(filename);
            byte[] buffer = new byte[1024 * 1024];
            int bytesRead;
            while ((bytesRead = inputStream.read(buffer)) != -1) {
                String chunk = new String(buffer, 0, bytesRead, StandardCharsets.UTF_8);
                ProducerRecord<String, String> record = new ProducerRecord<>(topicName, chunk);
                producer.send(record);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        try {
            producer.close();
            inputStream.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

    }
}

Error:

Exception in thread "main" org.apache.kafka.common.config.ConfigException: Invalid value null for configuration key.serializer: must be non-null.
	at org.apache.kafka.clients.producer.ProducerConfig.appendSerializerToConfig(ProducerConfig.java:579)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:290)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:317)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:302)
	at com.example.KafkaLargeMessageSerdeProcessor.processLargeMessageSerde(KafkaLargeMessageSerdeProcessor.java:29)
	at com.example.Main.main(Main.java:14)

I just want to know what I'm missing here.
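
For context, the exception comes from ProducerConfig: a plain KafkaProducer does not read StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG / DEFAULT_VALUE_SERDE_CLASS_CONFIG (those only apply inside a Kafka Streams topology), so key.serializer and value.serializer end up null. A minimal sketch of wiring the serde into a plain producer, assuming S3BackedSerde implements the standard Serde interface and accepts the same configuration keys used above, could look like this:

// Sketch only: configure the serde by hand and pass its serializer to the producer,
// since a plain KafkaProducer ignores the StreamsConfig serde settings.
final Map<String, Object> serdeConfig = new HashMap<>();
serdeConfig.put(AbstractS3BackedConfig.BASE_PATH_CONFIG, "s3://large-blob-data-test/");
serdeConfig.put(S3BackedSerdeConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);

final Serde<String> valueSerde = new S3BackedSerde<>(); // assumed to implement Serde<String>
valueSerde.configure(serdeConfig, false); // false = configure as a value serde

final Properties producerProps = new Properties();
producerProps.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

// KafkaProducer accepts explicit serializers, so key.serializer/value.serializer
// no longer need to appear in the Properties at all.
final Producer<String, String> producer = new KafkaProducer<>(
        producerProps, Serdes.String().serializer(), valueSerde.serializer());

The snippet reuses the types from the sample above, plus java.util.Map/HashMap and org.apache.kafka.clients.producer.ProducerConfig.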

Kafka Connect Rest API properties and Connect Env Vars not being honored

I have created a connector with the following environment variables and config properties:

Environment Variables

  CONNECT_VALUE_CONVERTER: "com.bakdata.kafka.S3BackedConverter"
  CONNECT_KEY_CONVERTER: "com.bakdata.kafka.S3BackedConverter"
  CONNECT_S3BACKED_CONVERTER: "io.confluent.connect.avro.AvroConverter"
  CONNECT_S3BACKED_MAX_BYTE_SIZE: "2097150"

Rest API config

{
   "config": {
     ....
     "s3backed.sts.role.arn": "redacted",
     "s3backed.role.external.id": "s3BackedKafka",
     "s3backed.role.session.name": "default",
     "s3backed.region": "us-east-1",
     "s3backed.base.path": "redacted",
     "s3backed.max.byte.size:": 50,
     "s3backed.converter": "io.confluent.connect.avro.AvroConverter"

   },
   "name": "test-connector"
 }

Logs

[2021-04-27 19:52:01,801] INFO S3BackedConverterConfig values: 
	s3backed.sts.role.session.name = 
	s3backed.converter = class org.apache.kafka.connect.converters.ByteArrayConverter
	s3backed.endpoint = 
	s3backed.base.path = 
	s3backed.access.key = [hidden]
	s3backed.id.generator = class com.bakdata.kafka.RandomUUIDGenerator
	s3backed.sts.role.external.id = 
	s3backed.secret.key = [hidden]
	s3backed.max.byte.size = 1000000
	s3backed.sts.role.arn = 
	s3backed.region = 
	s3backed.path.style.access = false
 (com.bakdata.kafka.S3BackedConverterConfig)

As you can see, all the defaults are being used for the S3BackedConverterConfig.
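
This would be consistent with standard Kafka Connect behavior: a converter only receives properties that carry its prefix (key.converter. / value.converter.) in the worker or connector configuration, with the prefix stripped before they reach the converter's configure() call, and the Confluent Docker image derives worker properties from CONNECT_* environment variables the same way. Under that assumption (the prefixed keys below are illustrative, not taken from this project's documentation), the configuration above would look roughly like this:

  CONNECT_VALUE_CONVERTER: "com.bakdata.kafka.S3BackedConverter"
  CONNECT_KEY_CONVERTER: "com.bakdata.kafka.S3BackedConverter"
  CONNECT_VALUE_CONVERTER_S3BACKED_CONVERTER: "io.confluent.connect.avro.AvroConverter"
  CONNECT_VALUE_CONVERTER_S3BACKED_MAX_BYTE_SIZE: "2097150"

{
   "config": {
     "value.converter": "com.bakdata.kafka.S3BackedConverter",
     "value.converter.s3backed.base.path": "redacted",
     "value.converter.s3backed.region": "us-east-1",
     "value.converter.s3backed.max.byte.size": "50",
     "value.converter.s3backed.converter": "io.confluent.connect.avro.AvroConverter"
   },
   "name": "test-connector"
}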

Can't build project

Hello, I'm trying to build this project but am getting the following error.

./gradlew build

> Task :s3-backed-serde:compileTestJava FAILED
/Users/ryan.tomczik/Dev/kafka-s3-backed-serde/s3-backed-serde/src/test/java/com/bakdata/kafka/S3BackedDeserializerTest.java:32: error: cannot access TestTopology
import com.bakdata.fluent_kafka_streams_tests.TestTopology;
                                             ^
  bad class file: /Users/ryan.tomczik/.gradle/caches/modules-2/files-2.1/com.bakdata.fluent-kafka-streams-tests/fluent-kafka-streams-tests/2.0.4/6884c7dff35da93f26a35cf405188cbdb009202b/fluent-kafka-streams-tests-2.0.4.jar(com/bakdata/fluent_kafka_streams_tests/TestTopology.class)
    class file has wrong version 55.0, should be 52.0
    Please remove or make sure it appears in the correct subdirectory of the classpath.

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':s3-backed-serde:compileTestJava'.
> Compilation failed; see the compiler error output for details.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 1s
17 actionable tasks: 1 executed, 16 up-to-date

I'm not too familiar with Gradle, so I'm not sure if there is something I'm missing here.

Thanks,
Ryan
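
As context for this error: class file major version 55.0 corresponds to Java 11 and 52.0 to Java 8, so the fluent-kafka-streams-tests-2.0.4 jar was compiled for Java 11 while the build here is running on a Java 8 JDK. Building with JDK 11 or newer should resolve it, for example by pointing JAVA_HOME at a Java 11 installation or setting the standard (not project-specific) Gradle property in gradle.properties:

org.gradle.java.home=/path/to/jdk-11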

Optimize memory usage

At the moment, the data written to Kafka is prefixed with a "magic byte" that indicates whether the payload is stored inline on Kafka or in blob storage. However, to do so, a byte array has to be allocated twice: once for the raw data and once for the raw data plus the magic byte. Since this project targets messages that are potentially over 100 MB in size, such an extra allocation can be painful.

What I want to suggest (and I'm happy to submit a PR for this):

  • Instead of prepending the message with a magic byte, use a Kafka record header to store this info.
  • Put this behind a configuration flag that is off by default, so existing deployments don't break.

An added bonus is that messages which are not sent to the large-message store can be read by any app or console consumer with the default serializer you use. A rough sketch of the header-based approach follows.
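
Sketch of the proposed header-based flag (the class name, header name, and threshold handling below are made up for illustration and are not part of this library):

import java.util.Map;
import org.apache.kafka.common.header.Headers;
import org.apache.kafka.common.header.internals.RecordHeader;
import org.apache.kafka.common.serialization.Serializer;

public class HeaderFlaggedSerializer implements Serializer<byte[]> {
    // Hypothetical header name; the record payload itself is left untouched either way.
    private static final String BACKED_HEADER = "large-message-backed";
    private int maxByteSize = 1000000;

    @Override
    public void configure(final Map<String, ?> configs, final boolean isKey) {
        final Object size = configs.get("s3backed.max.byte.size");
        if (size != null) {
            this.maxByteSize = Integer.parseInt(size.toString());
        }
    }

    @Override
    public byte[] serialize(final String topic, final byte[] data) {
        // Without access to headers there is nowhere to put the flag,
        // so this path would have to keep the existing magic-byte scheme.
        return data;
    }

    @Override
    public byte[] serialize(final String topic, final Headers headers, final byte[] data) {
        final boolean backed = data != null && data.length > this.maxByteSize;
        headers.add(new RecordHeader(BACKED_HEADER, new byte[]{(byte) (backed ? 1 : 0)}));
        if (backed) {
            return this.uploadToBlobStorage(topic, data); // would return the stored object's URI bytes
        }
        // Inline case: the original array is reused as-is, no copy with a prepended magic byte.
        return data;
    }

    // Placeholder for the actual S3/Azure/GCS upload.
    private byte[] uploadToBlobStorage(final String topic, final byte[] data) {
        throw new UnsupportedOperationException("blob upload omitted in this sketch");
    }
}

The headers-aware serialize(topic, headers, data) overload has been available on org.apache.kafka.common.serialization.Serializer since Kafka 2.1, which is what makes this possible without changing the payload format for records kept on Kafka.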
