wix-incubator / kafka-connect-s3

A Kafka-Connect Sink for S3 with no Hadoop dependencies.
License: Other
Hi, I am running into an issue trying to use kafka-connect-s3.
I specified a local buffer directory with
local.buffer.dir=/tmp/connect-system-test
but the buffer files keep growing and are never automatically cleaned up.
Is there another setting I should specify?
Thanks
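I'm not aware of a connector setting that prunes the buffer directory (that is a guess about the connector's behavior, not something confirmed by its docs). As a workaround, a periodic cleanup pass over local.buffer.dir could delete files that have not been modified for a while. A minimal sketch, where the one-hour cutoff is an arbitrary choice:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class BufferDirCleaner {
    /** Deletes regular files in dir whose last-modified time is older than maxAgeMillis. */
    public static int deleteStaleFiles(Path dir, long maxAgeMillis) throws IOException {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        int deleted = 0;
        try (Stream<Path> files = Files.list(dir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(p) && p.toFile().lastModified() < cutoff
                        && Files.deleteIfExists(p)) {
                    deleted++;
                }
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("/tmp/connect-system-test");
        if (Files.isDirectory(dir)) {
            // Prune buffer files that have not been touched for an hour.
            int n = deleteStaleFiles(dir, 60 * 60 * 1000L);
            System.out.println("deleted " + n + " stale buffer files");
        }
    }
}
```

Note that deleting a chunk the sink is still writing would lose data, so the age threshold should comfortably exceed the flush interval.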
The connector produces tiny files on S3, in some cases one message per file.
I suspect it's because the file name is different in every batch, so the connector doesn't know that it should append to the previous file.
Once the jar is built locally on a Mac, with all the tests passing, how do I run this app? There are no specific instructions in the docs, and local-run.sh appears to be missing.
Sorry for the confusion: I am running this command: connect-standalone /usr/local/etc/kafka/connect-standalone.properties connect-s3-sink.properties
Unfortunately I get this error: Error: Could not find or load main class org.apache.kafka.connect.cli.ConnectStandalone
When I submit an S3 sink to my Kafka Connect cluster, the logs show an error about the S3 key /last_chunk_index.mytopic-00000.txt being missing.
Am I supposed to create this file in S3 manually, or is the sink supposed to create it?
kafka-connect-s3 currently includes information about Kafka partitions and offsets when formatting the S3 object name, correct?
My use case is more timestamp-oriented and less partition/offset-oriented. Could kafka-connect-s3 generalize the S3 object naming scheme, for example by allowing configuration of a class implementing an interface like String s3ObjectName(byte[] message)? That way, users could customize the S3 names to better suit their backup systems.
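The suggested extension point could look roughly like the sketch below. Both names (S3ObjectNamer, the s3.object.namer.class config key, and the example implementation) are hypothetical, not part of the connector:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical pluggable naming strategy; the connector would instantiate the
 *  implementing class named by a config key such as s3.object.namer.class. */
interface S3ObjectNamer {
    String s3ObjectName(byte[] message);
}

/** Example implementation that buckets objects by wall-clock hour (UTC). */
class HourlyTimestampNamer implements S3ObjectNamer {
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    @Override
    public String s3ObjectName(byte[] message) {
        // Ignores the message payload; a richer namer could parse a
        // timestamp out of the message bytes instead.
        return "backup/" + HOUR.format(Instant.now()) + "/chunk";
    }
}
```

One caveat with purely message-derived names: the connector also needs the partition/offset metadata somewhere (object name or object metadata) to recover its position after a restart.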
I'm attempting to test the Sink connector in the latest Kafka 0.9.0.1 framework. When attempting to launch a standalone connect worker task with the S3 sink configured (using the example *.properties files modified for my environment), the following exception is thrown at startup:
Exception in thread "WorkerSinkTask-s3-sink-0" java.lang.NoSuchFieldError: INSTANCE
at com.amazonaws.http.conn.SdkConnectionKeepAliveStrategy.getKeepAliveDuration(SdkConnectionKeepAliveStrategy.java:48)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:535)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:822)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:576)
at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:362)
at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:328)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:307)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3643)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1148)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1037)
at com.deviantart.kafka_connect_s3.S3Writer.fetchOffset(S3Writer.java:100)
at com.deviantart.kafka_connect_s3.S3SinkTask.recoverPartition(S3SinkTask.java:193)
at com.deviantart.kafka_connect_s3.S3SinkTask.recoverAssignment(S3SinkTask.java:181)
at com.deviantart.kafka_connect_s3.S3SinkTask.start(S3SinkTask.java:84)
at org.apache.kafka.connect.runtime.WorkerSinkTask.joinConsumerGroupAndStart(WorkerSinkTask.java:154)
at org.apache.kafka.connect.runtime.WorkerSinkTaskThread.execute(WorkerSinkTaskThread.java:54)
at org.apache.kafka.connect.util.ShutdownableThread.run(ShutdownableThread.java:82)
I suspect something fundamental in the AWS configuration, but all the CLI operations (aws s3 *) work fine against the bucket specified in the *.properties file.
NOTE: the bucket is private to my account ... not world-accessible.
I tried updating to the latest AWS artifact (1.10.43, upgraded from 1.10.37 in your latest check-in). No improvement.
This is because Hadoop/Spark systems cannot distribute a job when data is loaded from GZip files: GZip is not a 'splittable' format. In Spark, for example, after loading a GZip file one has to repartition the RDD to split it line by line.
This happens automatically with the bzip2 format.
S3 is a common data source for Hadoop/Spark jobs (a straightforward use case with AWS EMR), so bzip2 support would be essential. Other data ingestion tools, such as Apache Flume, support bzip2 compression.
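One shape bzip2 support could take is a pluggable compression seam between the chunk writer and the output stream. A minimal sketch, where the CompressionCodec interface is hypothetical; only the gzip implementation below uses the JDK, and a bzip2 codec would need a library such as Apache Commons Compress:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical seam for choosing the chunk compression format. */
interface CompressionCodec {
    String extension();                        // e.g. ".gz" or ".bz2"
    OutputStream wrap(OutputStream out) throws IOException;
}

/** Current behavior: gzip via the JDK. */
class GzipCodec implements CompressionCodec {
    public String extension() { return ".gz"; }
    public OutputStream wrap(OutputStream out) throws IOException {
        return new GZIPOutputStream(out);
    }
}

// A Bzip2Codec would wrap the stream in commons-compress's
// BZip2CompressorOutputStream and report the extension ".bz2".
```

The chunk writer would then only depend on the interface, and the codec could be selected from a config key.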
With my PR #19 I moved the version of Kafka to 0.10.1.1. But I found that my production Kafka cluster was running 0.10.0.1, which has a consumer API that is incompatible with client 0.10.1.1. (I opened an SO question about the issue.)
So at this point, the master branch has Kafka 0.10.1.1, which does not seem to work with any version of Kafka prior to 0.10.1.0.
There's currently no way to select the version of Kafka. Is there a standard mechanism for allowing this? Multiple pom.xml files, one for each version of Kafka?
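Rather than multiple pom.xml files, the conventional Maven answer is profiles: a single pom.xml holds the default Kafka version in a property, and a profile overrides it. A sketch (the profile id and property name are illustrative, not from this repo's pom.xml):

```xml
<properties>
  <kafka.version>0.10.1.1</kafka.version>
</properties>

<profiles>
  <profile>
    <id>kafka-0.10.0</id>
    <properties>
      <kafka.version>0.10.0.1</kafka.version>
    </properties>
  </profile>
</profiles>

<!-- Dependencies then reference the property: -->
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>${kafka.version}</version>
</dependency>
```

Building against the older broker would then be `mvn package -Pkafka-0.10.0`, though this only helps if the source compiles against both client APIs.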
I'm trying to run the plugin against Kafka 0.10.0.1 and I'm getting this error:
$ bin/connect-standalone.sh config/connect-standalone.properties example-connect-s3-sink.properties
[2016-10-03 17:30:26,809] INFO StandaloneConfig values:
cluster = connect
rest.advertised.host.name = null
task.shutdown.graceful.timeout.ms = 5000
rest.host.name = null
rest.advertised.port = null
bootstrap.servers = [localhost:9092]
offset.flush.timeout.ms = 5000
offset.flush.interval.ms = 10000
rest.port = 8083
internal.key.converter = class org.apache.kafka.connect.json.JsonConverter
access.control.allow.methods =
access.control.allow.origin =
offset.storage.file.filename = /tmp/connect.offsets
internal.value.converter = class org.apache.kafka.connect.json.JsonConverter
value.converter = class org.apache.kafka.connect.json.JsonConverter
key.converter = class org.apache.kafka.connect.json.JsonConverter
(org.apache.kafka.connect.runtime.standalone.StandaloneConfig:165)
[2016-10-03 17:30:26,935] INFO Logging initialized @345ms (org.eclipse.jetty.util.log:186)
[2016-10-03 17:30:27,113] INFO Kafka Connect starting (org.apache.kafka.connect.runtime.Connect:52)
[2016-10-03 17:30:27,114] INFO Herder starting (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:71)
[2016-10-03 17:30:27,114] INFO Worker starting (org.apache.kafka.connect.runtime.Worker:102)
[2016-10-03 17:30:27,120] INFO ProducerConfig values:
compression.type = none
metric.reporters = []
metadata.max.age.ms = 300000
metadata.fetch.timeout.ms = 60000
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
bootstrap.servers = [localhost:9092]
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
buffer.memory = 33554432
timeout.ms = 30000
key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.keystore.type = JKS
ssl.trustmanager.algorithm = PKIX
block.on.buffer.full = false
ssl.key.password = null
max.block.ms = 9223372036854775807
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
ssl.truststore.password = null
max.in.flight.requests.per.connection = 1
metrics.num.samples = 2
client.id =
ssl.endpoint.identification.algorithm = null
ssl.protocol = TLS
request.timeout.ms = 2147483647
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
acks = all
batch.size = 16384
ssl.keystore.location = null
receive.buffer.bytes = 32768
ssl.cipher.suites = null
ssl.truststore.type = JKS
security.protocol = PLAINTEXT
retries = 2147483647
max.request.size = 1048576
value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
ssl.truststore.location = null
ssl.keystore.password = null
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
send.buffer.bytes = 131072
linger.ms = 0
(org.apache.kafka.clients.producer.ProducerConfig:165)
[2016-10-03 17:30:27,140] INFO Kafka version : 0.9.0.1 (org.apache.kafka.common.utils.AppInfoParser:82)
[2016-10-03 17:30:27,140] INFO Kafka commitId : 23c69d62a0cabf06 (org.apache.kafka.common.utils.AppInfoParser:83)
[2016-10-03 17:30:27,141] INFO Starting FileOffsetBackingStore with file /tmp/connect.offsets (org.apache.kafka.connect.storage.FileOffsetBackingStore:60)
[2016-10-03 17:30:27,144] INFO Worker started (org.apache.kafka.connect.runtime.Worker:124)
[2016-10-03 17:30:27,144] INFO Herder started (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:73)
[2016-10-03 17:30:27,145] INFO Starting REST server (org.apache.kafka.connect.runtime.rest.RestServer:98)
[2016-10-03 17:30:27,235] INFO jetty-9.2.15.v20160210 (org.eclipse.jetty.server.Server:327)
Oct 03, 2016 5:30:28 PM org.glassfish.jersey.internal.Errors logErrors
WARNING: The following warnings have been detected: WARNING: The (sub)resource method listConnectors in org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains empty path annotation.
WARNING: The (sub)resource method createConnector in org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains empty path annotation.
WARNING: The (sub)resource method listConnectorPlugins in org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource contains empty path annotation.
WARNING: The (sub)resource method serverInfo in org.apache.kafka.connect.runtime.rest.resources.RootResource contains empty path annotation.
[2016-10-03 17:30:28,069] INFO Started o.e.j.s.ServletContextHandler@5829e4f4{/,null,AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:744)
[2016-10-03 17:30:28,079] INFO Started ServerConnector@655ef322{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:266)
[2016-10-03 17:30:28,079] INFO Started @1492ms (org.eclipse.jetty.server.Server:379)
[2016-10-03 17:30:28,081] INFO REST server listening at http://127.0.0.1:8083/, advertising URL http://127.0.0.1:8083/ (org.apache.kafka.connect.runtime.rest.RestServer:150)
[2016-10-03 17:30:28,081] INFO Kafka Connect started (org.apache.kafka.connect.runtime.Connect:58)
[2016-10-03 17:30:28,083] ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone:100)
java.lang.NoSuchMethodError: org.apache.kafka.common.config.ConfigDef.define(Ljava/lang/String;Lorg/apache/kafka/common/config/ConfigDef$Type;Lorg/apache/kafka/common/config/ConfigDef$Importance;Ljava/lang/String;Ljava/lang/String;ILorg/apache/kafka/common/config/ConfigDef$Width;Ljava/lang/String;)Lorg/apache/kafka/common/config/ConfigDef;
at org.apache.kafka.connect.runtime.ConnectorConfig.configDef(ConnectorConfig.java:64)
at org.apache.kafka.connect.runtime.ConnectorConfig.<init>(ConnectorConfig.java:75)
at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.startConnector(StandaloneHerder.java:246)
at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.putConnectorConfig(StandaloneHerder.java:164)
at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:94)
[2016-10-03 17:30:28,087] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:68)
[2016-10-03 17:30:28,088] INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer:154)
[2016-10-03 17:30:28,092] INFO Stopped ServerConnector@655ef322{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:306)
[2016-10-03 17:30:28,101] INFO Stopped o.e.j.s.ServletContextHandler@5829e4f4{/,null,UNAVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:865)
[2016-10-03 17:30:28,103] INFO REST server stopped (org.apache.kafka.connect.runtime.rest.RestServer:165)
[2016-10-03 17:30:28,103] INFO Herder stopping (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:77)
[2016-10-03 17:30:28,104] INFO Worker stopping (org.apache.kafka.connect.runtime.Worker:128)
[2016-10-03 17:30:28,104] WARN Shutting down tasks [] uncleanly; herder should have shut down tasks before the Worker is stopped. (org.apache.kafka.connect.runtime.Worker:141)
[2016-10-03 17:30:28,104] INFO Stopped FileOffsetBackingStore (org.apache.kafka.connect.storage.FileOffsetBackingStore:68)
[2016-10-03 17:30:28,104] INFO Worker stopped (org.apache.kafka.connect.runtime.Worker:151)
[2016-10-03 17:30:29,093] INFO Reflections took 1910 ms to scan 61 urls, producing 3338 keys and 24145 values (org.reflections.Reflections:229)
[2016-10-03 17:30:29,100] INFO Herder stopped (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:91)
[2016-10-03 17:30:29,100] INFO Kafka Connect stopped (org.apache.kafka.connect.runtime.Connect:73)
It seems that the org.apache.kafka.common.config.ConfigDef.define method overload is missing in this version.
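A NoSuchMethodError like this usually means a mismatched Kafka jar is earlier on the classpath than the one the code was compiled against. One way to see which jar a class is actually loaded from is reflection; the sketch below demonstrates the technique on java.lang.String (swap in "org.apache.kafka.common.config.ConfigDef" when running on a Connect classpath, since that class is only present there):

```java
import java.security.CodeSource;

public class WhichJar {
    public static void main(String[] args) throws ClassNotFoundException {
        // On a Connect worker, pass "org.apache.kafka.common.config.ConfigDef".
        String name = args.length > 0 ? args[0] : "java.lang.String";
        Class<?> c = Class.forName(name);
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // Bootstrap (JDK) classes report a null code source; application
        // classes report the jar or directory they were loaded from.
        System.out.println(c.getName() + " loaded from "
                + (src == null ? "the JDK runtime" : src.getLocation()));
    }
}
```

If the reported jar is an older kafka-clients than the one the connector was built against, that classpath entry is the culprit.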
If kafka-connect-s3 successfully backs up some messages to S3, what's a good way to replay these messages to a configured topic?
Does kafka-connect-s3 include a restoreFromS3(String bucket) or similar function? If not, could one be added?
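Not that I can see in the sources referenced above. The read side of a restore could be sketched as below; ChunkReplayer is a hypothetical name, the sketch assumes chunks are gzipped newline-delimited messages, and it leaves out the S3 download (AWS SDK getObject) and the KafkaProducer re-publish so that it stays self-contained:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

public class ChunkReplayer {
    /** Splits one gzipped chunk (as fetched from S3) back into individual
     *  messages, assuming one message per line. Feeding each returned line to
     *  a KafkaProducer targeting the desired topic would complete the replay. */
    public static List<String> readChunk(InputStream gzipped) throws IOException {
        List<String> messages = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                messages.add(line);
            }
        }
        return messages;
    }
}
```

Ordering within a partition would be preserved by replaying that partition's chunks in offset order, which the chunk names should allow.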