wix-incubator / kafka-connect-s3

A Kafka-Connect Sink for S3 with no Hadoop dependencies.
License: Other
Hi, I am running into an issue trying to use kafka-connect-s3.
I specified a local buffer directory with
local.buffer.dir=/tmp/connect-system-test
but the buffer files keep growing and are never automatically cleaned up.
Is there another setting I should specify?
Thanks
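I'm not aware of a connector setting that prunes the buffer directory (that is a guess about the connector's behavior, not something confirmed by its docs). As a workaround, a periodic cleanup pass over local.buffer.dir could delete files that have not been modified for a while. A minimal sketch, where the one-hour cutoff is an arbitrary choice:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class BufferDirCleaner {
    /** Deletes regular files in dir whose last-modified time is older than maxAgeMillis. */
    public static int deleteStaleFiles(Path dir, long maxAgeMillis) throws IOException {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        int deleted = 0;
        try (Stream<Path> files = Files.list(dir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(p) && p.toFile().lastModified() < cutoff
                        && Files.deleteIfExists(p)) {
                    deleted++;
                }
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("/tmp/connect-system-test");
        if (Files.isDirectory(dir)) {
            // Prune buffer files that have not been touched for an hour.
            int n = deleteStaleFiles(dir, 60 * 60 * 1000L);
            System.out.println("deleted " + n + " stale buffer files");
        }
    }
}
```

Note that deleting a chunk the sink is still writing would lose data, so the age threshold should comfortably exceed the flush interval.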
The connector produces tiny files on S3, in some cases one message per file.
I suspect it's because the file name is different in every batch, so the connector doesn't know that it should append to the previous file.
Once the jar is built locally on a Mac, with all the tests passing, how do I run this app? There are no specific instructions in the docs, and local-run.sh appears to be missing.
Sorry for the confusion: I am running this command: connect-standalone /usr/local/etc/kafka/connect-standalone.properties connect-s3-sink.properties
Unfortunately I get this error: Error: Could not find or load main class org.apache.kafka.connect.cli.ConnectStandalone
When I submit an S3 sink to my Kafka Connect cluster, the logs show an error about the S3 key /last_chunk_index.mytopic-00000.txt being missing.
Am I supposed to create this file in S3 manually, or is the sink supposed to create it?
kafka-connect-s3 currently includes information about Kafka partitions and offsets when formatting the S3 object name, correct?
My use case is more timestamp-oriented and less partition/offset-oriented. Could kafka-connect-s3 generalize the S3 object naming scheme, for example by allowing configuration of a class implementing an interface like String s3ObjectName(byte[] message)? That way, users could customize the S3 names to better suit their backup systems.
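The suggested extension point could look roughly like the sketch below. Both names (S3ObjectNamer, the s3.object.namer.class config key, and the example implementation) are hypothetical, not part of the connector:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical pluggable naming strategy; the connector would instantiate the
 *  implementing class named by a config key such as s3.object.namer.class. */
interface S3ObjectNamer {
    String s3ObjectName(byte[] message);
}

/** Example implementation that buckets objects by wall-clock hour (UTC). */
class HourlyTimestampNamer implements S3ObjectNamer {
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    @Override
    public String s3ObjectName(byte[] message) {
        // Ignores the message payload; a richer namer could parse a
        // timestamp out of the message bytes instead.
        return "backup/" + HOUR.format(Instant.now()) + "/chunk";
    }
}
```

One caveat with purely message-derived names: the connector also needs the partition/offset metadata somewhere (object name or object metadata) to recover its position after a restart.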
I'm attempting to test the Sink connector in the latest Kafka 0.9.0.1 framework. When attempting to launch a standalone connect worker task with the S3 sink configured (using the example *.properties files modified for my environment), the following exception is thrown at startup:
Exception in thread "WorkerSinkTask-s3-sink-0" java.lang.NoSuchFieldError: INSTANCE
at com.amazonaws.http.conn.SdkConnectionKeepAliveStrategy.getKeepAliveDuration(SdkConnectionKeepAliveStrategy.java:48)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:535)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:822)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:576)
at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:362)
at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:328)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:307)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3643)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1148)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1037)
at com.deviantart.kafka_connect_s3.S3Writer.fetchOffset(S3Writer.java:100)
at com.deviantart.kafka_connect_s3.S3SinkTask.recoverPartition(S3SinkTask.java:193)
at com.deviantart.kafka_connect_s3.S3SinkTask.recoverAssignment(S3SinkTask.java:181)
at com.deviantart.kafka_connect_s3.S3SinkTask.start(S3SinkTask.java:84)
at org.apache.kafka.connect.runtime.WorkerSinkTask.joinConsumerGroupAndStart(WorkerSinkTask.java:154)
at org.apache.kafka.connect.runtime.WorkerSinkTaskThread.execute(WorkerSinkTaskThread.java:54)
at org.apache.kafka.connect.util.ShutdownableThread.run(ShutdownableThread.java:82)
I suspect something fundamental in the AWS configuration, but all the CLI operations (aws s3 *) work fine against the bucket specified in the *.properties file.
NOTE: the bucket is private to my account ... not world-accessible.
I tried updating to the latest AWS artifact (1.10.43, upgraded from 1.10.37 in your latest check-in). No improvement.
This is because Hadoop/Spark systems cannot distribute a job when data is loaded from GZip files: GZip is not a 'splittable' format. In Spark, for example, after loading a GZip file one has to repartition the RDD to split it line by line.
This happens automatically with the bzip2 format.
S3 is a common data source for Hadoop/Spark jobs (a straightforward use case with AWS EMR), so bzip2 support would be essential. Other data ingestion tools, such as Apache Flume, support bzip2 compression.
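One shape bzip2 support could take is a pluggable compression seam between the chunk writer and the output stream. A minimal sketch, where the CompressionCodec interface is hypothetical; only the gzip implementation below uses the JDK, and a bzip2 codec would need a library such as Apache Commons Compress:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical seam for choosing the chunk compression format. */
interface CompressionCodec {
    String extension();                        // e.g. ".gz" or ".bz2"
    OutputStream wrap(OutputStream out) throws IOException;
}

/** Current behavior: gzip via the JDK. */
class GzipCodec implements CompressionCodec {
    public String extension() { return ".gz"; }
    public OutputStream wrap(OutputStream out) throws IOException {
        return new GZIPOutputStream(out);
    }
}

// A Bzip2Codec would wrap the stream in commons-compress's
// BZip2CompressorOutputStream and report the extension ".bz2".
```

The chunk writer would then only depend on the interface, and the codec could be selected from a config key.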
With my PR #19 I moved the version of Kafka to 0.10.1.1. But I found that my production Kafka cluster was running 0.10.0.1, which has a consumer API that is incompatible with client 0.10.1.1. (I opened an SO question about the issue.)
So at this point, the master branch has Kafka 0.10.1.1, which does not seem to work with any version of Kafka prior to 0.10.1.0.
There's currently no way to select the version of Kafka. Is there a standard mechanism for allowing this? Multiple pom.xml files, one for each version of Kafka?
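Rather than multiple pom.xml files, the conventional Maven answer is profiles: a single pom.xml holds the default Kafka version in a property, and a profile overrides it. A sketch (the profile id and property name are illustrative, not from this repo's pom.xml):

```xml
<properties>
  <kafka.version>0.10.1.1</kafka.version>
</properties>

<profiles>
  <profile>
    <id>kafka-0.10.0</id>
    <properties>
      <kafka.version>0.10.0.1</kafka.version>
    </properties>
  </profile>
</profiles>

<!-- Dependencies then reference the property: -->
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>${kafka.version}</version>
</dependency>
```

Building against the older broker would then be `mvn package -Pkafka-0.10.0`, though this only helps if the source compiles against both client APIs.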
I'm trying to run the plugin against Kafka 0.10.0.1 and I'm getting this error:
$ bin/connect-standalone.sh config/connect-standalone.properties example-connect-s3-sink.properties
[2016-10-03 17:30:26,809] INFO StandaloneConfig values:
cluster = connect
rest.advertised.host.name = null
task.shutdown.graceful.timeout.ms = 5000
rest.host.name = null
rest.advertised.port = null
bootstrap.servers = [localhost:9092]
offset.flush.timeout.ms = 5000
offset.flush.interval.ms = 10000
rest.port = 8083
internal.key.converter = class org.apache.kafka.connect.json.JsonConverter
access.control.allow.methods =
access.control.allow.origin =
offset.storage.file.filename = /tmp/connect.offsets
internal.value.converter = class org.apache.kafka.connect.json.JsonConverter
value.converter = class org.apache.kafka.connect.json.JsonConverter
key.converter = class org.apache.kafka.connect.json.JsonConverter
(org.apache.kafka.connect.runtime.standalone.StandaloneConfig:165)
[2016-10-03 17:30:26,935] INFO Logging initialized @345ms (org.eclipse.jetty.util.log:186)
[2016-10-03 17:30:27,113] INFO Kafka Connect starting (org.apache.kafka.connect.runtime.Connect:52)
[2016-10-03 17:30:27,114] INFO Herder starting (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:71)
[2016-10-03 17:30:27,114] INFO Worker starting (org.apache.kafka.connect.runtime.Worker:102)
[2016-10-03 17:30:27,120] INFO ProducerConfig values:
compression.type = none
metric.reporters = []
metadata.max.age.ms = 300000
metadata.fetch.timeout.ms = 60000
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
bootstrap.servers = [localhost:9092]
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
buffer.memory = 33554432
timeout.ms = 30000
key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.keystore.type = JKS
ssl.trustmanager.algorithm = PKIX
block.on.buffer.full = false
ssl.key.password = null
max.block.ms = 9223372036854775807
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
ssl.truststore.password = null
max.in.flight.requests.per.connection = 1
metrics.num.samples = 2
client.id =
ssl.endpoint.identification.algorithm = null
ssl.protocol = TLS
request.timeout.ms = 2147483647
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
acks = all
batch.size = 16384
ssl.keystore.location = null
receive.buffer.bytes = 32768
ssl.cipher.suites = null
ssl.truststore.type = JKS
security.protocol = PLAINTEXT
retries = 2147483647
max.request.size = 1048576
value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
ssl.truststore.location = null
ssl.keystore.password = null
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
send.buffer.bytes = 131072
linger.ms = 0
(org.apache.kafka.clients.producer.ProducerConfig:165)
[2016-10-03 17:30:27,140] INFO Kafka version : 0.9.0.1 (org.apache.kafka.common.utils.AppInfoParser:82)
[2016-10-03 17:30:27,140] INFO Kafka commitId : 23c69d62a0cabf06 (org.apache.kafka.common.utils.AppInfoParser:83)
[2016-10-03 17:30:27,141] INFO Starting FileOffsetBackingStore with file /tmp/connect.offsets (org.apache.kafka.connect.storage.FileOffsetBackingStore:60)
[2016-10-03 17:30:27,144] INFO Worker started (org.apache.kafka.connect.runtime.Worker:124)
[2016-10-03 17:30:27,144] INFO Herder started (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:73)
[2016-10-03 17:30:27,145] INFO Starting REST server (org.apache.kafka.connect.runtime.rest.RestServer:98)
[2016-10-03 17:30:27,235] INFO jetty-9.2.15.v20160210 (org.eclipse.jetty.server.Server:327)
Oct 03, 2016 5:30:28 PM org.glassfish.jersey.internal.Errors logErrors
WARNING: The following warnings have been detected: WARNING: The (sub)resource method listConnectors in org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains empty path annotation.
WARNING: The (sub)resource method createConnector in org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains empty path annotation.
WARNING: The (sub)resource method listConnectorPlugins in org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource contains empty path annotation.
WARNING: The (sub)resource method serverInfo in org.apache.kafka.connect.runtime.rest.resources.RootResource contains empty path annotation.
[2016-10-03 17:30:28,069] INFO Started o.e.j.s.ServletContextHandler@5829e4f4{/,null,AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:744)
[2016-10-03 17:30:28,079] INFO Started ServerConnector@655ef322{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:266)
[2016-10-03 17:30:28,079] INFO Started @1492ms (org.eclipse.jetty.server.Server:379)
[2016-10-03 17:30:28,081] INFO REST server listening at http://127.0.0.1:8083/, advertising URL http://127.0.0.1:8083/ (org.apache.kafka.connect.runtime.rest.RestServer:150)
[2016-10-03 17:30:28,081] INFO Kafka Connect started (org.apache.kafka.connect.runtime.Connect:58)
[2016-10-03 17:30:28,083] ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone:100)
java.lang.NoSuchMethodError: org.apache.kafka.common.config.ConfigDef.define(Ljava/lang/String;Lorg/apache/kafka/common/config/ConfigDef$Type;Lorg/apache/kafka/common/config/ConfigDef$Importance;Ljava/lang/String;Ljava/lang/String;ILorg/apache/kafka/common/config/ConfigDef$Width;Ljava/lang/String;)Lorg/apache/kafka/common/config/ConfigDef;
at org.apache.kafka.connect.runtime.ConnectorConfig.configDef(ConnectorConfig.java:64)
at org.apache.kafka.connect.runtime.ConnectorConfig.<init>(ConnectorConfig.java:75)
at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.startConnector(StandaloneHerder.java:246)
at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.putConnectorConfig(StandaloneHerder.java:164)
at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:94)
[2016-10-03 17:30:28,087] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:68)
[2016-10-03 17:30:28,088] INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer:154)
[2016-10-03 17:30:28,092] INFO Stopped ServerConnector@655ef322{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:306)
[2016-10-03 17:30:28,101] INFO Stopped o.e.j.s.ServletContextHandler@5829e4f4{/,null,UNAVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:865)
[2016-10-03 17:30:28,103] INFO REST server stopped (org.apache.kafka.connect.runtime.rest.RestServer:165)
[2016-10-03 17:30:28,103] INFO Herder stopping (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:77)
[2016-10-03 17:30:28,104] INFO Worker stopping (org.apache.kafka.connect.runtime.Worker:128)
[2016-10-03 17:30:28,104] WARN Shutting down tasks [] uncleanly; herder should have shut down tasks before the Worker is stopped. (org.apache.kafka.connect.runtime.Worker:141)
[2016-10-03 17:30:28,104] INFO Stopped FileOffsetBackingStore (org.apache.kafka.connect.storage.FileOffsetBackingStore:68)
[2016-10-03 17:30:28,104] INFO Worker stopped (org.apache.kafka.connect.runtime.Worker:151)
[2016-10-03 17:30:29,093] INFO Reflections took 1910 ms to scan 61 urls, producing 3338 keys and 24145 values (org.reflections.Reflections:229)
[2016-10-03 17:30:29,100] INFO Herder stopped (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:91)
[2016-10-03 17:30:29,100] INFO Kafka Connect stopped (org.apache.kafka.connect.runtime.Connect:73)
It seems that the org.apache.kafka.common.config.ConfigDef.define method overload is missing in this version.
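A NoSuchMethodError like this usually means a mismatched Kafka jar is earlier on the classpath than the one the code was compiled against. One way to see which jar a class is actually loaded from is reflection; the sketch below demonstrates the technique on java.lang.String (swap in "org.apache.kafka.common.config.ConfigDef" when running on a Connect classpath, since that class is only present there):

```java
import java.security.CodeSource;

public class WhichJar {
    public static void main(String[] args) throws ClassNotFoundException {
        // On a Connect worker, pass "org.apache.kafka.common.config.ConfigDef".
        String name = args.length > 0 ? args[0] : "java.lang.String";
        Class<?> c = Class.forName(name);
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // Bootstrap (JDK) classes report a null code source; application
        // classes report the jar or directory they were loaded from.
        System.out.println(c.getName() + " loaded from "
                + (src == null ? "the JDK runtime" : src.getLocation()));
    }
}
```

If the reported jar is an older kafka-clients than the one the connector was built against, that classpath entry is the culprit.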
If kafka-connect-s3 successfully backs up some messages to S3, what's a good way to replay these messages to a configured topic?
Does kafka-connect-s3 include a restoreFromS3(String bucket) or similar function? If not, could one be added?
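Not that I can see in the sources referenced above. The read side of a restore could be sketched as below; ChunkReplayer is a hypothetical name, the sketch assumes chunks are gzipped newline-delimited messages, and it leaves out the S3 download (AWS SDK getObject) and the KafkaProducer re-publish so that it stays self-contained:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

public class ChunkReplayer {
    /** Splits one gzipped chunk (as fetched from S3) back into individual
     *  messages, assuming one message per line. Feeding each returned line to
     *  a KafkaProducer targeting the desired topic would complete the replay. */
    public static List<String> readChunk(InputStream gzipped) throws IOException {
        List<String> messages = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                messages.add(line);
            }
        }
        return messages;
    }
}
```

Ordering within a partition would be preserved by replaying that partition's chunks in offset order, which the chunk names should allow.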