
Kafka Connect S3 Sink

This is a kafka-connect sink for Amazon S3, without any dependency on HDFS/Hadoop libraries or data formats.

Status

This is pre-production code. Use at your own risk.

That said, we've put some effort into a reasonably thorough test suite and will be putting it into production shortly. We will update this notice when we have it running smoothly in production.

If you use it, you will likely find issues or ways it can be improved. Please feel free to create pull requests/issues and we will do our best to merge anything that helps overall (especially if it has passing tests ;)).

This was built against Kafka 0.10.1.1.

Block-GZIP Output Format

For now there is just one output format: essentially a gzipped text file with one Kafka message per line.

It's actually a little more sophisticated than that though. We exploit a property of GZIP whereby multiple GZIP encoded files can be concatenated to produce a single file. Such a concatenated file is a valid GZIP file in its own right and will be decompressed by any GZIP implementation to a single stream of lines -- exactly as if the input files were concatenated first and compressed together.
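This concatenation property is easy to verify. A minimal sketch (the record contents are hypothetical, not the connector's actual output):

```python
import gzip

# Compress two "files" independently, then concatenate the raw bytes.
part1 = gzip.compress(b"record-1\nrecord-2\n")
part2 = gzip.compress(b"record-3\n")
combined = part1 + part2

# The concatenation is itself a valid gzip file: any gzip implementation
# decompresses it to the single combined stream of lines.
assert gzip.decompress(combined) == b"record-1\nrecord-2\nrecord-3\n"
```

This is exactly the property the Block-GZIP format relies on: each block is an independent gzip member, yet the whole file decompresses as one stream.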

Rationale

This allows us to keep chunk files large which is good for the most common batch/ETL workloads, while enabling relatively efficient access if needed.

If we sized each file at, say, 64MB, then operating on a week's data might require downloading tens or hundreds of thousands of separate S3 objects, which has non-trivial overhead compared to a few hundred. It also costs more to list bucket contents when you have millions of objects for a relatively short time period, since S3 list operations can only return 1000 keys at once. So we wanted chunks around 1GB in size for most cases.

But in some rarer cases, we might want to resume processing from a specific offset in the middle of a block, or pull just a few records at a specific offset. Rather than pull the whole 1GB chunk each time we need that, this scheme allows us to quickly find a much smaller chunk to satisfy the read.

We use a chunk threshold of 64MB by default although it's configurable. Keep in mind this is 64MB of uncompressed input records, so the actual block within the file is likely to be significantly smaller depending on how compressible your data is.

Also note that while we don't anticipate random reads being common, this format is very little extra code and virtually no size/performance overhead compared with just GZIPing big chunks so it's a neat option to have.

Example

We output two files per chunk:

$ tree system_test/data/connect-system-test/systest/2016-02-24/
system_test/data/connect-system-test/systest/2016-02-24/
├── system-test-00000-000000000000.gz
├── system-test-00000-000000000000.index.json
├── system-test-00000-000000000100.gz
├── system-test-00000-000000000100.index.json
├── system-test-00000-000000000200.gz
├── system-test-00000-000000000200.index.json
├── system-test-00001-000000000000.gz
├── system-test-00001-000000000000.index.json
├── system-test-00002-000000000000.gz
└── system-test-00002-000000000000.index.json
  • the actual *.gz file, which can be read and treated as a plain old gzip file on its own
  • a *.index.json file, which describes the byte positions of each "block" inside the file

If you don't care about seeking to specific offsets efficiently then ignore the index files and just use the *.gz as if it's a plain old gzip file.

Note the file name format is: <topic name>-<zero-padded partition>-<zero-padded offset of first record>.*. That implies that if you exceed 10k partitions in a topic or a trillion messages in a single partition, the files will no longer sort naturally. In practice that is probably not a big deal, since we prefix keys with the upload date too to make listing recent files easier. Making the padding length configurable is an option; the current lengths mostly make things simpler to eyeball at low numbers, where powers of ten change fast anyway.
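As a sketch, the naming scheme can be parsed like this (a hypothetical helper, not part of the connector; the regex simply encodes the padding widths visible in the example listing above):

```python
import re

# <topic>-<5-digit partition>-<12-digit first record offset>.gz
NAME_RE = re.compile(r"^(?P<topic>.+)-(?P<partition>\d{5})-(?P<offset>\d{12})\.gz$")

def parse_chunk_name(name):
    """Split a chunk file name into (topic, partition, first record offset)."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError("not a chunk file name: %s" % name)
    return m.group("topic"), int(m.group("partition")), int(m.group("offset"))

print(parse_chunk_name("system-test-00000-000000000100.gz"))
# ('system-test', 0, 100)
```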

If you want to have somewhat efficient seeking to particular offset though, you can do it like this:

  • List bucket contents and locate the chunk that the offset is in
  • Download the *.index.json file for that chunk; it looks something like this (note these are artificially small chunks):
$ cat system-test-00000-000000000000.index.json | jq -M '.'
{
  "chunks": [
    {
      "byte_length_uncompressed": 2890,
      "num_records": 100,
      "byte_length": 275,
      "byte_offset": 0,
      "first_record_offset": 0
    },
    {
      "byte_length_uncompressed": 3121,
      "num_records": 123,
      "byte_length": 325,
      "byte_offset": 275,
      "first_record_offset": 100
    },
    ...
  ]
}
  • Iterate through the "chunks" described in the index. Each has a first_record_offset and num_records so you can work out if the offset you want to find is in that chunk.
  • first_record_offset is the absolute kafka topic-partition offset of the first message in the chunk. Hence the first chunk in the index will always have the same first_record_offset as the offset in the file name - 0 in this case.
  • When you've found the correct chunk, use the byte_offset and byte_length fields to make a range request to S3 to download only the block you care about.
  • Depending on your needs, you can either limit to just the single block, or if you want to consume all records after that offset, consume from that offset right to the end of the file.
  • The range request bytes can be decompressed as a GZIP file on their own with any GZIP compatible tool, provided you limit to whole block boundaries.
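The steps above can be sketched as follows. find_chunk() operates on the index format shown earlier; fetch_block() assumes an S3 client with a boto3-style get_object API and real credentials, so it is illustrative only (bucket and key names would be yours):

```python
def find_chunk(index, target_offset):
    """Return the chunk dict whose record range contains target_offset, or None."""
    for chunk in index["chunks"]:
        first = chunk["first_record_offset"]
        if first <= target_offset < first + chunk["num_records"]:
            return chunk
    return None

def fetch_block(s3_client, bucket, key, chunk):
    """Download just one block via an HTTP Range request.

    The returned bytes are a self-contained gzip stream, because the
    range is aligned to whole block boundaries.
    """
    start = chunk["byte_offset"]
    end = start + chunk["byte_length"] - 1  # HTTP Range header is inclusive
    resp = s3_client.get_object(Bucket=bucket, Key=key,
                                Range="bytes=%d-%d" % (start, end))
    return resp["Body"].read()

# Using the sample index shown above:
index = {"chunks": [
    {"first_record_offset": 0, "num_records": 100,
     "byte_offset": 0, "byte_length": 275},
    {"first_record_offset": 100, "num_records": 123,
     "byte_offset": 275, "byte_length": 325},
]}
print(find_chunk(index, 150)["byte_offset"])  # 275
```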

Other Formats

For now we only support Block-GZIP output. This assumes that all your kafka messages can be output as newline-delimited text files.

We could make the output format pluggable if others have use for this connector but need binary serialisation formats like Avro/Thrift/Protobuf etc. Pull requests welcome.

Build and Run

You should be able to build this with mvn package. Once the jar is generated in the target folder, include it in your CLASSPATH (e.g. for Mac users, export CLASSPATH=.:$CLASSPATH:/fullpath/to/kafka-connect-s3-jar).

Run: bin/connect-standalone.sh example-connect-worker.properties example-connect-s3-sink.properties (from the root directory of the project; make sure you have Kafka on your PATH, or else give the full path to the Kafka bin directory).

There is a script, local-run.sh, which you can inspect to see how to get it running. This script relies on having a local Kafka instance set up as described in the testing section below.

Configuration

In addition to the standard kafka-connect config options we support/require the following, in the task properties file or REST config:

| Config Key | Default | Notes |
|---|---|---|
| s3.bucket | REQUIRED | The name of the bucket to write to. |
| local.buffer.dir | REQUIRED | Directory to store buffered data in. Must exist. |
| s3.prefix | "" | Prefix added to all object keys stored in bucket to "namespace" them. |
| s3.endpoint | AWS defaults per region | Mostly useful for testing. |
| s3.path_style | false | Force path-style access to bucket rather than subdomain. Mostly useful for tests. |
| compressed_block_size | 67108864 | How much uncompressed data to write to the file before we roll to a new block/chunk. See the Block-GZIP section above. |

Note that we use the default AWS SDK credentials provider. Refer to their docs for the options for configuring S3 credentials.
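A minimal sink properties fragment using these keys might look like this (bucket name and paths are placeholders; see the repo's example-connect-s3-sink.properties for the standard kafka-connect keys such as name, connector.class, and topics):

```properties
# Hypothetical values for illustration only
s3.bucket=my-backup-bucket
s3.prefix=kafka/
local.buffer.dir=/var/lib/kafka-connect-s3/buffer
compressed_block_size=67108864
```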

Testing

Most of the custom logic for handling output formatting and managing S3 has reasonably thorough mocked unit tests. There are probably improvements that can be made, but the logic is not especially complex.

There is also a basic system test to validate the integration with kafka-connect. This is not complete, nor is it 100% deterministic, due to the vagaries of multiple systems with non-deterministic things like timers affecting behaviour.

But it does consistently pass when run by hand on my Mac and validates basic operation of:

  • Initialisation and consuming/flushing all expected data
  • Resuming correctly on restart based on S3 state not relying on local disk state
  • Reconfiguring partitions of a topic and correctly resuming each

It doesn't test distributed mode operation yet, however the above is enough to exercise all of the integration points with the kafka-connect runtime.

System Test Setup

See the README in the system_test dir for details on setting up dependencies and environment to run the tests.

kafka-connect-s3's People

Contributors: banks, devsaik, markroddy, robindaugherty

kafka-connect-s3's Issues

Uncaught exception for AWS S3 connection

I'm attempting to test the sink connector in the latest Kafka 0.9.0.1 framework. When attempting to launch a standalone connect worker task with the S3 sink configured in it (using the example*.properties files modified for my environment), the following exception is thrown at startup:

Exception in thread "WorkerSinkTask-s3-sink-0" java.lang.NoSuchFieldError: INSTANCE
at com.amazonaws.http.conn.SdkConnectionKeepAliveStrategy.getKeepAliveDuration(SdkConnectionKeepAliveStrategy.java:48)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:535)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:822)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:576)
at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:362)
at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:328)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:307)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3643)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1148)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1037)
at com.deviantart.kafka_connect_s3.S3Writer.fetchOffset(S3Writer.java:100)
at com.deviantart.kafka_connect_s3.S3SinkTask.recoverPartition(S3SinkTask.java:193)
at com.deviantart.kafka_connect_s3.S3SinkTask.recoverAssignment(S3SinkTask.java:181)
at com.deviantart.kafka_connect_s3.S3SinkTask.start(S3SinkTask.java:84)
at org.apache.kafka.connect.runtime.WorkerSinkTask.joinConsumerGroupAndStart(WorkerSinkTask.java:154)
at org.apache.kafka.connect.runtime.WorkerSinkTaskThread.execute(WorkerSinkTaskThread.java:54)
at org.apache.kafka.connect.util.ShutdownableThread.run(ShutdownableThread.java:82)

I suspect something fundamental in the AWS configuration, but all the CLI options (aws s3 *) are working fine against the bucket specified in the *.properties file.
NOTE: the bucket is private to my account ... not world-accessible.

I tried to update to the latest AWS artifact (1.10.43 ... upgraded from 1.10.37 in your latest checkin). No improvement.

More flexible S3 object naming scheme

kafka-connect-s3 currently includes information about Kafka partitions and offsets when formatting the S3 object name, correct?

My use case is more timestamp-oriented and less partition/offset-oriented. Could kafka-connect-s3 generalize the S3 object naming schema, such as allowing configuration to specify a class implementing an interface like String s3ObjectName(byte[] message)? That way, users could customize the S3 name to suit their backup system needs better.

the local buffer does not automatically clean up

Hi, I am running into an issue trying to use kafka-connect-s3.

I specified a local buffer directory with
local.buffer.dir=/tmp/connect-system-test
but those files keep growing and do not get automatically cleaned up.

Is there any other setting I should specify?
Thanks

Restore from s3 backup?

If kafka-connect-s3 successfully backs up some messages to S3, what's a good way to replay these messages to a configured topic?

Does kafka-connect-s3 include a restoreFromS3(String bucket) or similar function? If not, could one be added?

compressed_block_size not working

The connector produces tiny files on S3, in some cases one message per file.
I suspect it's because the file name in every batch is different, so the connector doesn't know that it should append to the previous file.

Add bzip2 support

This is because Hadoop/Spark systems cannot distribute a job when data is loaded from GZip files: GZip is not a 'splittable' format. For example in Spark, after loading a GZip file one has to repartition the RDD to split it line by line. This happens automatically with the bzip2 format.

S3 is a common data source for Hadoop/Spark jobs (a straightforward use case with AWS EMR), so having bzip2 support would be essential. Other data ingestion tools like Apache Flume support bzip2 compression.

AmazonS3Exception: The specified key does not exist

When I submit an s3 sink to my Kafka Connect cluster, the logs show an error about the S3 key /last_chunk_index.mytopic-00000.txt being missing.

Am I supposed to manually create this file in S3, or is the sink supposed to create this file?

Kafka 0.10.0.0+ support

I'm trying to run the plugin against Kafka 0.10.0.1 and I'm getting this error:

$ bin/connect-standalone.sh config/connect-standalone.properties example-connect-s3-sink.properties
[2016-10-03 17:30:26,809] INFO StandaloneConfig values: 
    cluster = connect
    rest.advertised.host.name = null
    task.shutdown.graceful.timeout.ms = 5000
    rest.host.name = null
    rest.advertised.port = null
    bootstrap.servers = [localhost:9092]
    offset.flush.timeout.ms = 5000
    offset.flush.interval.ms = 10000
    rest.port = 8083
    internal.key.converter = class org.apache.kafka.connect.json.JsonConverter
    access.control.allow.methods = 
    access.control.allow.origin = 
    offset.storage.file.filename = /tmp/connect.offsets
    internal.value.converter = class org.apache.kafka.connect.json.JsonConverter
    value.converter = class org.apache.kafka.connect.json.JsonConverter
    key.converter = class org.apache.kafka.connect.json.JsonConverter
 (org.apache.kafka.connect.runtime.standalone.StandaloneConfig:165)
[2016-10-03 17:30:26,935] INFO Logging initialized @345ms (org.eclipse.jetty.util.log:186)
[2016-10-03 17:30:27,113] INFO Kafka Connect starting (org.apache.kafka.connect.runtime.Connect:52)
[2016-10-03 17:30:27,114] INFO Herder starting (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:71)
[2016-10-03 17:30:27,114] INFO Worker starting (org.apache.kafka.connect.runtime.Worker:102)
[2016-10-03 17:30:27,120] INFO ProducerConfig values: 
    compression.type = none
    metric.reporters = []
    metadata.max.age.ms = 300000
    metadata.fetch.timeout.ms = 60000
    reconnect.backoff.ms = 50
    sasl.kerberos.ticket.renew.window.factor = 0.8
    bootstrap.servers = [localhost:9092]
    retry.backoff.ms = 100
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    buffer.memory = 33554432
    timeout.ms = 30000
    key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    ssl.keystore.type = JKS
    ssl.trustmanager.algorithm = PKIX
    block.on.buffer.full = false
    ssl.key.password = null
    max.block.ms = 9223372036854775807
    sasl.kerberos.min.time.before.relogin = 60000
    connections.max.idle.ms = 540000
    ssl.truststore.password = null
    max.in.flight.requests.per.connection = 1
    metrics.num.samples = 2
    client.id = 
    ssl.endpoint.identification.algorithm = null
    ssl.protocol = TLS
    request.timeout.ms = 2147483647
    ssl.provider = null
    ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
    acks = all
    batch.size = 16384
    ssl.keystore.location = null
    receive.buffer.bytes = 32768
    ssl.cipher.suites = null
    ssl.truststore.type = JKS
    security.protocol = PLAINTEXT
    retries = 2147483647
    max.request.size = 1048576
    value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
    ssl.truststore.location = null
    ssl.keystore.password = null
    ssl.keymanager.algorithm = SunX509
    metrics.sample.window.ms = 30000
    partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
    send.buffer.bytes = 131072
    linger.ms = 0
 (org.apache.kafka.clients.producer.ProducerConfig:165)
[2016-10-03 17:30:27,140] INFO Kafka version : 0.9.0.1 (org.apache.kafka.common.utils.AppInfoParser:82)
[2016-10-03 17:30:27,140] INFO Kafka commitId : 23c69d62a0cabf06 (org.apache.kafka.common.utils.AppInfoParser:83)
[2016-10-03 17:30:27,141] INFO Starting FileOffsetBackingStore with file /tmp/connect.offsets (org.apache.kafka.connect.storage.FileOffsetBackingStore:60)
[2016-10-03 17:30:27,144] INFO Worker started (org.apache.kafka.connect.runtime.Worker:124)
[2016-10-03 17:30:27,144] INFO Herder started (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:73)
[2016-10-03 17:30:27,145] INFO Starting REST server (org.apache.kafka.connect.runtime.rest.RestServer:98)
[2016-10-03 17:30:27,235] INFO jetty-9.2.15.v20160210 (org.eclipse.jetty.server.Server:327)
Oct 03, 2016 5:30:28 PM org.glassfish.jersey.internal.Errors logErrors
WARNING: The following warnings have been detected: WARNING: The (sub)resource method listConnectors in org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains empty path annotation.
WARNING: The (sub)resource method createConnector in org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains empty path annotation.
WARNING: The (sub)resource method listConnectorPlugins in org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource contains empty path annotation.
WARNING: The (sub)resource method serverInfo in org.apache.kafka.connect.runtime.rest.resources.RootResource contains empty path annotation.

[2016-10-03 17:30:28,069] INFO Started o.e.j.s.ServletContextHandler@5829e4f4{/,null,AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:744)
[2016-10-03 17:30:28,079] INFO Started ServerConnector@655ef322{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:266)
[2016-10-03 17:30:28,079] INFO Started @1492ms (org.eclipse.jetty.server.Server:379)
[2016-10-03 17:30:28,081] INFO REST server listening at http://127.0.0.1:8083/, advertising URL http://127.0.0.1:8083/ (org.apache.kafka.connect.runtime.rest.RestServer:150)
[2016-10-03 17:30:28,081] INFO Kafka Connect started (org.apache.kafka.connect.runtime.Connect:58)
[2016-10-03 17:30:28,083] ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone:100)
java.lang.NoSuchMethodError: org.apache.kafka.common.config.ConfigDef.define(Ljava/lang/String;Lorg/apache/kafka/common/config/ConfigDef$Type;Lorg/apache/kafka/common/config/ConfigDef$Importance;Ljava/lang/String;Ljava/lang/String;ILorg/apache/kafka/common/config/ConfigDef$Width;Ljava/lang/String;)Lorg/apache/kafka/common/config/ConfigDef;
    at org.apache.kafka.connect.runtime.ConnectorConfig.configDef(ConnectorConfig.java:64)
    at org.apache.kafka.connect.runtime.ConnectorConfig.<init>(ConnectorConfig.java:75)
    at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.startConnector(StandaloneHerder.java:246)
    at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.putConnectorConfig(StandaloneHerder.java:164)
    at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:94)
[2016-10-03 17:30:28,087] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:68)
[2016-10-03 17:30:28,088] INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer:154)
[2016-10-03 17:30:28,092] INFO Stopped ServerConnector@655ef322{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:306)
[2016-10-03 17:30:28,101] INFO Stopped o.e.j.s.ServletContextHandler@5829e4f4{/,null,UNAVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:865)
[2016-10-03 17:30:28,103] INFO REST server stopped (org.apache.kafka.connect.runtime.rest.RestServer:165)
[2016-10-03 17:30:28,103] INFO Herder stopping (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:77)
[2016-10-03 17:30:28,104] INFO Worker stopping (org.apache.kafka.connect.runtime.Worker:128)
[2016-10-03 17:30:28,104] WARN Shutting down tasks [] uncleanly; herder should have shut down tasks before the Worker is stopped. (org.apache.kafka.connect.runtime.Worker:141)
[2016-10-03 17:30:28,104] INFO Stopped FileOffsetBackingStore (org.apache.kafka.connect.storage.FileOffsetBackingStore:68)
[2016-10-03 17:30:28,104] INFO Worker stopped (org.apache.kafka.connect.runtime.Worker:151)
[2016-10-03 17:30:29,093] INFO Reflections took 1910 ms to scan 61 urls, producing 3338 keys and 24145 values  (org.reflections.Reflections:229)
[2016-10-03 17:30:29,100] INFO Herder stopped (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:91)
[2016-10-03 17:30:29,100] INFO Kafka Connect stopped (org.apache.kafka.connect.runtime.Connect:73)

It seems that the org.apache.kafka.common.config.ConfigDef.define method is missing in the new version.

Kafka 0.10.0 and 0.10.1 are protocol-incompatible

With my PR #19 I moved the version of Kafka to 0.10.1.1. But I found that my production Kafka cluster was running 0.10.0.1, which has a consumer API that is incompatible with client 0.10.1.1. (I opened an SO question about the issue.)

So at this point, the master branch has Kafka 0.10.1.1, which does not seem to work with any version of Kafka prior to 0.10.1.0.

There's currently no way to allow selection of the version of Kafka. Might there be a standard mechanism for allowing this? Multiple pom.xml files, one for each version of Kafka?

How do I run generated jar..?

Once the jar is built locally on Mac, with all the tests passing, how do I run this app? There are no instructions specific to it in the docs. As far as I can see, local-run.sh is missing.

Sorry for the confusion: I am running this command: connect-standalone /usr/local/etc/kafka/connect-standalone.properties connect-s3-sink.properties
Unfortunately I get this error: Error: Could not find or load main class org.apache.kafka.connect.cli.ConnectStandalone
