This project forked from onyx-platform/onyx-kafka

Onyx plugin for Kafka

License: Eclipse Public License 1.0

onyx-kafka

Onyx plugin providing read and write facilities for Kafka. This plugin automatically discovers broker locations from ZooKeeper and updates the consumers when there is a broker failover.

This plugin version is only compatible with Kafka 0.9+. Please use onyx-kafka-0.8 with Kafka 0.8.

Installation

In your project file:

[org.onyxplatform/onyx-kafka "0.9.15.0"]

In your peer boot-up namespace:

(:require [onyx.plugin.kafka])

Functions

read-messages

Reads segments from a Kafka topic. Peers are automatically assigned to the topic's partitions unless :kafka/partition is supplied, in which case only that partition is read. Use :onyx/min-peers and :onyx/max-peers to fix the number of peers for the task to the number of partitions the task reads.

NOTE: The :done sentinel (i.e. batch processing) is not supported when more than one partition is auto-assigned, that is, when the topic has more than one partition and :kafka/partition is not set. An exception is thrown if a :done is read in this situation.
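To illustrate the fixed-partition case described above, the catalog entry below is a minimal sketch that pins the task to a single partition, which is the setup in which the :done sentinel can be used. The topic name, group id, partition number and deserializer keyword are placeholders, and :kafka/partition is written as a string only because the attribute list gives its type as string.

```clojure
;; Sketch only: topic, group id, partition and deserializer are
;; hypothetical placeholders, not values required by the plugin.
{:onyx/name :read-messages
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :kafka/topic "my-topic"
 :kafka/partition "0"            ; read only this partition
 :kafka/group-id "onyx-consumer"
 :kafka/zookeeper "127.0.0.1:2181"
 :kafka/offset-reset :earliest
 :kafka/force-reset? true
 :kafka/deserializer-fn :my.ns/deserializer-fn
 :onyx/min-peers 1               ; one peer for the one partition read
 :onyx/max-peers 1
 :onyx/batch-size 100
 :onyx/doc "Reads messages from one partition of a Kafka topic"}
```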

Catalog entry:

{:onyx/name :read-messages
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :kafka/topic "my-topic"
 :kafka/group-id "onyx-consumer"
 :kafka/receive-buffer-bytes 65536
 :kafka/zookeeper "127.0.0.1:2181"
 :kafka/offset-reset :earliest
 :kafka/force-reset? true
 :kafka/commit-interval 500
 :kafka/deserializer-fn :my.ns/deserializer-fn
 :kafka/wrap-with-metadata? false
 ;; :kafka/start-offsets {p1 offset1, p2 offset2}
 :onyx/batch-timeout 50
 :onyx/min-peers <<NUMBER-OF-PARTITIONS>>
 :onyx/max-peers <<NUMBER-OF-PARTITIONS>>
 :onyx/batch-size 100
 :onyx/doc "Reads messages from a Kafka topic"}

Lifecycle entry:

{:lifecycle/task :read-messages
 :lifecycle/calls :onyx.plugin.kafka/read-messages-calls}

Attributes

| key | type | default | description |
|-----|------|---------|-------------|
| :kafka/topic | string | | The topic name to connect to |
| :kafka/partition | string | | Optional: partition to read from if auto-assignment is not used |
| :kafka/group-id | string | | The consumer identity to store in ZooKeeper |
| :kafka/zookeeper | string | | The ZooKeeper connection string |
| :kafka/offset-reset | keyword | | Offset bound to seek to when not found - :earliest or :latest |
| :kafka/force-reset? | boolean | | Force reading from the beginning or end of the log, as specified by :kafka/offset-reset. If false, reads from the last acknowledged message if it exists |
| :kafka/receive-buffer-bytes | integer | 65536 | The size of the receive buffer in the Kafka consumer |
| :kafka/commit-interval | integer | 2000 | The interval in milliseconds at which the latest acknowledged offset is committed to ZooKeeper |
| :kafka/deserializer-fn | keyword | | A keyword that represents a fully qualified namespaced function to deserialize a message. Takes one argument - a byte array |
| :kafka/wrap-with-metadata? | boolean | false | Wraps message into map with keys :offset, :partitions, :topic and :message itself |
| :kafka/start-offsets | map | | Allows a task to be supplied with the starting offsets for all partitions. Maps partition to offset, e.g. {0 50, 1 90} will start at offset 50 for partition 0, and offset 90 for partition 1 |
| :kafka/consumer-opts | map | | A map of arbitrary configuration to merge into the underlying Kafka consumer base configuration. Map should contain keywords as keys, and the valid values described in the Kafka Docs. Please note that keys such as fetch.min.bytes must be in keyword form, i.e. :fetch.min.bytes |

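As a rough illustration of the optional attributes, the sketch below defines a deserializer of the shape described (one argument, a byte array) and a map of options that could be merged into a read-messages catalog entry. It assumes, purely for the example, that message values are UTF-8 encoded EDN strings; the namespace my.ns, the offsets and the consumer setting are placeholders.

```clojure
(ns my.ns
  (:require [clojure.edn :as edn]))

;; Hypothetical deserializer: assumes each message value is a UTF-8
;; encoded EDN string. Receives the raw byte array of the message.
(defn deserializer-fn
  [^bytes message-bytes]
  (edn/read-string (String. message-bytes "UTF-8")))

;; Optional attributes that could be merged into a read-messages
;; catalog entry. All values here are illustrative.
(def read-options
  {:kafka/deserializer-fn :my.ns/deserializer-fn
   ;; partition -> starting offset
   :kafka/start-offsets {0 50, 1 90}
   ;; Kafka consumer settings use keyword keys; the value is given as a
   ;; string here, which Kafka's configuration parsing accepts
   :kafka/consumer-opts {:fetch.min.bytes "1024"}})
```

The catalog entry refers to the function by keyword (:my.ns/deserializer-fn), so the namespace that defines it needs to be loaded on the peers.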
write-messages

Writes segments to a Kafka topic using the Kafka "new" producer.

Catalog entry:

{:onyx/name :write-messages
 :onyx/plugin :onyx.plugin.kafka/write-messages
 :onyx/type :output
 :onyx/medium :kafka
 :kafka/topic "topic"
 :kafka/zookeeper "127.0.0.1:2181"
 :kafka/serializer-fn :my.ns/serializer-fn
 :kafka/request-size 307200
 :onyx/batch-size batch-size
 :onyx/doc "Writes messages to a Kafka topic"}

Lifecycle entry:

{:lifecycle/task :write-messages
 :lifecycle/calls :onyx.plugin.kafka/write-messages-calls}

Segments supplied to a :onyx.plugin.kafka/write-messages task should be in the following form: {:message message-body}, with optional :key, :partition and :topic values.

{:message message-body
 :key optional-key
 :partition optional-partition
 :topic optional-topic}
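
For illustration, a function upstream of the output task might shape its results into this form. The sketch below assumes a hypothetical event map with a :user-id field and uses it as the message key. Which key type the producer accepts (string, bytes, or something else) is not specified here, so the call to str is an assumption.

```clojure
;; Hypothetical upstream function: wraps an event in the segment shape
;; expected by a write-messages task. :user-id is an assumed field.
(defn ->kafka-segment
  [event]
  {:message event
   :key (str (:user-id event))})

(->kafka-segment {:user-id 42 :action "login"})
;; => {:message {:user-id 42, :action "login"}, :key "42"}
```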

Attributes

| key | type | default | description |
|-----|------|---------|-------------|
| :kafka/topic | string | | The topic name to connect to |
| :kafka/zookeeper | string | | The ZooKeeper connection string |
| :kafka/serializer-fn | keyword | | A keyword that represents a fully qualified namespaced function to serialize a message. Takes one argument - the segment |
| :kafka/request-size | number | | The maximum size of request messages. Maps to the max.request.size value of the internal Kafka producer |
| :kafka/no-seal? | boolean | false | Do not write :done to the topic when the task receives the sentinel signal (end of batch job) |
| :kafka/producer-opts | map | | A map of arbitrary configuration to merge into the underlying Kafka producer base configuration. Map should contain keywords as keys, and the valid values described in the Kafka Docs. Please note that keys such as buffer.memory must be in keyword form, i.e. :buffer.memory |
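
A serializer for :kafka/serializer-fn could look like the sketch below. It assumes the producer expects a byte array (mirroring the byte array that a deserializer receives) and encodes whatever value it is given as a UTF-8 EDN string. The namespace my.ns and the EDN encoding are choices made for the example, not requirements of the plugin.

```clojure
(ns my.ns)

;; Hypothetical serializer: encodes its single argument as a UTF-8
;; EDN string and returns the resulting byte array.
(defn serializer-fn
  [value]
  (.getBytes ^String (pr-str value) "UTF-8"))
```

Pairing this with a deserializer that reads EDN from the same bytes keeps round trips through a topic symmetric.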

Test Utilities

A take-segments utility function is provided for testing the results of jobs with Kafka output tasks. take-segments reads from a topic until a :done is reached, then returns the results. Note that if a :done is never written to the topic, the call blocks forever, as there is no timeout.

(ns your-ns.a-test
  (:require [onyx.kafka.utils :as kpu]))

;; insert code to run a job here

;; retrieve the segments on the topic
(def results
  (kpu/take-segments (:zookeeper/addr peer-config) "yourtopic" your-decompress-fn))

(last results)
; :done
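
Because take-segments has no timeout, a test can guard against hanging by running the call on another thread and giving up after a deadline. The helper below is a minimal sketch using only clojure.core; call-with-timeout is a hypothetical name, and whether cancelling the future actually stops the blocked read depends on how take-segments responds to interruption, which is not documented here.

```clojure
;; Hypothetical helper: runs f on another thread and gives up after
;; timeout-ms, returning ::timed-out instead of blocking forever.
(defn call-with-timeout
  [timeout-ms f]
  (let [fut (future (f))
        result (deref fut timeout-ms ::timed-out)]
    (when (= result ::timed-out)
      ;; best effort: attempts to interrupt the thread running f
      (future-cancel fut))
    result))

;; Usage in a test, wrapping the take-segments call from the example:
;; (call-with-timeout 30000
;;   #(kpu/take-segments (:zookeeper/addr peer-config)
;;                       "yourtopic"
;;                       your-decompress-fn))
```

A test can then fail with a clear message when ::timed-out is returned instead of stalling the whole run.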

Embedded Kafka Server

An embedded Kafka server is included for use in test cases where jobs write to Kafka output tasks. Note that stopping the server does not perform a graceful shutdown, so please do not use this embedded server for anything other than tests.

This can be used like so:

(ns your-ns.a-test
  (:require [onyx.kafka.embedded-server :as ke]
            [com.stuartsierra.component :as component]))

(def kafka-server
  (component/start
    (ke/map->EmbeddedKafka {:hostname "127.0.0.1"
                            :port 9092
                            :broker-id 0
                            :num-partitions 1
                            ; optional log dir name - randomized dir will be created if none is supplied
                            ; :log-dir "/tmp/embedded-kafka"
                            :zookeeper-addr "127.0.0.1:2188"})))

;; insert code to run a test here

;; stop the embedded server
(component/stop kafka-server)

Development

To benchmark, start a real ZooKeeper instance (at 127.0.0.1:2181) and Kafka instance, and run the following benchmarks.

Write perf, single peer writer:

TIMBRE_LOG_LEVEL=info lein test onyx.plugin.output-bench-test :benchmark

Read perf, single peer reader:

TIMBRE_LOG_LEVEL=info lein test onyx.plugin.input-benchmark-test :benchmark

Past results are maintained in dev-resources/benchmarking/results.txt.

Contributing

Pull requests into the master branch are welcome.

License

Copyright © 2015 Michael Drogalis

Distributed under the Eclipse Public License, the same as Clojure.
