teslamotors / kafka-helmsman

kafka-helmsman is a repository of tools that focus on automating a Kafka deployment

License: MIT License

Python 7.87% Shell 0.67% Java 85.10% Starlark 6.37%

kafka-helmsman's People

Contributors

danielfritsch, dependabot[bot], jyates, mattbonnell, matteobaccan, rohagarwal, rohanag12, scottcrossen, shk3, shrijeet-tesla


kafka-helmsman's Issues

Freshness tracker should fail a cluster iteration if all partitions for all consumers fail

Currently, we are very generous with the failure constraints for a cluster, from ConsumerFreshness (lines 281-293):

    // if all the consumer measurements succeed, then we return the cluster name
    // otherwise, Future.get will throw an exception representing the failure to measure a consumer (and thus the
    // failure to successfully monitor the cluster).
    return Futures.whenAllSucceed(completedConsumers).call(client::getCluster, this.executor);
  }

  /**
   * Measure the freshness for all the topic/partitions currently consumed by the given consumer group. To maintain
   * the existing contract, a consumer measurement fails ({@link Future#get()} throws an exception) only if:
   *  - burrow group status lookup fails
   *  - execution is interrupted
   * Failure to actually measure the consumer is swallowed into a log message & metric update; obviously, this is less
   * than ideal for many cases, but it will be addressed later.

However, SSL connection issues (i.e. a misconfiguration) only show up when querying the consumers. So you can have a valid Burrow lookup for the cluster (because Burrow is configured correctly) while freshness fails for every consumer because the tracker is misconfigured. You would never know, though, from the kafka_consumer_freshness_last_success_run_timestamp metric, since it does not get updated for the failures.
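A possible direction, sketched below with plain CompletableFutures and hypothetical names (the real code uses Guava's Futures.whenAllSucceed, not this API): complete the cluster iteration exceptionally only when every consumer measurement for the cluster has failed, while still tolerating partial failures.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hedged sketch, not the actual ConsumerFreshness implementation: fail the
// cluster iteration only when *all* consumer measurements failed (e.g. a
// tracker-side SSL misconfiguration), instead of swallowing every failure.
public class ClusterIterationCheck {

  /** True when every consumer measurement future completed exceptionally. */
  static boolean allConsumersFailed(List<CompletableFuture<Void>> measurements) {
    return !measurements.isEmpty()
        && measurements.stream().allMatch(CompletableFuture::isCompletedExceptionally);
  }

  /** Completes with the cluster name, or exceptionally when all consumers failed. */
  static CompletableFuture<String> clusterResult(
      String cluster, List<CompletableFuture<Void>> measurements) {
    return CompletableFuture.allOf(measurements.toArray(new CompletableFuture[0]))
        .handle((ok, err) -> {
          if (allConsumersFailed(measurements)) {
            throw new IllegalStateException("all consumer measurements failed for " + cluster);
          }
          return cluster; // partial failure still counts as a successful iteration
        });
  }
}
```

With this shape, the success-timestamp metric would stop advancing when the whole cluster is broken, which is exactly the signal the issue asks for.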

Topic enforcer should indicate replication factor config drift

Topic enforcer cannot alter the replication factor once a topic has been created (Kafka doesn't allow it). Currently, an enforcer run finishes silently with no indication of replication factor drift even if it detects one. A better UX would be to log the drift and inform the user that it is not enforceable.
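A minimal sketch of the suggested UX, with hypothetical names (not the enforcer's actual code): compute a drift message for the non-enforceable case so the run can log it instead of finishing silently.

```java
// Hedged sketch: report replication-factor drift as an explicit,
// non-enforceable finding rather than staying silent about it.
public class ReplicationDriftCheck {

  /** Returns a human-readable drift message, or null when there is no drift. */
  static String replicationDrift(String topic, int actual, int desired) {
    if (actual == desired) {
      return null;
    }
    // Kafka does not allow altering the replication factor in place,
    // so drift can only be reported, not enforced.
    return String.format(
        "topic '%s': replication factor drift (actual=%d, desired=%d); not enforceable",
        topic, actual, desired);
  }
}
```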

Build does not produce anything

Hi, with a clone of the repo on the master branch:

XXXXX@YYYY:~/REPO/test/kafka-helmsman$ bazel build //...:all
INFO: Analyzed 79 targets (0 packages loaded, 0 targets configured).
INFO: Found 79 targets...
INFO: Elapsed time: 0.070s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
XXXXX@YYYY:~/REPO/test/kafka-helmsman$ bazel version
Bazelisk version: v1.7.5
Build label: 3.4.1
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Jul 14 06:27:53 2020 (1594708073)
Build timestamp: 1594708073
Build timestamp as int: 1594708073
XXXXX@YYYY:~/REPO/test/kafka-helmsman$ javac -version
javac 1.8.0_282

The build command does not produce anything, no jar.

Thank you.

Freshness tracker verbosity could be improved for errors and debugging

For example, when a consumer is not available from Burrow, we dump a large error message in the logs:

2022-06-29 09:03:52 ERROR [main] c.t.d.c.f.ConsumerFreshness:312 - Failed to read Burrow status for consumer example.missing.consumer. Skipping
java.io.IOException: Response was not successful: Response{protocol=http/1.1, code=404, message=Not Found, url=http://my.burrow/v3/kafka/my-cluster/consumer/example.missing.consumer/lag}
        at com.tesla.data.consumer.freshness.Burrow.request(Burrow.java:95)
        at com.tesla.data.consumer.freshness.Burrow.getConsumerGroupStatus(Burrow.java:111)
        at com.tesla.data.consumer.freshness.Burrow$ClusterClient.getConsumerGroupStatus(Burrow.java:144)
        at com.tesla.data.consumer.freshness.ConsumerFreshness.measureConsumer(ConsumerFreshness.java:307)
        at com.tesla.data.consumer.freshness.ConsumerFreshness.measureCluster(ConsumerFreshness.java:271)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:440)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
        at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)

But these consumers can be missing lag information simply because Burrow has include/exclude rules, so these error messages just clog the logs.

Conversely, it's hard to diagnose a bug for a consumer if you don't know what the freshness tracker is seeing. For example, a consumer shows increasing lag, but Burrow and Kafka both say it is up to date on the latest commit (this occurred recently). If this persists past a freshness-tracker restart, something is wonky in the tracker, and you would want to turn on debug logging (even if it is verbose) to see what is going on.
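One way to address both directions at once, sketched below with hypothetical names (the real code logs via its own logger at ConsumerFreshness:312): keep a compact one-line message for the ERROR level and attach the full stack trace only when debug logging is enabled.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hedged sketch of the logging policy suggested above: one-line summaries by
// default so missing-consumer 404s don't clog the logs, full traces on demand.
public class FreshnessLogging {

  /** Format the failure for logging; include the stack trace only at debug. */
  static String format(String consumer, Throwable error, boolean debugEnabled) {
    String summary = String.format(
        "Failed to read Burrow status for consumer %s: %s. Skipping",
        consumer, error.getMessage());
    if (!debugEnabled) {
      return summary; // compact ERROR line for expected cases like a 404
    }
    StringWriter trace = new StringWriter();
    error.printStackTrace(new PrintWriter(trace));
    return summary + System.lineSeparator() + trace;
  }
}
```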

Use KafkaAdminClient for quota enforcement + upgrade Kafka libs to 2.6

With the current Kafka version (2.4.1), quota enforcement was implemented through a ZooKeeper admin client, since the KafkaAdminClient only supports quota configuration with Kafka >= 2.6, client- and server-side.

With the introduction of quota enforcement functionality in this project, we had to add the Kafka server library (which contains the ZK admin client code), which in turn meant complicating the dependency environment with various Scala libraries and the Scala bazel rules.

When we are ready to upgrade Kafka to >= 2.6, it would make sense to remove these Scala dependencies and go back to a lightweight dependencies.yaml with just Java libraries. This entails:

  • removing the Kafka server library
  • upgrading the kafka-client library
  • removing unneeded dependencies that were added in the below PRs
  • using KafkaAdminClient for quota configuration

#54
#53
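The migration itself would go through Admin.alterClientQuotas, which kafka-clients exposes from 2.6 onward. As a dependency-free sketch (names and quota keys are illustrative), computing which quota entries actually drifted before issuing alterations might look like:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: pure diff of current vs. desired quota configs. In a real
// implementation each resulting entry would become a ClientQuotaAlteration.Op
// passed to Admin.alterClientQuotas (Kafka >= 2.6).
public class QuotaDiff {

  /** Quota keys whose desired value differs from, or is missing in, the current config. */
  static Map<String, Double> quotaOps(Map<String, Double> current, Map<String, Double> desired) {
    Map<String, Double> ops = new HashMap<>();
    for (Map.Entry<String, Double> e : desired.entrySet()) {
      if (!e.getValue().equals(current.get(e.getKey()))) {
        ops.put(e.getKey(), e.getValue());
      }
    }
    return ops;
  }
}
```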

Docker example of ConsumerFreshness_deploy

Here is a very simple Docker example of ConsumerFreshness_deploy:

Dockerfile:

    FROM openjdk:11-jre-slim
    ADD ConsumerFreshness_deploy.jar ConsumerFreshness_deploy.jar
    ADD conf.yaml conf.yaml
    CMD java -jar ConsumerFreshness_deploy.jar --conf conf.yaml

docker-compose.yml:

    version: "3"
    services:
      burrow:
        build:
          context: ./burrow/
          dockerfile: Dockerfile
        volumes:
          - ./burrow/burrow.toml:/etc/burrow/burrow.toml
        ports:
          - 8000:8000
        depends_on:
          - zookeeper
          - kafka

      time_lag:
        build:
          context: ./tesla/
          dockerfile: Dockerfile
        volumes:
          - ./tesla/conf.yaml:/conf.yaml
          - ./tesla/ConsumerFreshness_deploy.jar:/ConsumerFreshness_deploy.jar
        ports:
          - 8099:8081
        depends_on:
          - burrow
          - kafka
    ...

Any suggestion is super welcome 👍

Strange freshness calculation for some topic partitions

I've started relying on the freshness tracker for Kafka consumer health alerting. Recently, some of the freshness tracker metrics seem to be unreliable. I have a topic with 900 partitions. Checking offset lag via the Kafka API, I see per-partition offset lags oscillating between 0 and 1k. In the attached graphs, I'm singling out a single partition: the first graph shows the freshness-derived lag, the second shows Burrow's reported offset lag. I can't figure out why the freshness lag is so far off and oscillates between zero and many hours.

Have you encountered something like this before?

freshness lag graph

offset lag graph
