After we upgraded our Kafka brokers from 1.0.1 to 2.3.0, the Kaffe consumer started to misbehave. It gets trapped in a rebalance cycle for hours, during which no messages are consumed. Then it recovers and processes all pending messages, and after a random interval the rebalance cycle starts again and lasts for hours.
This happens only if the consumer consumes from multiple topics.
Our Kaffe version is {:kaffe, "~> 1.14.1"}.
The config:
config :kaffe,
  consumer: [
    endpoints: [
      {"localhost", 9092}
    ],
    worker_allocation_strategy: :worker_per_topic_partition,
    offset_reset_policy: :reset_to_earliest,
    topics: [
      "topic-1",
      "topic-2",
      "topic-3"
    ],
    consumer_group: "test-consumer",
    message_handler: MyKafkaConsumer,
    async_message_ack: false,
    start_with_earliest_message: true
  ]
To reproduce, start a Kafka broker with version 2.3.0 and restart the application. The rebalance cycle begins with roughly 60% probability. If it doesn't, restart the application again until the issue occurs.
The application log:
11:12:56.271 [info] event#startup=Elixir.Kaffe.WorkerSupervisor subscriber_name=test_consumer
11:12:56.279 [info] event#starting=Elixir.Kaffe.WorkerManager subscriber_name=test_consumer supervisor=#PID<0.1061.0>
11:12:56.279 [debug] Starting group member for topic: topic-1
Interactive Elixir (1.6.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> 11:12:56.368 [info] event#init=Elixir.Kaffe.GroupMember
group_coordinator=#PID<0.1071.0>
subscriber_name=test_consumer
consumer_group=test_consumer
11:12:56.368 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-1
11:12:56.368 [debug] Starting group member for topic: topic-2
11:12:56.370 [info] event#init=Elixir.Kaffe.GroupMember
group_coordinator=#PID<0.1077.0>
subscriber_name=test_consumer
consumer_group=test_consumer
11:12:56.370 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-2
11:12:56.370 [debug] Starting group member for topic: topic-3
11:12:56.372 [info] event#init=Elixir.Kaffe.GroupMember
group_coordinator=#PID<0.1083.0>
subscriber_name=test_consumer
consumer_group=test_consumer
11:12:56.372 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-3
11:12:56.384 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=14):
elected=false
11:12:56.385 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=14):
elected=true
11:12:56.385 [info] Group member (test_consumer,coor=#PID<0.1083.0>,cb=#PID<0.1079.0>,generation=15):
elected=true
11:12:56.387 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-2
11:12:56.387 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-1
11:12:56.387 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=14):
failed to join group
reason: :unknown_member_id
11:12:56.387 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=14):
re-joining group, reason::unknown_member_id
11:12:56.387 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=14):
failed to join group
reason: :unknown_member_id
11:12:56.387 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=14):
re-joining group, reason::unknown_member_id
11:12:56.390 [info] event#assignments_received=Elixir.Kaffe.GroupMember.test_consumer.topic-3 generation_id=15
11:12:56.390 [info] Group member (test_consumer,coor=#PID<0.1083.0>,cb=#PID<0.1079.0>,generation=15):
assignments received:
topic-3:
partition=0 begin_offset=undefined
11:12:56.392 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=16):
elected=true
11:12:56.395 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=17):
elected=true
11:12:56.396 [info] event#assignments_received=Elixir.Kaffe.GroupMember.test_consumer.topic-2 generation_id=16
11:12:56.396 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=16):
assignments received:
topic-2:
partition=0 begin_offset=undefined
11:12:56.398 [info] event#assignments_received=Elixir.Kaffe.GroupMember.test_consumer.topic-1 generation_id=17
11:12:56.399 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=17):
assignments received:
topic-1:
partition=0 begin_offset=undefined
11:13:01.374 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-2
11:13:01.374 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=16):
re-joining group, reason::unknown_member_id
11:13:01.375 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-3
11:13:01.375 [info] Group member (test_consumer,coor=#PID<0.1083.0>,cb=#PID<0.1079.0>,generation=15):
re-joining group, reason::unknown_member_id
11:13:01.379 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=18):
elected=true
11:13:01.382 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-2
11:13:01.382 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=18):
failed to join group
reason: :rebalance_in_progress
11:13:01.382 [info] Group member (test_consumer,coor=#PID<0.1083.0>,cb=#PID<0.1079.0>,generation=19):
elected=true
11:13:01.382 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=18):
re-joining group, reason::rebalance_in_progress
11:13:01.386 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=18):
failed to join group
reason: :unknown_member_id
11:13:01.387 [info] event#assignments_received=Elixir.Kaffe.GroupMember.test_consumer.topic-3 generation_id=19
11:13:01.387 [info] Group member (test_consumer,coor=#PID<0.1083.0>,cb=#PID<0.1079.0>,generation=19):
assignments received:
topic-3:
partition=0 begin_offset=undefined
11:13:02.387 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-2
11:13:02.387 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=18):
re-joining group, reason::unknown_member_id
11:13:02.396 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=20):
elected=true
11:13:02.400 [info] event#assignments_received=Elixir.Kaffe.GroupMember.test_consumer.topic-2 generation_id=20
11:13:02.400 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=20):
assignments received:
topic-2:
partition=0 begin_offset=undefined
11:13:06.372 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-1
11:13:06.372 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=17):
re-joining group, reason::unknown_member_id
11:13:06.377 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-3
11:13:06.377 [info] Group member (test_consumer,coor=#PID<0.1083.0>,cb=#PID<0.1079.0>,generation=19):
re-joining group, reason::unknown_member_id
11:13:06.381 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=21):
elected=true
11:13:06.383 [info] Group member (test_consumer,coor=#PID<0.1083.0>,cb=#PID<0.1079.0>,generation=22):
elected=true
11:13:06.384 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=21):
failed to join group
reason: :unknown_member_id
11:13:06.384 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-1
11:13:06.384 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=21):
re-joining group, reason::unknown_member_id
11:13:06.387 [info] event#assignments_received=Elixir.Kaffe.GroupMember.test_consumer.topic-3 generation_id=22
11:13:06.387 [info] Group member (test_consumer,coor=#PID<0.1083.0>,cb=#PID<0.1079.0>,generation=22):
assignments received:
topic-3:
partition=0 begin_offset=undefined
11:13:06.389 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=23):
elected=true
11:13:06.390 [debug] Discarding old generation 15 for current generation: 22
11:13:06.393 [info] event#assignments_received=Elixir.Kaffe.GroupMember.test_consumer.topic-1 generation_id=23
11:13:06.393 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=23):
assignments received:
topic-1:
partition=0 begin_offset=undefined
11:13:06.396 [debug] Discarding old generation 16 for current generation: 20
11:13:06.399 [debug] Discarding old generation 17 for current generation: 23
11:13:11.376 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-2
11:13:11.376 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=20):
re-joining group, reason::unknown_member_id
11:13:11.377 [info] Group member (test_consumer,coor=#PID<0.1083.0>,cb=#PID<0.1079.0>,generation=22):
re-joining group, reason::unknown_member_id
11:13:11.377 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-3
11:13:11.384 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=24):
elected=true
11:13:11.384 [info] Group member (test_consumer,coor=#PID<0.1083.0>,cb=#PID<0.1079.0>,generation=24):
elected=false
11:13:11.388 [info] event#assignments_received=Elixir.Kaffe.GroupMember.test_consumer.topic-2 generation_id=24
11:13:11.388 [debug] Discarding old generation 19 for current generation: 22
11:13:11.388 [info] Group member (test_consumer,coor=#PID<0.1077.0>,cb=#PID<0.1073.0>,generation=24):
assignments received:
topic-2:
partition=0 begin_offset=undefined
11:13:11.389 [info] event#assignments_received=Elixir.Kaffe.GroupMember.test_consumer.topic-3 generation_id=24
11:13:11.389 [info] Group member (test_consumer,coor=#PID<0.1083.0>,cb=#PID<0.1079.0>,generation=24):
assignments received:
topic-3:
partition=0 begin_offset=undefined
11:13:12.401 [debug] Discarding old generation 20 for current generation: 24
11:13:16.374 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=23):
re-joining group, reason::unknown_member_id
11:13:16.374 [info] event#assignments_revoked=Elixir.Kaffe.GroupMember.test_consumer.topic-1
11:13:16.380 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=25):
elected=true
11:13:16.385 [info] event#assignments_received=Elixir.Kaffe.GroupMember.test_consumer.topic-1 generation_id=25
11:13:16.385 [info] Group member (test_consumer,coor=#PID<0.1071.0>,cb=#PID<0.1065.0>,generation=25):
assignments received:
topic-1:
partition=0 begin_offset=undefined
11:13:16.388 [debug] Discarding old generation 22 for current generation: 24
11:13:16.395 [debug] Discarding old generation 23 for current generation: 25
...and it continues for hours until, somehow, all group members receive their assignments for the same generation.
The Kafka broker log:
kafka_1 | [2019-07-23 09:12:56,386] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 13 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1077.0>-625143d0-22a8-4a51-9d40-4eecbc66a1bb with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:12:56,387] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 14 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:12:56,388] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 14 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1083.0>-9db8b1f9-17b1-43d3-ab18-0f8daecff052 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:12:56,388] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 15 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:12:56,391] INFO [GroupCoordinator 0]: Assignment received from leader for group test_consumer for generation 15 (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:12:56,394] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 15 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1077.0>-c805a9d7-bb33-4b04-8ea3-8df99fcc49f6 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:12:56,395] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 16 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:12:56,397] INFO [GroupCoordinator 0]: Assignment received from leader for group test_consumer for generation 16 (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:12:56,398] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 16 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1071.0>-0d583d67-dcd2-4115-8d4d-8552b7c455e9 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:12:56,398] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 17 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:12:56,400] INFO [GroupCoordinator 0]: Assignment received from leader for group test_consumer for generation 17 (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:01,387] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 17 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1077.0>-ed514f3c-3fb7-4523-bb43-0eb67b5ee5c5 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:01,388] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 18 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:01,390] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 18 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1083.0>-cbc2b2df-0ade-430e-ba12-22a3205a04eb with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:01,391] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 19 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:01,394] INFO [GroupCoordinator 0]: Assignment received from leader for group test_consumer for generation 19 (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:02,404] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 19 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1077.0>-76c36c71-519f-45f8-953a-80eb352be715 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:02,405] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 20 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:02,408] INFO [GroupCoordinator 0]: Assignment received from leader for group test_consumer for generation 20 (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:06,394] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 20 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1071.0>-bf04c78a-3c5c-4259-8d37-c747e1e2b0aa with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:06,395] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 21 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:06,397] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 21 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1083.0>-25b81c73-7bde-4392-bb7e-d28ea2b3e7f3 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:06,398] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 22 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:06,400] INFO [GroupCoordinator 0]: Assignment received from leader for group test_consumer for generation 22 (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:06,403] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 22 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1071.0>-bea28729-3749-4428-be2b-1f3b96494333 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:06,404] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 23 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:06,406] INFO [GroupCoordinator 0]: Assignment received from leader for group test_consumer for generation 23 (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:11,403] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 23 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1077.0>-7fdb63ba-4fba-432d-9af3-89411c63d4c9 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:11,404] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 24 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:11,407] INFO [GroupCoordinator 0]: Assignment received from leader for group test_consumer for generation 24 (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:16,406] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 24 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1071.0>-1478c6db-6b99-4f75-9f47-5cf00500d913 with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:16,407] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 25 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:16,409] INFO [GroupCoordinator 0]: Assignment received from leader for group test_consumer for generation 25 (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:21,415] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 25 (__consumer_offsets-7) (reason: Adding new member nonode@nohost/<0.1077.0>-ef0eb8e4-fabd-41bb-a55f-e1af641914ff with group instanceid None) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:21,416] INFO [GroupCoordinator 0]: Stabilized group test_consumer generation 26 (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:21,418] INFO [GroupCoordinator 0]: Assignment received from leader for group test_consumer for generation 26 (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:51,384] INFO [GroupCoordinator 0]: Member nonode@nohost/<0.1083.0>-b1810a29-4f1a-40a6-8528-3bf4db48ff0a in group test_consumer has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
kafka_1 | [2019-07-23 09:13:51,384] INFO [GroupCoordinator 0]: Preparing to rebalance group test_consumer in state PreparingRebalance with old generation 26 (__consumer_offsets-7) (reason: removing member nonode@nohost/<0.1083.0>-b1810a29-4f1a-40a6-8528-3bf4db48ff0a on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
I think the rebalance for each topic somehow affects the consumers of the other topics: if the group members for different topics receive their assignments in different generations, they invalidate each other.
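To make this suspicion concrete, here is a toy model of it in Elixir. It is only a sketch of the hypothesis, not Kaffe, brod, or Kafka code: the module name GenerationSketch and its functions are made up for illustration, and it assumes each per-topic group member joins the shared group separately and that every join bumps the group generation.

```elixir
defmodule GenerationSketch do
  # Hypothetical model (not Kaffe/brod code): one consumer group shared by
  # all per-topic members. Every join bumps the group generation, and the
  # joining member remembers the generation it was assigned.
  def join(%{generation: gen, members: members}, member) do
    new_gen = gen + 1
    %{generation: new_gen, members: Map.put(members, member, new_gen)}
  end

  # Members still holding a generation older than the group's current one.
  # Under the assumption above, these would have to re-join the group.
  def stale_members(%{generation: gen, members: members}) do
    members
    |> Enum.filter(fn {_member, seen} -> seen < gen end)
    |> Enum.map(fn {member, _seen} -> member end)
    |> Enum.sort()
  end
end

# Start from generation 13, the old generation in the first broker log line.
group = %{generation: 13, members: %{}}

# The three per-topic members join one after another.
group =
  Enum.reduce(["topic-1", "topic-2", "topic-3"], group, fn topic, acc ->
    GenerationSketch.join(acc, topic)
  end)

IO.inspect(GenerationSketch.stale_members(group))
# prints ["topic-1", "topic-2"]
```

In this model only the last member to join holds the current generation. If each stale member then re-joins on its own, it makes the others stale again, which would match the cycle in the logs.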
The issue did not occur with the older 1.0.1 broker.
I hope this description is enough to reproduce the issue.