Kafka Consumer Rebalance Issues: 'Member Not Known' and 'I/O Timeout' Troubleshooting

Problem Description

When running Kafka consumers in production, you may encounter the following error patterns:

Client-side logs:

The provided member is not known in the current generation
i/o timeout

Server-side logs (broker):

[GroupCoordinator 0]: Sending empty assignment to member watermill-xxx of group-name for generation 14 with no errors
[GroupCoordinator 0]: Dynamic Member with unknown member id joins group avast in PreparingRebalance state. Created a new member id watermill-xxx for this member
[GroupCoordinator 0]: Stabilized group avast generation 49 (__consumer_offsets-23) with 4 members

This typically manifests as consumers experiencing frequent rebalances after running normally for days, leading to message lag.

Root Cause Analysis

Cause 1: Rebalance Timeout Exceeded

The Kafka consumer group session lifecycle follows these steps:

  1. Consumers join the group and receive partition assignments
  2. Setup() hook is called before processing begins
  3. ConsumeClaim() is invoked for each assigned partition in separate goroutines
  4. Session persists until ConsumeClaim() exits (context cancelled or rebalance initiated)
  5. Cleanup() hook runs after all ConsumeClaim() loops exit
  6. Final offset commit before releasing claims

Critical constraint: Once a rebalance is triggered, every session must wind down within Config.Consumer.Group.Rebalance.Timeout. If a ConsumeClaim() function does not exit quickly enough, the broker removes the consumer from the group, and that consumer's subsequent offset commits fail.
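
One way to respect this constraint is to have the consume loop watch the session context and return as soon as it is cancelled. The sketch below is a stdlib-only stand-in, not Sarama code: consumeLoop, runDemo, and the string channel are hypothetical names that only mimic the shape of a ConsumeClaim() loop.

```go
package main

import (
	"context"
	"fmt"
)

// consumeLoop mimics the shape of a ConsumeClaim() loop. It is a hypothetical,
// stdlib-only stand-in: msgs plays the role of the claim's message channel and
// ctx plays the role of the session context.
func consumeLoop(ctx context.Context, msgs <-chan string, handle func(string)) int {
	processed := 0
	for {
		select {
		case <-ctx.Done():
			// Rebalance or shutdown: return right away so the session can
			// finish inside the rebalance timeout.
			return processed
		case m, ok := <-msgs:
			if !ok {
				return processed
			}
			handle(m)
			processed++
		}
	}
}

// runDemo feeds three messages, then cancels the context to simulate a
// rebalance, and reports how many messages were handled before the loop exited.
func runDemo() int {
	ctx, cancel := context.WithCancel(context.Background())
	msgs := make(chan string)
	done := make(chan int)
	go func() { done <- consumeLoop(ctx, msgs, func(string) {}) }()
	for _, m := range []string{"a", "b", "c"} {
		msgs <- m
	}
	cancel() // the loop must exit even though msgs is still open
	return <-done
}

func main() {
	fmt.Println("processed before exit:", runDemo())
}
```

The point of the select is that a cancelled context wins over waiting for the next message, so the loop never blocks on an idle partition while the group is waiting for it to leave.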

Cause 2: Shared Group ID Across Multiple Topics

The most common production issue: Multiple consumers using the same group.id but subscribing to different topics.

Expected behavior:

one client + one group.id + one topic = expected

Actual problematic configuration:

one client + one group.id + two topics + three partitions = problematic

When different topics share a group.id, any consumer going offline triggers a rebalance for ALL consumers in that group, including those reading unrelated topics.

Server logs showing this pattern:

[GroupCoordinator 0]: Dynamic Member with unknown member id joins group avast in PreparingRebalance state.
[GroupCoordinator 0]: Preparing to rebalance group avast in state PreparingRebalance with old generation 48
[GroupCoordinator 0]: Member has left group avast through explicit LeaveGroup request
[GroupCoordinator 0]: Group avast removed dynamic members who haven't joined

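This occurs because the group coordinator tracks membership per group.id, not per topic: the partitions of all subscribed topics are pooled and assigned together. When any member disconnects, the coordinator starts a new generation and every member of the group has to rejoin.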
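
The group-wide effect can be illustrated with a toy model. This is a simplified sketch, not Kafka's coordinator: the group type, demoGroup, and the member names are made up for illustration, and the only behaviour modelled is that membership is keyed by group rather than by topic.

```go
package main

import (
	"fmt"
	"sort"
)

// group is a toy model of a consumer group. Membership and generation are
// tracked per group, with no regard for which topic a member reads.
type group struct {
	generation int
	members    map[string]string // member ID -> subscribed topic
}

// leave removes a member and bumps the generation, which forces every
// remaining member to rejoin. It returns the sorted IDs of affected members.
func (g *group) leave(memberID string) []string {
	delete(g.members, memberID)
	g.generation++
	affected := make([]string, 0, len(g.members))
	for id := range g.members {
		affected = append(affected, id)
	}
	sort.Strings(affected)
	return affected
}

// demoGroup builds a group shared by consumers of two unrelated topics.
func demoGroup() *group {
	return &group{
		generation: 48,
		members: map[string]string{
			"orders-1":   "orders",
			"orders-2":   "orders",
			"payments-1": "payments",
		},
	}
}

func main() {
	g := demoGroup()
	affected := g.leave("orders-1")
	// payments-1 has to rejoin even though it never touched the orders topic.
	fmt.Println("generation:", g.generation, "must rejoin:", affected)
}
```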

Cause 3: Library-Specific Issues

Certain libraries like Watermill have known issues where network timeouts compound with rebalance behavior, creating cascading failures.

Why Standard Fixes Don't Work

Increasing timeout values doesn't help because the underlying issue is frequent rebalancing, not insufficient timeout windows.

Offloading processing to channels doesn't help because the heartbeat mechanism runs independently. Each consumer runs a separate goroutine that sends heartbeats every 3 seconds (the default interval), so slow processing only affects consumption rate, not group stability.
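
The independence of heartbeats from processing can be sketched with a ticker goroutine. This is an illustration of the pattern only, with hypothetical names (heartbeatsDuring, runHeartbeatDemo) and millisecond intervals chosen so the demo finishes quickly; it does not talk to a broker.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// heartbeatsDuring runs a ticker-driven heartbeat goroutine while slowWork
// executes on the calling goroutine, and returns how many heartbeats fired.
func heartbeatsDuring(interval time.Duration, slowWork func()) int64 {
	var beats int64
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				atomic.AddInt64(&beats, 1) // stands in for sending a heartbeat
			case <-stop:
				return
			}
		}
	}()
	slowWork() // message processing blocks here, not in the heartbeat goroutine
	close(stop)
	<-done
	return atomic.LoadInt64(&beats)
}

// runHeartbeatDemo simulates 100ms of slow processing with a 5ms heartbeat.
func runHeartbeatDemo() int64 {
	return heartbeatsDuring(5*time.Millisecond, func() {
		time.Sleep(100 * time.Millisecond)
	})
}

func main() {
	fmt.Println("heartbeats sent during slow processing:", runHeartbeatDemo())
}
```

The exact count depends on scheduling, but heartbeats keep firing for the whole duration of the slow call, which is why moving the work onto a channel changes throughput and not group membership.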

Solution

Use unique consumer group IDs. Avoid generic or shared group names across different applications or topics.

Good naming convention:

consumerGroupID := fmt.Sprintf("%s-%s-%s", appName, topicName, environment)
// Example: "payment-processor-orders-prod"
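
A slightly fuller sketch of the same convention wraps it in a helper that rejects empty components, so a missing environment variable cannot silently produce a shared name such as "payment-processor-orders-". The function name groupID and the validation rule are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// groupID builds a consumer group ID from application, topic, and environment.
// It refuses empty components so two deployments cannot collapse onto the
// same partially filled name.
func groupID(appName, topicName, environment string) (string, error) {
	parts := []string{appName, topicName, environment}
	for _, p := range parts {
		if strings.TrimSpace(p) == "" {
			return "", fmt.Errorf("group ID component must not be empty: %q", parts)
		}
	}
	return strings.Join(parts, "-"), nil
}

func main() {
	id, err := groupID("payment-processor", "orders", "prod")
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // payment-processor-orders-prod
}
```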

Verification: If you isolate a single consumer to its own group and rebalancing stops, you've confirmed this is the issue.

Complete Error Flow

When the shared group ID issue causes rebalancing:

  1. One consumer disconnects → group initiates rebalance
  2. Session cancellation occurs within Rebalance.Timeout window
  3. Old connections attempt to reconnect with stale generation numbers
  4. Broker rejects requests with "The provided member is not known in the current generation"
  5. TCP connections time out waiting for responses → "i/o timeout"
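
Steps 3 and 4 can be sketched as a toy coordinator that rejects requests from a previous generation. This is a deliberately simplified model with hypothetical names (coordinator, demoCoordinator, commit); it collapses the broker's checks into a single error that reuses the client-side message quoted above.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnknownMember reuses the client-side error text quoted in this article.
var errUnknownMember = errors.New("The provided member is not known in the current generation")

// coordinator is a toy stand-in for the broker's group coordinator.
type coordinator struct {
	generation int
	members    map[string]bool
}

// rebalance starts a new generation containing only the members that rejoined.
func (c *coordinator) rebalance(rejoined ...string) {
	c.generation++
	c.members = make(map[string]bool)
	for _, id := range rejoined {
		c.members[id] = true
	}
}

// commit accepts a request only from a member of the current generation.
func (c *coordinator) commit(memberID string, generation int) error {
	if generation != c.generation || !c.members[memberID] {
		return errUnknownMember
	}
	return nil
}

// demoCoordinator returns a group at generation 48 with two members.
func demoCoordinator() *coordinator {
	return &coordinator{generation: 48, members: map[string]bool{"a": true, "b": true}}
}

func main() {
	c := demoCoordinator()
	c.rebalance("b") // member "a" was too slow to rejoin

	fmt.Println("a, generation 48:", c.commit("a", 48)) // rejected
	fmt.Println("b, generation 49:", c.commit("b", 49)) // accepted, prints <nil>
}
```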

Configuration Recommendations

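config := sarama.NewConfig()

// Session timeout: how long the broker waits for a heartbeat before
// declaring the member dead. Set appropriately for your network conditions.
config.Consumer.Group.Session.Timeout = 30 * time.Second

// Rebalance timeout: how long members have to wind down and rejoin once a
// rebalance starts.
config.Consumer.Group.Rebalance.Timeout = 60 * time.Second
config.Consumer.Group.Rebalance.Strategy = sarama.NewBalanceStrategyRoundRobin()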

Key principle: Invest as much effort in naming consumer groups as you do in naming topics. Generic names like "consumer-group" or "processor" will inevitably conflict in any moderately complex deployment.

Tags: Kafka troubleshooting consumer-group rebalance sarama

Posted on Sat, 06 Jun 2026 16:38:56 +0000 by dfego