Root Cause Analysis
The "read tcp :49560->:9092: i/o timeout" error frequently appears in Sarama-based Kafka consumers handling file scanning workloads. Initial investigations often lead developers to suspect network infrastructure or disk I/O problems, but the actual cause lies in Go's deadline mechanism combined with Sarama's consumer group rebalancing behavior.
When using consumer groups with long-running message processing, the Kafka broker's rebalance protocol interacts with the underlying net.Conn.SetDeadline implementation. When a context expires or a rebalance is triggered, the deadline set on the network connection causes operations to fail with the underlying net error rather than a context cancellation error. This produces the misleading "i/o timeout" message that obscures the real issue.
Reproduction Scenario
The error becomes reproducible under specific timeout configurations:
config.Consumer.Group.Session.Timeout = time.Second * 30
config.Consumer.Group.Heartbeat.Interval = time.Second * 5
With these settings, processing timeouts trigger rebalance cycles, exposing the underlying context deadline issue. The consumer receives:
Error from consumer: read tcp :33178->:9093: i/o timeout
Solution: Monitor Session Context
The fix requires monitoring ConsumerGroupSession.Context().Done() within the ConsumeClaim loop. This ensures the handler responds to rebalance signals rather than continuing to block on message processing.
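func (h FileScanHandler) ConsumeClaim(
	sess sarama.ConsumerGroupSession,
	claim sarama.ConsumerGroupClaim,
) error {
	for {
		select {
		case msg, ok := <-claim.Messages():
			// The channel is closed when the claim ends; stop instead of
			// processing a nil message.
			if !ok {
				return nil
			}
			if err := h.processFile(msg); err != nil {
				return err
			}
			// Mark the message as processed so its offset can be committed.
			sess.MarkMessage(msg, "")
		case <-sess.Context().Done():
			// The session ended (parent context cancelled or rebalance).
			return nil
		}
	}
}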
The session context is cancelled when either the parent context is cancelled or the broker initiates a rebalance. By listening for this signal, the consumer exits cleanly and allows the rebalance cycle to proceed within the configured timeout.
Configuration Recommendations
For workloads involving long processing times, adjust both the timeout parameters and network read timeout:
config.Consumer.Group.Session.Timeout = time.Second * 120
config.Consumer.Group.Heartbeat.Interval = time.Second * 20
config.Consumer.MaxProcessingTime = time.Minute * 10
// Net.ReadTimeout must exceed Session.Timeout to prevent premature connection failures
// See: https://github.com/Shopify/sarama/issues/1422
config.Net.ReadTimeout = config.Consumer.Group.Session.Timeout + 30*time.Second
The Net.ReadTimeout parameter defaults to 30 seconds, which conflicts with extended session timeouts. The network read timeout must be set higher than the session timeout to allow heartbeat responses and coordinator communication during long processing operations.
Consumer Group Session Lifecycle
Understanding the session lifecycle clarifies why context monitoring is essential:
1. Consumer joins the group and receives partition claims
2. Setup() hook executes before processing begins
3. ConsumeClaim() runs in separate goroutines for each claimed partition
4. Session persists until ConsumeClaim() exits, either from context cancellation or rebalance initiation
5. Cleanup() hook runs after all consume loops terminate
6. Final offset commit occurs before releasing claims
Rebalance timeouts (Config.Consumer.Group.Rebalance.Timeout) impose strict deadlines on steps 4-6. If ConsumeClaim() blocks indefinitely without checking the session context, the consumer exceeds the rebalance timeout and gets removed from the group by Kafka, resulting in offset commit failures.
Key Takeaways
The "i/o timeout" error in Sarama consumers is frequently a context deadline error in disguise, not a genuine network infrastructure problem. Proper handler implementation requires monitoring both message channels and session context to support graceful rebalancing during long-running operations.