Installation
Download and extract Kafka:
wget --no-check-certificate https://dlcdn.apache.org/kafka/3.0.0/kafka_2.13-3.0.0.tgz
tar -xzf kafka_2.13-3.0.0.tgz
cd kafka_2.13-3.0.0
Start ZooKeeper, then the Kafka broker (each command runs in the foreground, so use a separate terminal for each):
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
By default, ZooKeeper listens on port 2181 and the Kafka broker on port 9092. To accept connections from remote clients, set the broker's listener address in config/server.properties:
listeners=PLAINTEXT://your.machine.ip:9092
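To confirm a remote client can actually reach the broker, here is a minimal check with the Java AdminClient; the host and port are assumed to match the listener above:

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;

public class ClusterCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "your.machine.ip:9092");  // same address as the listener
        try (AdminClient admin = AdminClient.create(props)) {
            // Fails with a timeout if the broker is not reachable
            System.out.println("Cluster id: " + admin.describeCluster().clusterId().get());
        }
    }
}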
Consumer Group Rebalancing
Rebalancing is the protocol that distributes a topic's partitions among the consumers in a group. When a group of 20 consumers subscribes to a topic with 100 partitions, each consumer is assigned 100 / 20 = 5 partitions.
Trigger Conditions
- Member count changes (new or removed consumers)
- Topic subscription count changes
- Partition count of subscribed topics changes
During an eager rebalance (the default protocol), every consumer in the group stops consuming until the process completes; the cooperative protocol available via the CooperativeStickyAssignor only pauses the partitions that actually move.
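A consumer can observe both sides of a rebalance by registering a ConsumerRebalanceListener when subscribing. A minimal sketch, with placeholder broker address, group id, and topic:

import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceAware {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Runs while the rebalance pauses consumption: commit offsets, flush state
                System.out.println("Revoked: " + partitions);
            }
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Runs once the new assignment arrives and consumption resumes
                System.out.println("Assigned: " + partitions);
            }
        });
        while (true) consumer.poll(java.time.Duration.ofMillis(500));
    }
}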
Rebalancing Process
The process consists of two phases:
- Join: All members send JoinGroup requests to the coordinator. One is elected as leader.
- Sync: Every member sends a SyncGroup request; the leader's request carries the assignment plan, and the coordinator hands each member its share in the SyncGroup response.
Scenarios
- New member joins: triggers an immediate rebalance
- Member crashes: the coordinator detects the failure only after session.timeout.ms expires, then rebalances
- Graceful leave: the consumer sends a LeaveGroup request, starting a rebalance without waiting for the session timeout
- Offset commits: a commit from a member the group has already rebalanced away fails with CommitFailedException; committing does not itself trigger a rebalance
Avoiding Unnecessary Rebalances
To prevent unwanted rebalances, tune these parameters:
- session.timeout.ms: default 10 seconds through Kafka 2.x (raised to 45 seconds in 3.0). Set to 6 seconds for faster failure detection.
- heartbeat.interval.ms: default 3 seconds. Set to 2 seconds.
- max.poll.interval.ms: default 5 minutes. Increase it if processing a batch of records takes longer.
Ensure session.timeout.ms >= 3 * heartbeat.interval.ms for reliable heartbeats.
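Applied as consumer configuration, a tuning consistent with these rules (values illustrative, not prescriptive):
session.timeout.ms=6000
heartbeat.interval.ms=2000
max.poll.interval.ms=600000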
Key Concepts
Group Coordinator
Each broker runs a Group Coordinator service that manages group metadata and stores committed offsets in the internal __consumer_offsets topic. The coordinator for a given group is the broker hosting the leader of the __consumer_offsets partition computed by:
partition-id = Math.abs(groupId.hashCode() % groupMetadataTopicPartitionCount)
Where groupMetadataTopicPartitionCount (the broker setting offsets.topic.num.partitions) defaults to 50.
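A direct translation of this formula into Java, with a hypothetical group id (a sketch of the calculation, not Kafka's internal code):

public class CoordinatorPartition {
    public static void main(String[] args) {
        String groupId = "demo-group";              // hypothetical group id
        int groupMetadataTopicPartitionCount = 50;  // offsets.topic.num.partitions default
        // The broker leading this __consumer_offsets partition acts as the group's coordinator
        int partitionId = Math.abs(groupId.hashCode() % groupMetadataTopicPartitionCount);
        System.out.println("__consumer_offsets partition: " + partitionId);
    }
}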
Performance Metrics
In a test with 10 partitions and 3 consumers, average rebalancing time was ~87ms.
Parameter Settings
Important configurations:
- group.max.session.timeout.ms > session.timeout.ms
- group.min.session.timeout.ms < session.timeout.ms
- request.timeout.ms > session.timeout.ms + fetch.wait.max.ms
- session.timeout.ms / 3 > heartbeat.interval.ms
- session.timeout.ms > worst-case processing time
Message Handling
Kafka uses partitioning to scale throughput. Each topic can have multiple partitions, stored as separate directories on disk. Producers route each message to a partition based on its key: by default the key is hashed, and records without a key are spread across partitions.
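A keyed send with the Java producer; because the default partitioner hashes the key, all records for the same key land on the same partition (broker address, topic, and key are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Every "user-42" record hashes to the same partition, preserving per-key ordering
            producer.send(new ProducerRecord<>("orders", "user-42", "order payload"));
        }
    }
}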
Consumers read as members of a consumer group: within a group, each partition is assigned to exactly one consumer, while multiple groups can consume the same topic independently.
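A minimal consumer in one such group; starting a second copy with a different group.id would receive the same messages independently (all names are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics");  // e.g. a second group "billing" reads independently
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records)
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
            }
        }
    }
}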
Data Retention
Two policies exist:
- Time-based retention (log.retention.hours, log.retention.minutes, or log.retention.ms)
- Size-based retention (log.retention.bytes, applied per partition)
Example broker configuration (retain data for 168 hours, i.e. one week, rolling log segments at 1 GiB; only whole segments are ever deleted):
log.retention.hours=168
log.segment.bytes=1073741824
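Retention can also be overridden per topic. Here is a sketch using the Java AdminClient to set a hypothetical topic's retention to three days (the topic name is a placeholder):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            // retention.ms on the topic overrides the broker-wide log.retention.hours
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "259200000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}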
Practical Application
For multi-engine systems where different engines support varying license counts:
- Engine count = consumer group count
- License count = consumer count per group
- Partitions = LCM of engine license capacities
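A quick sketch of that calculation, with hypothetical capacities of 2, 3, and 4 licenses; the LCM guarantees the partition count divides evenly for every group:

public class PartitionCount {
    static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
    static long lcm(long a, long b) { return a / gcd(a, b) * b; }

    public static void main(String[] args) {
        long[] licenseCounts = {2, 3, 4};  // hypothetical per-engine license capacities
        long partitions = 1;
        for (long c : licenseCounts) partitions = lcm(partitions, c);
        // LCM(2, 3, 4) = 12: the groups get 6, 4, and 3 partitions per consumer respectively
        System.out.println("partitions = " + partitions);
    }
}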
Troubleshooting
Oversized-message errors (RecordTooLargeException) occur when a record exceeds message.max.bytes. To raise the limit to 10 MB, adjust these settings; the first, second, and fourth are broker-side, while fetch.message.max.bytes belongs to the old consumer (the modern consumer uses max.partition.fetch.bytes):
message.max.bytes=10485760
replica.fetch.max.bytes=10485760
fetch.message.max.bytes=10485760
socket.request.max.bytes=104857600
These limits cap the memory the broker allocates per request when handling large payloads. Keep replica.fetch.max.bytes at least as large as message.max.bytes, or replicas cannot copy large messages between brokers.
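Client-side limits must be raised to match. Assuming the same 10 MB target, in the producer and consumer configuration respectively:
max.request.size=10485760
max.partition.fetch.bytes=10485760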