Oracle RAC Cluster Heartbeat Mechanisms and Best Practices

Physical Network Configuration Guidelines

  • Avoid direct crossover cabling between nodes; use dedicated switches instead.
  • Isolate the private interconnect (heartbeat network) from the public application network.
    • If sharing a switch is unavoidable, enforce VLAN segmentation.
  • Oracle does not support crossover cables for Clusterware internal communication, even in two-node RAC deployments, because they are unstable and cannot scale beyond two nodes.
  • For redundancy, deploy dual private switches in an active-standby configuration.

Role of the Private Interconnect in RAC

The private network serves multiple critical functions:

  • Node health monitoring via network heartbeats.
  • Cache coherency synchronization across instances.
  • Global resource coordination.
  • Data block transfers during global cache (gc) operations. This traffic is high-volume, so the interconnect should run on Gigabit or 10-Gigabit Ethernet.

High Availability for Heartbeat Networks

  • At the OS level, implement NIC bonding (e.g., active-backup or load-balancing modes).
  • Starting with Oracle 11.2.0.2, HAIP (Highly Available IP) provides built-in redundancy for cluster interconnects without requiring OS-level bonding.
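HAIP can only add redundancy when more than one interface is registered as a cluster interconnect. As a rough sketch, the helpers below read oifcfg getif output on stdin and check for that condition. The function names are hypothetical, and the four-column layout (interface, subnet, scope, role) is an assumption about the oifcfg output that should be verified on the target release.

```shell
# List the interfaces whose role column includes cluster_interconnect.
# Input: lines shaped like "<ifname> <subnet> <scope> <role[,role]>" (assumed format).
list_interconnects() {
    awk '$4 ~ /(^|,)cluster_interconnect(,|$)/ { print $1 }'
}

# Succeed only when at least two interconnect interfaces are registered.
has_redundant_interconnect() {
    awk '$4 ~ /(^|,)cluster_interconnect(,|$)/ { n++ } END { exit !(n > 1) }'
}
```

On a cluster node this might be used as: oifcfg getif | list_interconnects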

Jumbo Frames Optimization

Enabling jumbo frames (an MTU of 9000 bytes instead of the default 1500) can significantly improve performance under high-throughput, CPU-bound conditions by:

  • Reducing per-packet overhead (TCP/UDP/IP headers).
  • Minimizing packet fragmentation and reassembly in the IP stack.
  • Lowering CPU utilization and inter-instance block transfer latency.

However, the MTU must match on every NIC and switch port along the interconnect path. A mismatch may prevent instance startup or degrade performance.
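To make the fragmentation saving concrete, the sketch below counts how many IPv4 packets one UDP datagram needs at a given MTU. It is a simplified model: the function name is hypothetical, it assumes an 8 KB (8192-byte) database block sent as a single UDP datagram, a 20-byte IP header without options, and it ignores any additional Oracle message headers.

```shell
# Number of IPv4 packets needed to carry one UDP datagram.
#   $1 = UDP payload size in bytes, $2 = link MTU in bytes
ip_packets_for_udp_payload() {
    local payload=$1 mtu=$2
    local ip_header=20 udp_header=8
    local datagram=$(( payload + udp_header ))   # UDP header plus data
    local max_ip_payload=$(( mtu - ip_header ))  # room left after the IP header

    # Fits in a single packet: no fragmentation at all.
    if [ "$datagram" -le "$max_ip_payload" ]; then
        echo 1
        return
    fi

    # Non-final fragments must carry a multiple of 8 bytes.
    local per_fragment=$(( max_ip_payload / 8 * 8 ))
    echo $(( (datagram + per_fragment - 1) / per_fragment ))
}

ip_packets_for_udp_payload 8192 1500   # prints 6
ip_packets_for_udp_payload 8192 9000   # prints 1
```

Under these assumptions, an 8 KB block needs six fragments at MTU 1500 but travels in a single packet at MTU 9000.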

Linux Configuration and Validation

Enable jumbo frames:

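# Set the MTU on the private interconnect NIC (eth0 is an example name; run as root).
# This change does not persist across reboots.
ip link set dev eth0 mtu 9000
# Confirm that the interface now reports "mtu 9000"
ip addr show eth0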

Test connectivity:

# Test maximum unfragmented payload (8972 = 9000 - 28 bytes IP+ICMP header)
ping -c 2 -M do -s 8972 node2-priv
ping -c 2 -M do -s 8973 node2-priv  # Should fail if MTU=9000

# Alternative path validation
traceroute -F node2-priv 9000   # Should succeed
traceroute -F node2-priv 9001   # Should fail

If tests fail, reduce payload size incrementally to find the stable MTU.
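Stepping the payload down one size at a time can be slow. The sketch below uses a binary search instead, which is a substitution for the incremental approach rather than anything Oracle prescribes. The function names probe and find_max_payload are hypothetical, and the ping flags assume the Linux iputils version of ping.

```shell
# Succeed if an ICMP echo with the given payload reaches the host unfragmented.
#   $1 = host, $2 = payload size in bytes
probe() {
    ping -c 1 -W 2 -M do -s "$2" "$1" >/dev/null 2>&1
}

# Binary-search the largest payload in [lo, hi] that probe accepts.
#   $1 = host, $2 = lo (smallest size to try), $3 = hi (largest size to try)
# Prints the payload size; returns 1 if even the smallest size fails.
find_max_payload() {
    local host=$1 lo=$2 hi=$3 mid
    probe "$host" "$lo" || return 1
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if probe "$host" "$mid"; then
            lo=$mid              # mid works, so search above it
        else
            hi=$(( mid - 1 ))    # mid fails, so search below it
        fi
    done
    echo "$lo"
}
```

For example, find_max_payload node2-priv 1472 8972 prints the largest working payload between the two bounds; adding 28 bytes for the IP and ICMP headers gives the usable path MTU.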

Oracle Cluster Consistency Mechanisms

Oracle Clusterware ensures cluster-wide state consistency through three heartbeat types, all managed by the ocssd.bin process—the core of CSS (Cluster Synchronization Services). Failure of this process triggers node reboot.

Network Heartbeat (NHB)

  • Sent every second over the private network.
  • Verifies inter-node connectivity.

Disk Heartbeat (DHB)

  • Each node writes its status and perceived cluster membership to all Voting Files (VF) every second.
  • During split-brain scenarios, CSS uses VF data to determine valid subclusters.
  • Implemented by ocssd.bin.

Local Heartbeat (LHB)

Monitors local node and ocssd.bin health:

  • Pre-11.2: Handled by oclsomon.bin (monitors ocssd.bin) and oprocd.bin (detects node hangs).
  • 11.2.0.1+: Replaced by cssdagent.bin and cssdmonitor.bin, which receive local heartbeat signals each second.
    • If ocssd.bin misses heartbeats beyond the misscount threshold, the node is evicted.
  • 11.2.0.2+: Introduces rebootless restart—GI stack is restarted first; only if graceful shutdown fails within the I/O timeout is the node rebooted.

Heartbeat Timeouts

  • Network Heartbeat Timeout: Controlled by misscount.
    • Default: 60s (10.2.0.4–11.1), 30s (11.2+).
  • Disk Heartbeat Timeout: Controlled by disktimeout.
    • Default: 200s (10.2.0.4+).

View and modify values:

# Check current settings
crsctl get css misscount
crsctl get css disktimeout

# Update settings (example; run as root)
crsctl set css misscount 30
crsctl set css disktimeout 300
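When auditing several clusters, it can help to compare the reported values against the defaults listed above. The sketch below is one way to do that. The function names are hypothetical, and the parser assumes crsctl prints either a bare number or a message containing "<parameter> <value>" (for example "CRS-4678: Successful get misscount 30 for Cluster Synchronization Services."), which should be confirmed on the release in use.

```shell
# Extract the numeric value for a CSS parameter from crsctl output on stdin.
#   $1 = parameter name (e.g. misscount)
css_value() {
    awk -v name="$1" '
        # Output that is just a number.
        NF == 1 && $1 ~ /^[0-9]+$/ { print $1; exit }
        # Output that contains "<name> <number>" somewhere in the line.
        {
            for (i = 1; i < NF; i++)
                if ($i == name && $(i + 1) ~ /^[0-9]+$/) { print $(i + 1); exit }
        }
    '
}

# Compare the parsed value with an expected default.
#   $1 = parameter name, $2 = expected value; crsctl output on stdin
# Returns 0 on match, 1 on mismatch, 2 if nothing could be parsed.
check_css_default() {
    local name=$1 expected=$2 actual
    actual=$(css_value "$name")
    if [ -z "$actual" ]; then
        echo "$name: could not parse value"
        return 2
    fi
    if [ "$actual" -eq "$expected" ]; then
        echo "$name=$actual (default)"
    else
        echo "$name=$actual (default is $expected)"
        return 1
    fi
}
```

A possible invocation on an 11.2+ node: crsctl get css misscount | check_css_default misscount 30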

Common Failure Scenarios

Split-Brain

When private network failures cause NHB loss (but DHB remains intact), nodes independently declare others dead. Resolution relies on Voting Files:

  • Subcluster with the majority of nodes survives.
  • On tie, the subcluster containing the lowest node number wins.
  • Evicted nodes are forcibly rebooted by CRS.
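The survival rules above can be modeled in a few lines. This is only an illustration of the majority and lowest-node-number rules as stated here, not Oracle's implementation; the function names are hypothetical, and each subcluster is passed as a space-separated list of node numbers.

```shell
# Count the node numbers in a space-separated list.
count_nodes() {
    set -- $1
    echo $#
}

# Print the lowest node number in a space-separated list.
lowest_node() {
    set -- $1
    min=$1
    for n in "$@"; do
        if [ "$n" -lt "$min" ]; then min=$n; fi
    done
    echo "$min"
}

# Print "A" or "B" for the subcluster that survives.
#   $1 = nodes in subcluster A, $2 = nodes in subcluster B
surviving_subcluster() {
    local count_a count_b
    count_a=$(count_nodes "$1")
    count_b=$(count_nodes "$2")
    if [ "$count_a" -gt "$count_b" ]; then
        echo A                                   # A has the majority
    elif [ "$count_a" -lt "$count_b" ]; then
        echo B                                   # B has the majority
    elif [ "$(lowest_node "$1")" -lt "$(lowest_node "$2")" ]; then
        echo A                                   # tie: A holds the lowest node number
    else
        echo B                                   # tie: B holds the lowest node number
    fi
}

surviving_subcluster "1 2" "3"     # prints A (majority)
surviving_subcluster "2 3" "1 4"   # prints B (tie, node 1 is in B)
```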

Amnesia (Configuration Desynchronization)

Occurs when node-specific configuration copies diverge. Oracle prevents this by storing cluster configuration in the shared OCR (Oracle Cluster Registry). All nodes read/write the same OCR file, ensuring atomic updates.

Backup locations can be inspected via:

ocrconfig -showbackup

Tags: Oracle RAC Clusterware Heartbeat High Availability Jumbo Frames

Posted on Sat, 09 May 2026 14:38:34 +0000 by kinaski