MySQL Architecture Deep Dive: Storage, Indexing, Concurrency, and Query Optimization

1.1 Disk I/O Characteristics

Hard disk performance is fundamentally bounded by mechanical movement. A single I/O operation comprises three phases:

Seek Time: Physical movement of the read/write head to the correct track (typically 3–15 ms).
Rotational Latency: Wait time for the target sector to rotate under the head. For a 7200 RPM drive, average latency is ~4.17 ms.
Transfer Time: Data movement speed, heavily dependent on interface bandwidth (e.g., SATA III). Usually negligible compared to mechanical overhead.

Random I/O performance is best measured by IOPS (Input/Output Operations Per Second), calculated as 1000 / (seek + latency). Sequential workloads prioritize Throughput, which scales with RAID configuration and channel bandwidth.

1.2 Page Layout & Row Formats

InnoDB operates on 16 KB pages as the minimum I/O unit. Each page contains a header, trailer, user records, and page directory slots for fast binary search positioning.

Row Formats:

COMPACT: Stores variable-length field lengths in reverse byte order. Handles overflow by keeping 20 bytes in the main page and offloading the remainder to external overflow pages.
DYNAMIC (Default): Entire overflow fields are moved out of the main page, reducing fragmentation and improving page density.
COMPRESSED: Applies zlib compression to dynamic row data, trading CPU cycles for storage savings.

Each record includes hidden fields: trx_id (6 bytes), roll_pointer (7 bytes), and optionally row_id (6 bytes) if no primary key is defined.

1.3 Extents, Segments & Tablespaces

Pages are grouped into 1 MB extents to optimize sequential reads. Multiple extents form segments (e.g., data segment for leaf nodes, index segment for non-leaf nodes). The highest logical structure is the tablespace, typically stored as an .ibd file per table.

Index Architecture & B+ Tree Mechanics

2.1 Clustered vs. Secondary Indexes

InnoDB stores data in B+ trees. The clustered index holds complete row data in its leaf nodes. Secondary indexes store only the indexed columns plus the primary key, requiring a "bookmark lookup" (回表) to fetch full rows.

A B+ tree with a fan-out of ~1,200 can store over 1.7 billion records at a height of 4. Since the root page usually resides in the buffer pool, querying a billion-row table typically requires only 3 disk accesses.

2.2 Covering Indexes & Index Condition Pushdown (ICP)

When a query's selected columns are fully contained within an index, a covering index eliminates bookmark lookups entirely. MySQL 5.6+ introduced Index Condition Pushdown, allownig the storage engine to filter rows using secondary index columns before fetching the full row, drastically reducing I/O.

2.3 Prefix Indexes

For long strings, indexing a prefix balances storage and selectivity:

ALTER TABLE accounts ADD INDEX idx_email_prefix (email(7));
SELECT id, username FROM accounts WHERE email = 'dev@enterprise.io';

Optimal prefix length is determined by comparing distinct value counts:

SELECT COUNT(DISTINCT LEFT(email, 4)) AS l4,
       COUNT(DISTINCT LEFT(email, 7)) AS l7
FROM accounts;

Caution: Prefix indexes disable covering queries and complicate ORDER BY or GROUP BY operations.

Query Processing & Execution Plans

3.1 Optimization Pipeline

SQL statements pass through the parser (syntax validation), optimizer (cost-based plan selection using IO, CPU, and memory metrics), and executor (engine API calls). The optimizer calculates total cost as:

Cost = IO_Cost + CPU_Cost + Memory_Cost

3.2 EXPLAIN Analysis

Key metrics in execution plans:

type: Access method hierarchy: const > eq_ref > ref > range > index > ALL
key: Actual index utilized.
rows: Estimated rows scanned.
Extra: Operational details like Using index, Using where, Using filesort, or Using temporary.
filtered: Percentage of rows passing the condition after index scan.

3.3 JOIN Execution & Sorting

MySQL uses nested-loop joins. The optimizer selects the table with the fewer matching rows as the driving table. For LEFT JOIN, the left table is fixed as the driver.

Sorting relies on sort_buffer_size. If row length exceeds max_length_for_sort_data, MySQL switches from full-row sorting to row-id sorting, reducing memory pressure but increasing random I/O during final lookups.

Pagination Optimization: Avoid deep offsets. Use keyset pagination:

SELECT id, title FROM articles WHERE created_at > :last_seen ORDER BY created_at ASC LIMIT 50;

Transaction Lifecycle & MVCC

4.1 ACID & Isolation Levels

InnoDB implements isolation through locking and Multi-Version Concurrency Control (MVCC):

READ UNCOMMITTED: Allows dirty reads.
READ COMMITTED (RC): Generates a new ReadView per statement. Solves dirty reads but allows non-repeatable reads.
REPEATABLE READ (RR - Default): Generates one ReadView per transaction. Uses MVCC for snapshot reads and next-key locks for current reads (FOR UPDATE) to prevent phantom reads.
SERIALIZABLE: Forces all reads to acquire shared locks. Severely limits concurrency.

4.2 MVCC Visibility Rules

Each row contains trx_id and roll_pointer. A ReadView captures active transaction IDs at creation time. Visibility logic:

If trx_id == creator_trx_id: Visible (own changes).
If trx_id < min_trx_id: Visible (committed before view).
If trx_id >= max_trx_id: Invisible (started after view).
If min_trx_id <= trx_id < max_trx_id: Check active list. If present, invisible; otherwise, visible.

Updates append new versions to an undo log chain, enabling non-blocking snapshot reads.

Locking Protocols & Concurrency Control

5.1 Lock Granularity & Types

Table-level locks include metadata locks (MDL) and auto-increment locks. Row-level locks operate on index records:

Record Lock: Locks specific index entries.
Gap Lock: Locks ranges between index values to prevent insertions.
Next-Key Lock: Combination of record + gap lock. Default for range scans under RR.

Locks downgrade automatically: equality searches on unique indexes reduce next-key locks to record locks. Scans past the target range reduce them to gap locks.

5.2 Deadlock Prevention

Deadlocks occur when cyclic dependencies form. Mitigation strategies:

Acquire locks in a consistent global order.
Keep transactions short; release high-contention rows last.
Use tryLock() with timeouts instead of indefinite blocking.
Enable innodb_change_buffer_max_size for write-heavy secondary index updates to defer disk I/O.

Replication, Backup & High Availability

6.1 Binlog & Recovery

The binary log records all DDL and DML events at the Server layer. Unlike redo logs (physical, cyclic), binlogs are logical, append-only, and used for point-in-time recovery:

mysqlbinlog --start-position=1024 --stop-position=4096 /var/log/mysql/binlog.000005 | mysql -u root -p

6.2 Master-Slave Architecture

Replication uses three threads: I/O thread (fetches binlog), SQL thread (applies relay log events), and binary log dump thread (on master). Parallel replication (MySQL 5.7+) distributes worker threads per schema or logical group to reduce lag.

Delay Mitigation:

Avoid massive DML operations; batch inserts into chunks of ~200.
Use semi-synchronous replication for critical consistency.
Direct post-write reads to the primary node.

Schema Design & Operational Guidelines

7.1 Data Types & Collation

Prefer fixed-width integers over strings for indexes. For dates, use INT (e.g., 20240115) to save space and speed up comparisons. Choose collations deliberately:

utf8mb4_unicode_ci: Standard, case-insensitive, accurate sorting.
utf8mb4_bin: Byte-level comparison, case-sensitive. Required for unique constraints on case-sensitive identifiers like API keys or order codes.

7.2 Primary Key Strategy

Auto-increment integers provide optimal page insertion order and minimal storage overhead. UUIDs cause random I/O and page splits. If distributed IDs are required, use snowflake algorithms or timestamp-prefixed binary UUIDs to preserve locality.

7.3 Transaction & Security Practices

Keep transaction boundaries minimal. Never mix local DML with synchronous RPC calls inside a single transaction boundary, as network timeouts can roll back successfully committed remote operations. Use idempotent retry mechanisms or outbox patterns for distributed consistency.

Never store plaintext credentials. Implement cryptographic hashing (e.g., bcrypt) or asymmetric encryption for sensitive fields:

public static String secureEncrypt(String plaintext, PublicKey pubKey) throws Exception {
    Cipher cipher = Cipher.getInstance("RSA/ECB/OAEPWithSHA-256AndMGF1Padding");
    cipher.init(Cipher.ENCRYPT_MODE, pubKey);
    byte[] result = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
    return Base64.getEncoder().encodeToString(result);
}

7.4 Index Maintenance

Rebuild tables periodically to defragment pages and update statistics:

ALTER TABLE transactions ENGINE=InnoDB;

Monitor unused or redundant indexes via the sys schema to eliminate write amplification:

SELECT * FROM sys.schema_unused_indexes WHERE table_schema = 'analytics_db';

Tags: MySQL InnoDB database-optimization MVCC indexing

Posted on Tue, 30 Jun 2026 16:25:13 +0000 by mooler

Freaks City