Deep Dive into MySQL Index Architecture and Query Optimization

Index Architecture and Storage Structure

Indexes function as sorted data structures designed to accelerate data retrieval operations such as queries, updates, and sorting. Conceptually, they operate similarly to a library catalog, allowing direct access to specific data locations without scanning the entire dataset. In the absence of an index, a database must perform a full table scan, resulting in significant I/O overhead.

The B+ Tree Structure

MySQL's InnoDB engine utilizes the B+ Tree (Balanced Tree Plus) as its default index structure. In a B+ Tree, non-leaf nodes store only key values and pointers to child nodes, while leaf nodes store the actual data or pointers to data. The leaf nodes are linked via a doubly linked list, facilitating efficient range queries and ordered traversals.

This structure is optimized for disk storage. Since data is stored on disk blocks (pages), the "short and fat" nature of the B+ Tree minimizes the number of disk I/O operations required to locate data. A three-layer B+ Tree can store millions of records, meaning most queries require only three disk I/Os to find the target leaf node.

Comparison with Other Structures

B+ Tree vs. B Tree: B Trees store data in every node. B+ Trees store data only in leaves. This allows B+ Tree non-leaf nodes to hold more keys, reducing the tree height and I/O operations. Furthermore, the linked list structure of B+ Tree leaves makes range scans significantly faster.
B+ Tree vs. Hash Index: Hash indexes provide O(1) complexity for exact matches but cannot support range queries, ordering, or fuzzy lookups. B+ Trees support all these operations effectively.
B+ Tree vs. Binary Search Tree (BST): BSTs can become unbalanced, leading to a tree height equal to the number of nodes in the worst case. A B+ Tree with a degree of roughly 1000 can hold massive amounts of data in just 3 levels, whereas a BST would require 25 levels (and thus 25 I/Os) for the same data volume.

Clustered vs. Non-Clustered Indexes

In InnoDB, indexes are categorized into Clustered Indexes and Secondary (Non-Clustered) Indexes.

Clustered Index (Primary Key): The leaf nodes of a clustered index contain the actual row data. A table can have only one clustered index. Queries using the primary key are extremely fast as they retrieve data directly without additional lookups.
Secondary Index: The leaf nodes of a secondary index contain the primary key value of the corresponding row. To retrieve full row data, the database first searches the secondary index to get the primary key and then performs a "bookmark lookup" (table return) using the clustered index. This double lookup is why secondary index queries are slower than primary key queries.

In contrast, MyISAM uses a non-clustered architecture where data and indexes are stored separately. Both primary and secondary indexes in MyISAM point directly to the data row's physical address.

Joint Indexes and the Leftmost Prefix Principle

A joint (composite) index is built on multiple columns, e.g., INDEX(idx_a, idx_b, idx_c). The index is sorted primarily by the first column, then the second, and so on.

Leftmost Prefix Matching

The index can be used effectively only if the query conditions match the order of the columns from left to right.

-- Valid: Matches the prefix
SELECT * FROM orders WHERE region = 'US' AND status = 'NEW';

-- Invalid: Skips 'region', index fails
SELECT * FROM orders WHERE status = 'NEW';

Range Query Impact

If a query includes a range condition (e.g., >, <, BETWEEN) on a column in the joint index, the index optimization stops at that column. Subsequent columns in the index definition cannot be used for searching.

-- 'region' and 'amount' use the index; 'status' does not.
SELECT * FROM sales WHERE region = 'US' AND amount > 1000 AND status = 'NEW';

Covering Indexes

A covering index occurs when a query's SELECT columns are fully contained within the index. In this scenario, the query engine retrieves the result directly from the index leaf nodes without performing a table lookup (returning to the clustered index).

-- If index exists on (region, status)
SELECT region, status FROM orders WHERE region = 'US';

Index Failure Scenarios

Several common patterns can cause an index to be ignored, leading to full table scans:

Calculations on Columns: Using functions or arithmetic on the indexed column in the WHERE clause invalidates the index.

-- Index fails
SELECT * FROM users WHERE YEAR(create_time) = 2023;
-- Optimized: Index works
SELECT * FROM users WHERE create_time >= '2023-01-01' AND create_time < '2024-01-01';

Type Conversion: Implicit type casting (e.g., comparing a string column to a numeric value) prevents index usage.
Leading Wildcard: LIKE '%abc' scans the entire table. LIKE 'abc%' can use the index.
OR Conditions: If columns connected by OR are not all indexed, the optimizer may choose a full table scan. UNION is often a better alternative.

Optimizing `ORDER BY` and `GROUP BY`

ORDER BY can utilize an index if the sorting order matches the index order. If not, the database performs a filesort operation in memory or on disk.

For GROUP BY, the database often creates an internal temporary table to aggregate data. If the grouping field has an index, the need for a temporary table is eliminated, significantly improving performance.

Pagination Optimization

Standard pagination using LIMIT offset, size performs poorly for large offsets because the database reads and discards all preceding rows.

-- Slow: Scans 100,010 rows
SELECT * FROM logs LIMIT 100000, 10;

Optimization Strategy

Use a subquery to locate the starting ID, then join the table. This leverages the primary key index for positioning.

-- Fast: Uses Primary Key index
SELECT * FROM logs l
INNER JOIN (
    SELECT id FROM logs ORDER BY id LIMIT 100000, 10
) temp ON l.id = temp.id;

JOIN Operations and Driving Tables

In a JOIN, the database selects a Driving Table (outer loop) and a Driven Table (inner loop). The optimizer generally prefers the smaller table as the driving table. To optimize JOINs:

Ensure the join columns on the driven table are indexed. This allows the database to perform a fast lookup (eq_ref or ref) for each row from the driving table.
If both join columns lack indexes, the database resorts to a Block Nested Loop (BNL), buffering the driving table into memory and scanning the driven table repeatedly, which is computationally expensive.

IN vs. EXISTS

IN: The subquery is executed first. Best suited when the subquery result set is small.
EXISTS: The outer query is executed first, and the subquery is run for each row. Best suited when the outer table is small and the inner table (subquery) has an index on the correlation field.

Analyzing Query Execution with `EXPLAIN`

The EXPLAIN command provides insight into query execution plans.

type: Indicates the access method. const (single row match) and eq_ref (unique index join) are ideal. ALL indicates a full table scan and should be avoided.
Extra: Provides additional optimization info. Using index indicates a covering index. Using filesort or Using temporary suggests performance bottlenecks.

Tags: MySQL Database Optimization SQL Tuning B+ Tree Backend Engineering

Posted on Thu, 21 May 2026 19:57:24 +0000 by johnnyk

Freaks City

Deep Dive into MySQL Index Architecture and Query Optimization

Index Architecture and Storage Structure

The B+ Tree Structure

Comparison with Other Structures

Clustered vs. Non-Clustered Indexes

Joint Indexes and the Leftmost Prefix Principle

Leftmost Prefix Matching

Range Query Impact

Covering Indexes

Index Failure Scenarios

Optimizing `ORDER BY` and `GROUP BY`

Pagination Optimization

Optimization Strategy

JOIN Operations and Driving Tables

IN vs. EXISTS

Analyzing Query Execution with `EXPLAIN`

Hot Tags

Freaks City

Deep Dive into MySQL Index Architecture and Query Optimization

Index Architecture and Storage Structure

The B+ Tree Structure

Comparison with Other Structures

Clustered vs. Non-Clustered Indexes

Joint Indexes and the Leftmost Prefix Principle

Leftmost Prefix Matching

Range Query Impact

Covering Indexes

Index Failure Scenarios

Optimizing ORDER BY and GROUP BY

Pagination Optimization

Optimization Strategy

JOIN Operations and Driving Tables

IN vs. EXISTS

Analyzing Query Execution with EXPLAIN

Hot Tags

Optimizing `ORDER BY` and `GROUP BY`

Analyzing Query Execution with `EXPLAIN`