Index Architecture and Storage Structure
Indexes function as sorted data structures designed to accelerate data retrieval operations such as queries, updates, and sorting. Conceptually, they operate similarly to a library catalog, allowing direct access to specific data locations without scanning the entire dataset. In the absence of an index, a database must perform a full table scan, resulting in significant I/O overhead.
The B+ Tree Structure
MySQL's InnoDB engine utilizes the B+ Tree (Balanced Tree Plus) as its default index structure. In a B+ Tree, non-leaf nodes store only key values and pointers to child nodes, while leaf nodes store the actual data or pointers to data. The leaf nodes are linked via a doubly linked list, facilitating efficient range queries and ordered traversals.
This structure is optimized for disk storage. Since data is stored on disk blocks (pages), the "short and fat" nature of the B+ Tree minimizes the number of disk I/O operations required to locate data. A three-layer B+ Tree can store millions of records, meaning most queries require only three disk I/Os to find the target leaf node.
Comparison with Other Structures
- B+ Tree vs. B Tree: B Trees store data in every node. B+ Trees store data only in leaves. This allows B+ Tree non-leaf nodes to hold more keys, reducing the tree height and I/O operations. Furthermore, the linked list structure of B+ Tree leaves makes range scans significantly faster.
- B+ Tree vs. Hash Index: Hash indexes provide O(1) complexity for exact matches but cannot support range queries, ordering, or fuzzy lookups. B+ Trees support all these operations effectively.
- B+ Tree vs. Binary Search Tree (BST): BSTs can become unbalanced, leading to a tree height equal to the number of nodes in the worst case. A B+ Tree with a degree of roughly 1000 can hold massive amounts of data in just 3 levels, whereas a BST would require 25 levels (and thus 25 I/Os) for the same data volume.
Clustered vs. Non-Clustered Indexes
In InnoDB, indexes are categorized into Clustered Indexes and Secondary (Non-Clustered) Indexes.
- Clustered Index (Primary Key): The leaf nodes of a clustered index contain the actual row data. A table can have only one clustered index. Queries using the primary key are extremely fast as they retrieve data directly without additional lookups.
- Secondary Index: The leaf nodes of a secondary index contain the primary key value of the corresponding row. To retrieve full row data, the database first searches the secondary index to get the primary key and then performs a "bookmark lookup" (table return) using the clustered index. This double lookup is why secondary index queries are slower than primary key queries.
In contrast, MyISAM uses a non-clustered architecture where data and indexes are stored separately. Both primary and secondary indexes in MyISAM point directly to the data row's physical address.
Joint Indexes and the Leftmost Prefix Principle
A joint (composite) index is built on multiple columns, e.g., INDEX(idx_a, idx_b, idx_c). The index is sorted primarily by the first column, then the second, and so on.
Leftmost Prefix Matching
The index can be used effectively only if the query conditions match the order of the columns from left to right.
-- Valid: Matches the prefix
SELECT * FROM orders WHERE region = 'US' AND status = 'NEW';
-- Invalid: Skips 'region', index fails
SELECT * FROM orders WHERE status = 'NEW';
Range Query Impact
If a query includes a range condition (e.g., >, <, BETWEEN) on a column in the joint index, the index optimization stops at that column. Subsequent columns in the index definition cannot be used for searching.
-- 'region' and 'amount' use the index; 'status' does not.
SELECT * FROM sales WHERE region = 'US' AND amount > 1000 AND status = 'NEW';
Covering Indexes
A covering index occurs when a query's SELECT columns are fully contained within the index. In this scenario, the query engine retrieves the result directly from the index leaf nodes without performing a table lookup (returning to the clustered index).
-- If index exists on (region, status)
SELECT region, status FROM orders WHERE region = 'US';
Index Failure Scenarios
Several common patterns can cause an index to be ignored, leading to full table scans:
- Calculations on Columns: Using functions or arithmetic on the indexed column in the
WHEREclause invalidates the index.-- Index fails SELECT * FROM users WHERE YEAR(create_time) = 2023; -- Optimized: Index works SELECT * FROM users WHERE create_time >= '2023-01-01' AND create_time < '2024-01-01'; - Type Conversion: Implicit type casting (e.g., comparing a string column to a numeric value) prevents index usage.
- Leading Wildcard:
LIKE '%abc'scans the entire table.LIKE 'abc%'can use the index. - OR Conditions: If columns connected by
ORare not all indexed, the optimizer may choose a full table scan.UNIONis often a better alternative.
Optimizing ORDER BY and GROUP BY
ORDER BY can utilize an index if the sorting order matches the index order. If not, the database performs a filesort operation in memory or on disk.
For GROUP BY, the database often creates an internal temporary table to aggregate data. If the grouping field has an index, the need for a temporary table is eliminated, significantly improving performance.
Pagination Optimization
Standard pagination using LIMIT offset, size performs poorly for large offsets because the database reads and discards all preceding rows.
-- Slow: Scans 100,010 rows
SELECT * FROM logs LIMIT 100000, 10;
Optimization Strategy
Use a subquery to locate the starting ID, then join the table. This leverages the primary key index for positioning.
-- Fast: Uses Primary Key index
SELECT * FROM logs l
INNER JOIN (
SELECT id FROM logs ORDER BY id LIMIT 100000, 10
) temp ON l.id = temp.id;
JOIN Operations and Driving Tables
In a JOIN, the database selects a Driving Table (outer loop) and a Driven Table (inner loop). The optimizer generally prefers the smaller table as the driving table. To optimize JOINs:
- Ensure the join columns on the driven table are indexed. This allows the database to perform a fast lookup (eq_ref or ref) for each row from the driving table.
- If both join columns lack indexes, the database resorts to a Block Nested Loop (BNL), buffering the driving table into memory and scanning the driven table repeatedly, which is computationally expensive.
IN vs. EXISTS
- IN: The subquery is executed first. Best suited when the subquery result set is small.
- EXISTS: The outer query is executed first, and the subquery is run for each row. Best suited when the outer table is small and the inner table (subquery) has an index on the correlation field.
Analyzing Query Execution with EXPLAIN
The EXPLAIN command provides insight into query execution plans.
- type: Indicates the access method.
const(single row match) andeq_ref(unique index join) are ideal.ALLindicates a full table scan and should be avoided. - Extra: Provides additional optimization info.
Using indexindicates a covering index.Using filesortorUsing temporarysuggests performance bottlenecks.