MySQL Indexing: Data Structures, Types, and Optimization Techniques

Indexes in MySQL are sorted data structures, typically B+ trees, designed to expedite data retrieval. By organizing data in a sorted manner, MySQL can efficiently locate records using a binary search-like approach, significantly reducing the need for full table scans.

Index Data Structures

Binary Search Tree (BST)

A basic BST organizes data such that the left child is always less than the parent, and the right child is greater. While intuitive, BSTs can degenerate into a linked list if data is inserted in a sorted or nearly sorted order (e.g., auto-incrementing IDs). This leads to poor performance, requiring multiple disk reads to find a specific record.

Red-Black Tree

Red-Black trees are self-balancing binary search trees. They automatically adjust their structure upon insertion or deletion to maintain a balanced state, preventing the degeneration seen in simple BSTs. However, for very large datasets, the height of the tree can still become significant, leading to an increased number of disk I/O operations if the index resides at a deep leaf node.

B-Tree

B-trees are an improvement over binary trees. Key characteristics include:

  • Each node can store multiple keys and pointers.
  • Keys within a node are sorted.
  • All leaf nodes are at the same depth.

While B-trees reduce tree height compared to binary trees, each node stores data. In the context of databases, minimizing disk I/O is paramount. B+ trees offer a more optimized approach by separating data storage.

B+ Tree

The B+ tree is the most commonly used index structure in databases like MySQL. It offers several advantages:

  • Non-leaf nodes only store keys: This allows for more keys per node, reducing the tree's height and thus disk I/O.
  • Leaf nodes contain all data: All actual data records are stored in the leaf nodes.
  • Sequential linking of leaf nodes: Leaf nodes are linked together, facilitating efficient range queries.

This structure optimizes for disk reads by keeping the tree shallow and by placing all necessary data at the leaf level, accessible via sequential traversal.

Index Types

Clustered Index

A clustered index determines the physical order of data in the table. The table rows are stored and sorted based on the clustered index key. Typically, the primary key is implemented as a clustered index. Each table can only have one clustered index.

Non-Clustered Index (Secondary Index)

A non-clustered index contains a copy of the indexed columns and a pointer to the actual data row. The data rows themselves are not stored in the order of the non-clustered index. A table can have multiple non-clustered indexes.

Covering Index

A covering index occurs when all the columns required for a query are present with in the index itself. This means MySQL does not need to perform an additional lookup (a "back-to-table" or "bookmark lookup") to retrieve the requested data, leading to significant performance improvements. For example, if an index exists on (name, age), a query selecting only name and age can be satisfied entirely by the index.

Leftmost Prefix Rule

For composite (multi-column) indexes, the order of columns is crucial. The "leftmost prefix" rule states that MySQL can utilize the index efficiently only if the query conditions include the leftmost columns of the composite index in order. Skipping a column on the left invalidates the index for subsequent columns in that specific query.

Range Query on First Column: If the first column of a composite index is used in a range query (e.g., name > 'a'), MySQL might not use the index for subsequent columns. This is because a wide range on the first column could result in a large number of rows, making further index filtering less efficient than a full table scan in some scenarios. However, this behaviorr can be mitigated by ensuring the query also benefits from a covering index.

EXPLAIN Statement

The EXPLAIN statement is a powerful tool for analyzing the execution plan of SQL queries. By prefixing a SELECT statement with EXPLAIN, MySQL provides detailed information about how it intends to execute the query, highlighting potential performance bottlenecks.

EXPLAIN Output Columns

  • id: The sequence number of the query. Higher numbers typically execute first.

  • select_type: The type of SELECT query (e.g., SIMPLE, PRIMARY, SUBQUERY, DERIVED).

  • type: The join type or access method used to retrieve rows. Performance ranking (best to worst): SYSTEM, CONST, EQ_REF, REF, RANGE, INDEX, ALL.

  • SYSTEM/CONST: Query returns at most one row (e.g., using primary key on a constant value).

  • EQ_REF: A join where each row from the preceding table is matched by exactly one row in the current table, typically using a primary or unique key.

  • REF: A join where rows from the preceding table are matched by multiple rows in the current table using a non-unique index or a unique index where duplicates are allowed.

  • RANGE: Index used to retrieve rows within a specific range (e.g., >, <, BETWEEN).

  • INDEX: Full index scan. All rows in the index are scanned, but it might be faster than ALL if it's a covering index.

  • ALL: Full table scan. All rows in the table are examined.

  • possible_keys: Indexes that MySQL could potentially use.

  • key: The actual index chosen by MySQL.

  • key_len: The length of the chosen key in bytes. Useful for determining which parts of a composite index are used.

  • ref: Shows which columns or constants are compared to the index named in the 'key' column.

  • rows: An estimate of the number of rows MySQL must examine to execute the query.

  • filtered: An estimated percentage of rows filtered by the table condition (WHERE clause). Calculated as rows * filtered / 100.

  • Extra: Provides additional information about how MySQL resolves the query. Key values include:

  • Using index: Indicates a covering index was used, and the entire row was retrieved from the index.

  • Using where: Indicates that the WHERE clause is used to filter rows after they have been retrieved from storage. This can happen with or without an index.

  • Using index condition: Introduced in MySQL 5.6, this refers to the Index Condition Pushdown optimization. Part of the WHERE clause is evaluated at the storage engine level using the index, reducing the number of rows passed to the MySQL server.

  • Using filesort: MySQL must do an extra pass to sort the rows because the order is not determined by the index.

  • Using join buffer: When a join condition doesn't use an index, MySQL uses a join buffer to store rows from one table for later comparison with another table.

Index Pushdown

Index Pushdown is an optimization where MySQL can evaluate parts of the WHERE clause at the storage engine level using the index, rather than retrieving all rows from the index and then filtering them at the server level. This is particularly useful for composite indexes where conditions apply to multiple indexed columns. For example, with an index on (age, name), a query like WHERE age > 10 AND name = 'zhangsan' can use index pushdown: the storage engine finds entries where age > 10 and then filters those results further by checking if name = 'zhangsan' within the index itself before returning rows to the server.

Index Invalidation Scenarios

  • Leading wildcard in LIKE: Queries like LIKE '%value%' cannot use standard B-tree indexes effectively.
  • Violation of the leftmost prefix rule: Skipping columns in a composite index.
  • Operations on indexed columns: Applying functions, calculations, or type conversions to indexed columns in the WHERE clause prevents index usage.
  • Range query on the first column of a composite index: As mentioned, this can sometimes lead to index inefficiency for subsequent columns.
  • LIKE 'prefix%': Queries like LIKE 'prefix%' generally leverage indexes. Endex condition pushdown can further optimize composite indexes with such conditions.

Optimizing ORDER BY and GROUP BY

Indexes can significantly speed up sorting and grouping operations if the indexed columns match the ORDER BY or GROUP BY clauses, following the leftmost prefix rule. If the sorting order required by the query cannot be satisfied by an index, MySQL performs an explicit sort operation, resulting in Using filesort in the EXPLAIN output.

  • Case 1: ORDER BY age on (name, age, position) index works because 'name' and 'age' form a prefix.
  • Case 2: ORDER BY position on (name, age, position) index results in Using filesort because 'position' is not the leftmost column.
  • Case 3: ORDER BY age, position on (name, age, position) index works because 'name', 'age', and 'position' are all part of the index in the correct order.
  • Case 4: ORDER BY position, age on (name, age, position) index results in Using filesort because the order of 'position' and 'age' is reversed relative to the index definition.

Common Optimization Scenarios

Pagination

Standard LIMIT offset, count queries can be inefficient for large offsets because MySQL still has to retrieve and discard offset rows. Two common optimization strategies are:

  1. Using a primary key range: If the primary key is a sequential, non-gapped integer, queries like WHERE id > last_seen_id LIMIT count are very efficient as they scan a small portion of the index.

  2. Using a covering index for subqueries: A subquery can select just the IDs using a covering index, and then the outer query joins back to the main table using these IDs. This avoids fetching unnecessary columns in the subquery. ```

         SELECT * FROM employees e1
         INNER JOIN (SELECT id FROM employees ORDER BY id LIMIT 1000, 10) e2 ON e1.id = e2.id;
    
    
    

Join Queries

MySQL uses different join algorithms: Nested Loop Join (NLJ) and Block Nested Loop Join (BNLJ).

  • Nested Loop Join (NLJ): Used when the join condition on the *driven* table has an index. For each row in the outer (driver) table, MySQL searches the inner (driven) table using the index. Disk reads: Driver rows + (Driven rows * Index lookups).
  • Block Nested Loop Join (BNLJ): Used when the join condition on the driven table lacks an index. The driving table's rows are buffered in memory (join buffer), and then the entire driven table is scanned, comparing each row against the buffer. Disk reads: Driver rows + Driven rows (if buffer is full) or Driver rows * Driven rows (if buffer is small and fills up multiple times).

Driver vs. Driven Table: In an INNER JOIN, MySQL usually picks the smaller result set as the driver. In LEFT JOIN, the left table is the driver; in RIGHT JOIN, the right table is the driver. The STRAIGHT_JOIN keyword forces MySQL to use the tables in the specified order.

Example: Joining table t1 (10,000 rows) with t2 (100 rows) on t2.a = t1.a (where t1.a has an index):

  • If t1 is the driven table, it uses NLJ. MySQL reads 100 rows from t2 and performs 100 index lookups on t1.a. Total disk reads ≈ 100 (t2) + 100 * (index reads for t1) ≈ moderate.
  • If t1 is the driven table and t1.a has no index, it uses BNLJ. MySQL buffers 100 rows from t2 into a join buffer. Then it scans 10,000 rows from t1, comparing each against the buffer. Total disk reads ≈ 100 (t2) + 10000 (t1). This is generally better than a full NLJ without an index on the driven table, which would be 100 * 10000 = 1,000,000 disk reads.

Index Design Principles

  1. Code First, Index Later: Design indexes after the main business logic and critical SQL queries are developed and analyzed, rather than immediately upon table creation.
  2. Composite Indexes for Coverage: Aim to create composite indexes that cover fields used in WHERE, ORDER BY, and GROUP BY clauses, respecting the leftmost prefix rule. Minimize the use of single-column indexes.
  3. Avoid Low Cardinality Fields: Do not create indexes on columns with very few distinct values (low cardinality, e.g., gender). Such indexes offer little benefit and might be slower than a full table scan. Prefer columns with high cardinality (many distinct values).
  4. Prefix Indexes for Long Strings: For long string columns (e.g., VARCHAR(255)), consider using prefix indexes (e.g., KEY(column_name(20))) to save space and improve performance for equality searches. However, prefix indexes may not be effective for ORDER BY or GROUP BY operations on that column.
  5. Prioritize WHERE over ORDER BY for Conflicts: When index design conflicts arise between WHERE and ORDER BY clauses, prioritize optimizing the WHERE clause for fast data filtering. Sorting the already filtered, smaller dataset is generally more efficient.

Tags: MySQL database indexing B+Tree SQL Optimization

Posted on Fri, 15 May 2026 13:53:37 +0000 by pieai