Decoupling Computation and Scheduling in AI Frameworks for Hardware Optimization

In the previous discussion, we introduced what operators are. Neural network models consist of a large number of operators, but how do these operators execute? And how does the algorithmic logic within an operator interact with the scheduling of specific hardware instruction code?

Computation and Scheduling

The Origin of Computation and Scheduling

Image processing is a fundamental and computationally expensive application in today's physical world. Practical image processing algorithms require efficient implementations, especially on power-constrained mobile devices. As algorithms and hardware architectures continuously evolve, it becomes increasingly difficult for developers to write efficient image processing code. This challenge cannot be resolved merely by programming in languages like C or C++; the performance difference between naive C code and highly optimized C implementations often reaches orders of magnitude.

Furthermore, efficient execution on modern hardware demands complex global transformations of computation and data structures. Such optimizations usually come at the cost of programmer effort and code complexity, as computations must be reorganized to effectively exploit memory hierarchies and parallel execution hardware.

The root of this challenge lies in the fact that traditional programming languages conflate the definition of an image processing algorithm with the organization of its computation and data on the underlying machine. This makes it difficult to write algorithms, compose them into larger applications, or organize them for efficient execution on a given machine.

Consider Halide as an example. Halide's unique advantage is the separation of algorithm definition from the organization of computation. As a powerful programming tool, it makes image processing programs simpler while achieving performance that is orders of magnitude faster than previous manual tuning methods. Its benefit is that tuning becomes easy regardless of the processor type. Moreover, it makes code easier to compose and modify, which is notoriously difficult in traditional implementation approaches.

Halide is a Domain-Specific Language (DSL) for image processing that uses C++ as its host language. Its primary role is to achieve low-level acceleration of algorithms at the software-hardware interface, independent of the algorithm's high-level design.

In Halide, "computation" defines how the output image pixel values are generated, encompassing simple pixel-level operations, complex algorithmic expressions, and various transformation and filtering operations. Halide provides a rich set of schedulers to help users optimize their computation graph, including techniques such as parallelization, vectorization, and memory layout optimization, giving users flexible control over the execution of computation. By decoupling computation from its implementation, Halide allows for more efficient design of the algorithm's specific execution process, enabling users to focus on low-level acceleration.

The Meaning of Computation and Scheduling

There is a well-known story about Gauss deriving the summation formula. A teacher asked students to calculate the sum of numbers from 1 to 100. Following the naive approach means adding each number sequentially. However, Gauss quickly found a simpler method. He observed that when these numbers are paired symmetrically (e.g., 1 and 100, 2 and 99, 3 and 98, etc.), the sum of each pair is 101. Since there are 50 such pairs, he obtained the sum simply by multiplying 50 by 101 to get 5050.

Similar "shortcuts" exist in computer computation. A computation can be executed step by step in its most primitive form, or it can be accelerated by leveraging specialized hardware components such as dedicated storage or compute units. This forms a one-to-many mapping: a single computation can have multiple different implementations, each with its own strengths and weaknesses depending on the scenario, input, machine, and parameters. There is no single optimal implementation covering all cases. Against this backdrop, two concepts are separated:

Computation: Describes the specific logic for implementing the algorithm without concern for the concrete code implementation.
Scheduling: The process of optimizing and controlling the computation. Through scheduling, one can specify execution order, memory layout, and parallelization strategies to optimize computational performence.

In neural networks, deep learning algorithms are composed of individual computational units called Operators (Ops). An operator is a mapping from one function space to another, denoted as ( O: X \to Y ). Broadly speaking, any operation on a function can be considered an operator. For AI frameworks, the developed operators are the computational functions involved in the network model.

Matrix multiplication is one of the most common operators in neural networks. The formula is:

[ C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj} ]

Its most straightforward implementation is shown below:

void naiveMatMul(int A[][128], int B[][128], int C[][128], int dimension) {
    for (int row = 0; row < dimension; ++row) {
        for (int col = 0; col < dimension; ++col) {
            C[row][col] = 0;
            for (int k = 0; k < dimension; ++k) {
                C[row][col] += A[row][k] * B[k][col];
            }
        }
    }
}

We can optimize it using loop tiling:

void tiledMatMul(int A[][128], int B[][128], int C[][128], int dimension, int block) {
    for (int i_out = 0; i_out < dimension; i_out += block) {
        for (int j_out = 0; j_out < dimension; j_out += block) {
            for (int k_out = 0; k_out < dimension; k_out += block) {
                for (int i_in = i_out; i_in < i_out + block; ++i_in) {
                    for (int j_in = j_out; j_in < j_out + block; ++j_in) {
                        int accumulator = 0;
                        for (int k_in = k_out; k_in < k_out + block; ++k_in) {
                            accumulator += A[i_in][k_in] * B[k_in][j_in];
                        }
                        C[i_in][j_in] += accumulator;
                    }
                }
            }
        }
    }
}

Alternatively, we can use vectorization:

#include <immintrin.h>

void vectorizedMatMul(int A[][128], int B[][128], int C[][128], int dimension) {
    for (int row = 0; row < dimension; ++row) {
        for (int col = 0; col < dimension; col += 4) {
            __m128i row_broadcast = _mm_set1_epi32(A[row][col]);
            for (int k = 0; k < dimension; ++k) {
                __m128i b_vec = _mm_loadu_si128((__m128i*)&B[k][col]);
                __m128i prod = _mm_mullo_epi32(row_broadcast, b_vec);
                __m128i c_vec = _mm_loadu_si128((__m128i*)&C[row][col]);
                __m128i updated = _mm_add_epi32(c_vec, prod);
                _mm_storeu_si128((__m128i*)&C[row][col], updated);
            }
        }
    }
}

We can employ even more optimization techniques for matrix multiplication, or combine them. All three operations above have the same algorithmic function, but their speeds differ. This difference is strongly tied to hardware design. Computers incorporate many special features to accelerate computation, such as memory hierarchies, vector accelerators, and multiple cores. When we fully exploit these hardware characteristics, we can dramatically increase program execution speed, with optimized implementations running tens or even hundreds of times faster than the naive version.

The collection of all possible scheduling approaches for an operator's execution is called the scheduling space. The goal of AI compiler optimization is to select the best schedule for an operator so that its runtime on specific hardware reaches an optimal level. This optimization involves a thorough search and analysis of the operator's scheduling space to determine the most suitable schedule for the current hardware architecture, aiming to maximize hardware utilization, improve operator execution efficiency, and ultimately achieve high-performance execution of the overall computational task.

Fundamentals of the Schedule Tree

When constructing an operator's scheduling space, we must first identify which optimization techniques are available. Taking Halide again as an example, common optimizations include Reorder, Split, Fuse, Tile, Vectorize, Unroll, and Parallelize. AI compilers like TVM, guided by the Halide philosophy, inherit these optimization approaches:

Reorder: Rearranges the order of computation loops. This can change dependency relationships and help improve cache hit rates and reduce memory access latency.
Split: Divides a single loop dimension into multiple smaller dimensions, which aids parallelization, vectorization, and optimization of memory access patterns.
Fuse: Merges multiple computation loops to reduce memory access and data transfer overhead, improving computation locality and reducing scheduling overhead.
Tile: Partitions computation into smaller blocks, facilitating parallelism, vectorization, and improved cache utilization.
Vectorize: Packs multiple data elements into vector operations to fully utilize SIMD instruction sets, improving the efficiency of compute-intensive tasks.
Unroll: Expands loop bodies to reduce loop control overhead, decrease branch misprediction, and increase instruction-level parallelism.
Parallelize: Decomposes a computational task into multiple concurrently executing subtasks to leverage multi-core processors or parallel processing units, increasing overall computation speed.

For operators in neural networks, their computational form is generally quite regular, consisting of deeply nested loops with few complex control flows, and inputs are primarily multi-dimensional tensors. After analyzing the computation characteristics, we analyze the elements of scheduling. A computation first requires memory allocation to hold inputs, then proceeds with multi-level nested loop computation, producing a final result that is stored back to the output location.

// src is the input original image, dst is the output blurred image
void box_blur_3x3(const Mat &src, Mat &dst) {
    Mat intermediate(src.size(), src.type());  // Storage

    for (int x = 1; x < src.cols - 1; ++x)
        for (int y = 0; y < src.rows; ++y)   // Loop
            intermediate.at<uint8_t>(y, x) = static_cast<uint8_t>(
                (src.at<uint8_t>(y, x - 1) + src.at<uint8_t>(y, x) + src.at<uint8_t>(y, x + 1)) / 3);  // Computation

    for (int x = 0; x < src.cols; ++x)
        for (int y = 1; y < src.rows - 1; ++y) // Loop
            dst.at<uint8_t>(y, x) = static_cast<uint8_t>(
                (intermediate.at<uint8_t>(y - 1, x) + intermediate.at<uint8_t>(y, x) + intermediate.at<uint8_t>(y + 1, x)) / 3);  // Computation
}

Based on these scheduling elements, we can abstract them into a tree structure called a Schedule Tree:

Loop Node: Represents how a function is traversed and computed along a given dimension. A loop node is associated with a function and a variable (dimension). The loop node also contains information about weather the loop runs sequentially, in parallel, or is vectorized.
Storage Node: Represents the storage allocated for intermediate results to be used later.
Compute Node: The leaves of the schedule tree, representing the computation being performed. A compute node can have other compute nodes as children to represent inlined functions rather than loading from intermediate storage.

A schedule tree must satisfy several constraints to be legal:

Functions must be computed before they are used: In a depth-first traversal of the schedule tree, a function's compute node must appear before the compute nodes of functions that call it.
Storage must be allocated and within the scope where it is to be used: The storage node for a function must be an ancestor of its own compute node and the compute nodes of its callers.
Some patterns are illegal due to practical code generation limitations. Specifically, vectorization is only allowed on the innermost loop (a loop node that is not an ancestor of any other loop node), and only loops with a definite, known width can be vectorized or unrolled.

For any operator, we can define its default schedule. It traverses all output in row-majer order and inlines all function calls, as illustrated below:

Default schedule tree structure with root at top, containing a single compute node that directly represents the function result.

We can map the schedule tree back to the original program:

Mapping of loop and compute nodes in the tree to corresponding nested for-loops and code blocks in the program.

Given a schedule tree, the corresponding code can be generated through a depth-first recursive traversal:

If visiting a loop node, begin the corresponding loop with appropriate attributes (parallel, vectorized, unrolled), recursively visit each child node in order, and then close the loop.
If visiting a storage node, allocate the corresponding storage, recursively visit each child node, and then deallocate the storage.
If visiting a compute node, compute the value of the corresponding function at the position defined by its enclosing loops' domain, and store this value into its allocated storage.

This demonstrates the benefit of separating computation and scheduling. For a single computation, multiple schedule trees can generate programs with different performance characteristics. As long as a schedule tree is legal, we can improve program performance while ensuring correct results.

Schedule Transformation Techniques

Halide Schedule Transformations

Many optimization techniques can be applied, reflected as transformations on the schedule tree. Halide provides a clean, encapsulated API for this. Consider the original code:

Var x("x"), y("y");
Func gradient("gradient");
gradient(x, y) = x + y;
// 'realize' compiles and executes the operation defined above
Buffer<int> output = gradient.realize(4, 4);

This translates to the following C++ pseudocode:

for (int y = 0; y < 4; ++y) {
    for (int x = 0; x < 4; ++x) {
        printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
    }
}

Now, let's apply some scheduling transformations. For instance, Fuse:

Var fused;
gradient.fuse(x, y, fused);

// Corresponding C++ code
for (int fused = 0; fused < 4 * 4; ++fused) {
    int y = fused / 4;
    int x = fused % 4;
    printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
}

In the schedule tree, this transformation merges two adjacent loop nodes of the same function into a single loop node. The new node retains the original outer loop node's position in the tree, and the children from both original nodes are concatenated, with the outer variable's children preceding the inner variable's children.

Schedule tree before and after fuse operation, showing two adjacent loop nodes being replaced by one combined fused loop node.

In the next step, we first tile the x and y dimensions with a factor of 4. Then we fuse the outer loops of x and y (2+2=4), and finally parallelize the resulting fused loop.

Var x_outer, y_outer, x_inner, y_inner, tile_index;
gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4);
gradient.fuse(x_outer, y_outer, tile_index);
gradient.parallel(tile_index);

// Corresponding C++ code
// This outermost loop should be a parallel for loop.
for (int tile_index = 0; tile_index < 4; ++tile_index) {
    int y_outer = tile_index / 2;
    int x_outer = tile_index % 2;
    for (int y_inner = 0; y_inner < 4; ++y_inner) {
        for (int x_inner = 0; x_inner < 4; ++x_inner) {
            int y = y_outer * 4 + y_inner;
            int x = x_outer * 4 + x_inner;
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
    }
}

Using Parallel in a schedule tree changes the loop type attribute to parallelized; similarly, for sequential, vectorized, or unrolled loops, only the corresponding loop node's attribute is modified.

Schedule tree with a loop node's attribute set to 'parallel', illustrating how the execution model for that loop shifts.

A complete operator implemented in Halide follows this style:

Func blur_3x3(Func input) {
  Func blur_x, blur_y;
  Var x, y, xi, yi;

  // The algorithm - no storage or order
  blur_x(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y))/3;
  blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1))/3;

  // The schedule - defines order, locality; implies storage
  blur_y.tile(x, y, xi, yi, 256, 32)
        .vectorize(xi, 8).parallel(y);
  blur_x.compute_at(blur_y, x).vectorize(x, 8);

  return blur_y;
}

Other Schedule Optimizations

Beyond these, other optimization types can also apply corresponding transformations to the schedule tree.

A valid schedule tree defines a possible scheduling space for the algorithm, and transformations are operators mapping points within this space. Different schedule trees correspond to different program implementations with varying performance. How can we obtain an optimal schedule tree?

The simplest approach would be to determine the optimal schedule tree through static analysis. Once loop sizes are known, we have enough information to determine key execution characteristics of a schedule tree, such as how much memory it will allocate and how many operations it will execute. The cost of a schedule is the weighted sum of these data points.

Illustration of a static cost model weighing factors like compute operations, memory allocation, and loop overhead to estimate total schedule cost.

However, this method is overly simplistic and naive. First, we cannot precisely determine the cost of every operation; only rough estimates are possible. Second, these operations are interdependent and not independent, meaning costs are dynamic. The total cost is also not a simple linear superposition. In short, finding an optimal schedule tree is an extremely complex process.

Currently, mainstream methods like those used in TVM adopt an automated tuning approach. They combine available optimization techniques to generate a very large scheduling space. Then, using search strategies such as heuristic algorithms or machine learning models, they explore this space by either running the code or predicting its performance via models. Based on actual performance feedback, the exploration of the schedule space is guided, ultimately selecting a relatively optimal schedule within a given time budget.

Diagram of an automated tuning loop: a search algorithm queries a cost model or real hardware to evaluate candidate schedule trees and iteratively refines the search.

Tags: AI Compiler Schedule Tree Halide Operator Optimization TVM

Posted on Sat, 27 Jun 2026 16:32:31 +0000 by BlaineSch

Freaks City