Parallel Matrix Multiplication in Julia

In this notebook, we will:

  • Parallelize a simple algorithm
  • Learn about the performance of different parallel strategies
  • Implement the parallelization in Julia

Problem Description

Asssumptions

  • All matrices, including A, B, and C, are initially stored on the master process.
  • The final result will overwrite the matrix C.

Steps

To implement parallelization, we will follow these steps:

  • Identify the parts of the sequential algorithm that can be parallelized.
  • Consider different parallelization strategies.
  • Discuss the performance of these implementations.

Serial Implementation

We start with the most basic serial implemantation, based on the mathematical definition of matrix multiplication:

@everywhere function matmul_seq!(C, A, B)
    m = size(C, 1)
    n = size(C, 2)
    l = size(A, 2)
    @assert size(A, 1) == m
    @assert size(B, 2) == n
    @assert size(B, 1) == l
    z = zero(eltype(C))
    for j in 1:n
        for i in 1:m
            Cij = z
            for k in 1:l
                @inbounds Cij += A[i, k] * B[k, j]
            end
            C[i, j] = Cij
        end
    end
    C
end

Note: As mentioned above, implementing simple matrix multiplication with three nested loops is inefficient (memory-bound). Libraries like BLAS implement more efficient implementations, which are used in practice (e.g., via the "*" operator in Julia). We consider our hand-written implementation a simple way to express the algorithm we are interested in.

Running the code below to compare our hand-written function with the built-in mul! function, we can see that the built-in function performs much better.

The answer is obviously O(N^3), as there are three nested loops for traversal, two to traverse C and fill its elements, and the innermost to calculate.

How can we implement parallelization?

Observing the nested loops, we can see that:

  • The i and j loops can be parallelized.
  • The k loop can also be parallelized, but it requires reduction.

Parallel Algorithms

All entries of matrix C can be computed in parallel, but is it the most efficient solution to solve all these entries in parallel in a distributed system? To find the answer, we will consider different parallel strategies:

  • Algorithm 1: Each worker computes a single entry of C.
  • Algorithm 2: Each worker computes a single row of C.
  • Algorithm 3: Each worker computes N/P rows of C.

Parallel Algorithm 1

Data Dependencies

Moving data over the network is expensive, and minimizing data movement is one of the key points in distributed computing. Therefore, we determine the minimum data that a worker needs to perform the computation.

In Algorithm 1, each worker only computes a single entry of the result matrix C.

The answer is obviously B, we need a row of A and a column of B.

Implementation

Considering the data dependencies, we can effectively implement parallel Algorithm 1 from the worker's perspective as follows:

  1. The worker receives the corresponding row A[i, :] and column B[:, j] from the master.
  2. The worker computes the dot product of the row and column.
  3. The worker sends the result C[i, j] back to the master process.

The implementation of this algorithm in Julia might look like this:

function matmul_dist_1!(C, A, B)
    m = size(C, 1)
    n = size(C, 2)
    l = size(A, 2)
    @assert size(A, 1) == m
    @assert size(B, 2) == n
    @assert size(B, 1) == l
    z = zero(eltype(C))
    @assert nworkers() == m * n
    iw = 0
    @sync for j in 1:n
        for i in 1:m
            Ai = A[i, :]
            Bj = B[:, j]
            iw += 1
            w = workers()[iw]
            ftr = @spawnat w begin
                Cij = z
                for k in 1:l
                    @inbounds Cij += Ai[k] * Bj[k]
                end
                Cij
            end
            @async C[i, j] = fetch(ftr)
        end
    end
    C
end

Performance

Now we need to study the performance of the algorithm. For this, we will analyze whether Algorithm 1 can achieve optimal speedup. The parallel speedup on P accelerators is defined as:

Experimental speedup

Next, we measure the speedup of Algorithm 1. Have we achieved optimality?

N = 2
A = rand(N, N)
B = rand(N, N)
C = similar(A)
T1 = @belapsed matmul_seq!(C, A, B)
C = similar(A)
TP = @belapsed matmul_dist_1!(C, A, B)
P = nworkers()
println("Speedup = ", T1 / TP)
println("Optimal speedup = ", P)
println("Efficiency = ", 100 * (T1 / TP) / P, "%")

Communication overhead

Since communication is often the main bottleneck in distributed algorithms, we want to reduce the communication per unit of computation in the worker threads. Let's calculate the communication overhead of Algorithm 1, which will help us understand why its speedup is so poor.

The answer is b, obviously.

The answer is b, there is only one loop, calculating the product.

Therefore, we can conclude:

In other words, the communication cost and the computation cost are of the same order of magnitude. Since communication is several orders of magnitude slower than computation in real systems, the runtime of the worker threads is mainly determined by communication. This explains why our speedup is so poor.

Parallel Algorithm 2

Let's study the next algorithm and see if we can improve efficiency by increasing the granularity of each parallel task (i.e., the workload). In Parallel Algorithm 2, each worker computes an entire row of C.

Data dependencies

The answer is d, because computing a row of C requires multiplying a row of A by each column of B in sequance.

Implementation

The main implementation steps are as follows:

  1. The worker receives the corresponding row of matrix A, A[i, :], and matrix B from the master process.
  2. The worker computes the product of the row of A and matrix B.
  3. The worker sends the result (the row of matrix C, C[i, :]) back to the master process.

The Julia code implementation is as follows:

function matmul_dist_2!(C, A, B)
    m = size(C, 1)
    n = size(C, 2)
    l = size(A, 2)
    @assert size(A, 1) == m
    @assert size(B, 2) == n
    @assert size(B, 1) == l
    z = zero(eltype(C))
    @assert nworkers() == m
    iw = 0
    @sync for i in 1:m
        Ai = A[i, :]
        iw += 1
        w = workers()[iw]
        ftr = @spawnat w begin
            Ci = fill(z, n)
            for j in 1:n
                for k in 1:l
                    @inbounds Ci[j] += Ai[k] * B[k, j]
                end
            end
            Ci
        end
        @async C[i, :] = fetch(ftr)
    end
    C
end

Experimental speedup

N = 4
A = rand(N, N)
B = rand(N, N)
C = similar(A)
T1 = @belapsed matmul_seq!(C, A, B)
C = similar(A)
TP = @belapsed matmul_dist_2!(C, A, B)
P = nworkers()
println("Speedup = ", T1 / TP)
println("Optimal speedup = ", P)
println("Efficiency = ", 100 * (T1 / TP) / P, "%")

Complexity

The speedup is still far from optimal. Let's study the communication overhead of this algorithm. Remember that Algorithm 2 includes the following steps:

  1. The worker receives the corresponding row A[i, :] and matrix B from the master process.
  2. The worker computes the product of A[i, :] and B.
  3. The worker sends the result row C[i, :] back to the master process.

The answer is b, at this point, communication needs to transmit the entire matrix B, so it is O(N^2), and the computation is also two nested loops, with a complexity of O(N^2).

We can see that even if we increase the granularity, communication and computation are still of the same order of magnitude.

Parallel Algorithm 3

Let's further increase the granularity of the parallel tasks by computing multiple rows of C in the worker, with each worker computing N/P consecutive rows of C, where P is the number of workers.

Data dependencies

The answer is c, we only need the corresponding rows of A and the entire matrix B.

Implementation

The main steps of Algorithm 3 are as follows:

  1. The worker receives the corresponding rows A[rows, :] and matrix B from the master process.
  2. The worker computes the product of A[rows, :] and B.
  3. The worker sends the result C[rows, :] back to the master.

Communication overhead

The answer is b, the communication computation overhead is still O(N^2), so what about the computation overhead?

First, there needs to be an N/P loop, then an N loop, and then for each element, there needs to be an N loop. Therefore, it is O(N^3/P).

In this case, the ratio between communication and computation is O(P/N). If the matrix size N is much larger than the number of workers P, the communication overhead O(P/N) can be neglected, making a scalable implementation possible.

Summary

The following table summarizes the three algorithms:

  • Matrix multiplication can be computed in parallel, at least in theory, as all entries in the result matrix can be computed in parallel.
  • However, due to communication overhead, we cannot utilize all the potential parallelism in a distributed system.
  • We need a sufficiently large granularity to approach optimal speedup.

Exercise

Exercise 1

Implement the above Algorithm 3. For simplicity, assume that the number of rows of C is a multiple of the number of workers.

function matmul_dist_3!(C, A, B)
    m = size(C, 1)
    n = size(C, 2)
    l = size(A, 2)
    @assert size(A, 1) == m
    @assert size(B, 2) == n
    @assert size(B, 1) == l
    @assert mod(m, nworkers()) == 0
    nrows_w = div(m, nworkers())
    @sync for (iw, w) in enumerate(workers())
        lb = 1 + (iw - 1) * nrows_w
        ub = iw * nrows_w
        A_w = A[lb:ub, :]
        ftr = @spawnat w begin
            C_w = similar(A_w)
            matmul_seq!(C_w, A_w, B)
            C_w
        end
        @async C[lb:ub, :] = fetch(ftr)
    end
    C
end

@everywhere function matmul_seq!(C, A, B)
    m = size(C, 1)
    n = size(C, 2)
    l = size(A, 2)
    @assert size(A, 1) == m
    @assert size(B, 2) == n
    @assert size(B, 1) == l
    z = zero(eltype(C))
    for j in 1:n
        for i in 1:m
            Cij = z
            for k in 1:l
                @inbounds Cij = Cij + A[i, k] * B[k, j]
            end
            C[i, j] = Cij
        end
    end
    C
end

The test passes. Next, let's look at the speedup and other performance metrics.

Exercise 2

The implementation of Algorithm 1 is very impractical. We need as many processors as there are entries in the result matrix C. For a 1000*1000 matrix, we would need a supercomputer with a million processes. We can easily solve this by using fewer processors and generating the computation of entries in any available process.

function matmul_dist_1_v2!(C, A, B)
    m = size(C, 1)
    n = size(C, 2)
    l = size(A, 2)
    @assert size(A, 1) == m
    @assert size(B, 2) == n
    @assert size(B, 1) == l
    z = zero(eltype(C))
    @sync for j in 1:n
        for i in 1:m
            Ai = A[i, :]
            Bj = B[:, j]
            ftr = @spawnat :any begin
                Cij = z
                for k in 1:l
                    @inbounds Cij += Ai[k] * Bj[k]
                end
                Cij
            end
            @async C[i, j] = fetch(ftr)
        end
    end
    C
end

With this new implementation, we can compute matrix multiplication of any size, and the number of workers is fixed.

Run the code below to test the performance of the implementation. Note that we are still far from the optimal speedup. Why? To answer this question, we calculate the communication and computation ratio of this implementation and reason about the results obtained. In this implementation, the number of times a worker is spawned is on average N^2/P.

Tags: Julia Parallel Computing matrix multiplication Distributed Systems performance analysis

Posted on Wed, 20 May 2026 04:43:01 +0000 by reeferd