This article presents the ESPNet series, a specialized network architecture designed for semantic segmentation of high-resolution images. The framework targets efficiency in computation, memory footprint, and power consumption. The core contribution is the Efficient Spatial Pyramid (ESP) module, which forms the foundational building block of the architecture. The subsequent ESPNet V2 introduces group point-wise convolutions and depth-wise dilated separable convolutions to further expand the effective receptive field while reducing floating-point operations and model parameters.
ESPNet V1 Architecture
ESPNet V1 is specifically engineered for high-resolution image semantic segmentation, delivering exceptional efficiency across computational demands, memory utilization, and power consumption. The primary innovation involves the Efficient Spatial Pyramid module, which decomposes standard convolutions into point-wise operations followed by a dilated convolution pyramid. This decomposition substantially reduces computational overhead while maintaining competitive segmentation quality.
Efficient Spatial Pyramid Module
The ESP module employs convolution factorization principles, decomposing a standard convolution into a point-wise convolution followed by a spatial pyramid of dilated convolutions. The point-wise convolution projects the input feature maps into a lower-dimensional space using d = N/K kernels of size 1×1×M, where M is the number of input channels, N the number of output channels, and K the number of pyramid branches. This dimensionality reduction through 1×1 convolutions significantly decreases the parameter count. The dilated convolution pyramid then resamples the low-dimensional feature maps in K parallel branches with increasing dilation rates.
The factorization approach dramatically reduces both parameters and memory consumption within the ESP module while preserving an expanded receptive field. The module implements a reduce-split-transform-merge strategy: first reducing dimensionalities, then splitting into parallel branches, transforming through dilated convolutions, and finally merging outputs.
Considering the parameter analysis, the first component uses d = N/K kernels of size 1×1×M to project the M-dimensional input features to d dimensions, contributing M·N/K parameters. The second component applies K dilated kernels of size n×n×d, contributing K·n²·d² = (N·n)²/K parameters, for a total of M·N/K + (N·n)²/K. For example, with M = N = 128, n = 3, and K = 4, a standard convolution learns n²·M·N = 147,456 parameters, while the ESP module learns 4,096 + 36,864 = 40,960, roughly a 3.6× reduction.
The hyperparameter K, the width divider, controls the channel dimensions inside each ESP module. During reduction, the module applies a point-wise convolution to project the feature maps from M dimensions to d = N/K dimensions. The split operation distributes the low-dimensional feature maps across K parallel branches. Each branch processes the features using n×n kernels with dilation rate 2^(k−1) for k = 1, ..., K. The K parallel dilated convolution outputs are concatenated to produce an N-dimensional output feature map.
The following PyTorch implementation demonstrates the ESP module structure (the conv1x1 and ConvBNAct helpers are minimal stand-ins defined inline):
import torch
import torch.nn as nn


def conv1x1(in_ch, out_ch, stride=1):
    # Point-wise projection used by the reduce step
    return nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False)


class ConvBNAct(nn.Module):
    # Dilated convolution followed by batch normalization and a non-linearity
    def __init__(self, in_ch, out_ch, kernel_size, stride, dilation, act_type='prelu'):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2  # preserve spatial size
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding=padding,
                              dilation=dilation, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU(out_ch) if act_type == 'prelu' else nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class ESPUnit(nn.Module):
    def __init__(self, in_ch, out_ch, num_branches=5, kernel_sz=3, stride_val=1, activation='prelu'):
        super().__init__()
        self.num_branches = num_branches
        self.stride_val = stride_val
        # The residual connection is only valid when shapes are preserved
        self.skip_connection = (in_ch == out_ch) and (stride_val == 1)
        branch_channels = out_ch // num_branches
        # The first branch absorbs the remainder when out_ch is not divisible by num_branches
        primary_channels = out_ch - (num_branches - 1) * branch_channels
        self.uniform_division = primary_channels == branch_channels
        # Reduce: point-wise projection into a low-dimensional space
        self.reduce_conv = conv1x1(in_ch, branch_channels, stride_val)
        if not self.uniform_division:
            self.primary_conv = conv1x1(in_ch, primary_channels, stride_val)
        # Split/Transform: K parallel dilated convolutions with rates 2^(k-1)
        self.branch_layers = nn.ModuleList()
        for idx in range(1, num_branches + 1):
            dilation_val = 2 ** (idx - 1)
            ch = primary_channels if idx == 1 else branch_channels
            self.branch_layers.append(
                ConvBNAct(ch, ch, kernel_sz, 1, dilation_val, act_type=activation)
            )

    def forward(self, x):
        if self.skip_connection:
            residual = x
        transformed = []
        if self.uniform_division:
            x = self.reduce_conv(x)
            for i in range(self.num_branches):
                transformed.append(self.branch_layers[i](x))
            # Hierarchical Feature Fusion: sum adjacent branch outputs before concatenation
            for j in range(1, self.num_branches):
                transformed[j] = transformed[j] + transformed[j - 1]
        else:
            x_primary = self.primary_conv(x)
            x_branch = self.reduce_conv(x)
            transformed.append(self.branch_layers[0](x_primary))
            for i in range(1, self.num_branches):
                transformed.append(self.branch_layers[i](x_branch))
            # HFF starts at index 2: the first branch has a different channel count
            for j in range(2, self.num_branches):
                transformed[j] = transformed[j] + transformed[j - 1]
        # Merge: concatenate the branch outputs along the channel dimension
        output = torch.cat(transformed, dim=1)
        if self.skip_connection:
            output = output + residual
        return output
Hierarchical Feature Fusion
While concatenating dilated convolution outputs provides an expanded effective receptive field, it also introduces unwanted checkerboard or gridding artifacts. These are easiest to see when a feature map containing a single active pixel is convolved with a 3×3 kernel at dilation rate r = 2: the non-zero responses form a sparse grid. The Hierarchical Feature Fusion (HFF) mechanism addresses this by hierarchically adding the feature maps obtained from kernels with different dilation rates before concatenation. This solution is simple and effective and adds no architectural complexity, unlike alternative approaches that remove gridding by learning extra parameters with small-dilation-rate convolutions.
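In code, HFF is just a cumulative sum over adjacent branch outputs before the final concatenation; a minimal sketch, where branch_outputs is a hypothetical list holding the K same-shaped dilated-convolution feature maps:

# branch_outputs: hypothetical list of K feature maps from the dilated pyramid,
# ordered by increasing dilation rate and sharing identical shapes
for k in range(1, len(branch_outputs)):
    branch_outputs[k] = branch_outputs[k] + branch_outputs[k - 1]
fused = torch.cat(branch_outputs, dim=1)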
To improve gradient flow throughout the network, an element-wise summation combines the input and output feature maps of the ESP module, facilitating more effective backpropagation.
Network Architecture
ESPNet leverages ESP modules for learning convolutions and downsampling operations, with the first layer implementing standard strided convolution. All convolutional and ESP layers are followed by batch normalization and PReLU non-linearity, except for the final point-wise convolution which omits both. The final layer applies softmax for pixel-wise classification.
The architecture presents four progressive variants. ESPNet-A represents the baseline network accepting RGB input and utilizing ESP modules to learn representations across different spatial hierarchies, producing segmentation masks. ESPNet-B enhances information flow by sharing feature maps between strided ESP modules and preceding ESP modules. ESPNet-C further strengthens input image processing within ESPNet-B to improve information propagation. All three variants generate outputs at 1/8 spatial resolution of the input. The complete ESPNet incorporates a lightweight decoder built on reduce-upsample-merge principles into ESPNet-C, outputting segmentation masks matching the input image resolution.
The hyperparameter α controls network depth for constructing computationally efficient edge-device networks. ESP modules at spatial level l are repeated α_l times. Higher spatial levels (l=0 and l=1) require additional memory due to larger feature map dimensions, so neither ESP nor convolutional modules are repeated at these levels to conserve memory.
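A minimal sketch of how such a depth multiplier could be applied, reusing the ESPUnit class defined earlier; make_level and the alpha value are illustrative, not the authors' configuration:

import torch.nn as nn

def make_level(channels, alpha):
    # Repeat a shape-preserving ESP module alpha times at one spatial level
    return nn.Sequential(*[ESPUnit(channels, channels) for _ in range(alpha)])

stage = make_level(64, alpha=3)  # e.g. alpha_2 = 3; levels l=0 and l=1 are never repeated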
ESPNet V2 Architecture
ESPNet V2 represents an evolution of ESPNet V1, introducing a lightweight, energy-efficient, and versatile convolutional neural network architecture. The model employs group convolutions and depth-wise dilated separable convolutions to capture expansive effective receptive fields while further reducing floating-point operations and parameter count. The architecture demonstrates effectiveness across image classification, object detection, and semantic segmentation tasks.
Key improvements over ESPNet V1 include:
- Replacement of point-wise convolutions with group point-wise convolutions
- Substitution of dilated convolutions with depth-wise dilated convolutions
- Integration of HFF between depth-wise dilated separable convolutions and point-wise convolutions to eliminate gridding artifacts
- Consolidation of K point-wise convolutions into a single group point-wise convolution (verified in the sketch after this list)
- Incorporation of average pooling to inject input image information into the EESP module
- Adoption of concatenation in place of element-wise addition for feature fusion
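The consolidation point above can be checked numerically; the following sketch (illustrative shapes, not the authors' code) verifies that K independent point-wise convolutions over K channel groups match a single group point-wise convolution whose weight is the K kernels stacked:

import torch
import torch.nn as nn

K, ch = 4, 64
x = torch.randn(1, ch, 8, 8)
# K independent 1x1 convolutions, one per channel group
convs = nn.ModuleList([nn.Conv2d(ch // K, ch // K, 1, bias=False) for _ in range(K)])
# One group point-wise convolution with g=K, weights copied from the K kernels
grouped = nn.Conv2d(ch, ch, 1, groups=K, bias=False)
grouped.weight.data = torch.cat([c.weight.data for c in convs], dim=0)
y_split = torch.cat([c(t) for c, t in zip(convs, x.chunk(K, dim=1))], dim=1)
y_group = grouped(x)
assert torch.allclose(y_split, y_group, atol=1e-6)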
Depth-wise Dilated Separable Convolution
The Depth-wise Dilated Separable Convolution (DDConv) operates in two sequential steps. First, a depth-wise dilated convolution with dilation rate r processes each input channel independently, learning representative features from an enlarged effective receptive field. Second, a standard 1×1 point-wise convolution learns linear combinations of the DDConv outputs.
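A minimal PyTorch sketch of this two-step operation; the module name and defaults are illustrative rather than the authors' code:

import torch.nn as nn

class DepthwiseDilatedSeparableConv(nn.Module):
    # Illustrative: depth-wise dilated convolution followed by a 1×1 point-wise convolution
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=2):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2  # preserve spatial size
        # Step 1: one dilated filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding,
                                   dilation=dilation, groups=in_ch, bias=False)
        # Step 2: 1×1 convolution learns linear combinations across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))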
The parameter counts and effective receptive fields of the relevant convolution types, for an n×n kernel mapping M input channels to N output channels with g groups and dilation rate r, are compared below:

Convolution type | Parameters | Effective receptive field
--- | --- | ---
Standard | n²MN | n×n
Group | n²MN/g | n×n
Depth-wise separable | n²M + MN | n×n
Depth-wise dilated separable | n²M + MN | n'×n', with n' = (n−1)r + 1
Enhanced Efficient Spatial Pyramid Module
The EESP module introduces a novel architecture leveraging depth-wise dilated separable convolutions and group point-wise convolutions, specifically optimized for edge deployment. Inspired by ESPNet's structural approach, EESP implements the reduce-split-transform-merge strategy while significantly reducing computational complexity through grouped operations. The hierarchical EESP variant incorporates shortcut connections to the input image for more effective multi-scale representation learning.
With input channels M = 240, group count g = K = 4, and d = M/K = 60, EESP achieves roughly a 7× reduction in parameters compared to the original ESP module. The reduction factor is (Md + n²d²K) / (Md/g + (n² + d)dK), where K is the number of dilated convolution pyramid levels; for n = 3 this evaluates to 144,000 / 20,160 ≈ 7.1. Recognizing that computing K individual point-wise convolutions is equivalent to a single point-wise grouped convolution with group count K, and that grouped convolution implementations offer superior efficiency, the architecture adopts the final optimized structure shown below (GroupConv, DilatedConv, and BatchNorm are minimal stand-ins defined inline):
import torch
import torch.nn as nn

def GroupConv(in_ch, out_ch, kernel_size, stride=1, groups=1):
    return nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                     padding=kernel_size // 2, groups=groups, bias=False)

def DilatedConv(in_ch, out_ch, kernelSize, stride=1, groups=1, dilation=1):
    # Padding keeps the spatial size for odd kernels (halved when stride=2)
    padding = dilation * (kernelSize - 1) // 2
    return nn.Conv2d(in_ch, out_ch, kernelSize, stride=stride, padding=padding,
                     groups=groups, dilation=dilation, bias=False)

BatchNorm = nn.BatchNorm2d


class EnhancedESP(nn.Module):
    '''
    Implements the REDUCE ---> SPLIT ---> TRANSFORM ---> MERGE pipeline.
    '''
    def __init__(self, channels_in, channels_out, stride_val=1, num_groups=4,
                 receptive_limit=7, downsample_mode='esp'):
        super().__init__()
        self.stride_val = stride_val
        base_ch = channels_out // num_groups
        primary_ch = channels_out - (num_groups - 1) * base_ch
        assert downsample_mode in ['avg', 'esp']
        assert base_ch == primary_ch, \
            f"Channel mismatch for depth-wise convolution: {base_ch} vs {primary_ch}"
        # Reduce: group point-wise projection into a low-dimensional space
        self.projection = GroupConv(channels_in, base_ch, 1, stride=1, groups=num_groups)
        # Map an effective receptive field size to the dilation rate of a 3×3 kernel
        receptive_ksize_map = {3: 1, 5: 2, 7: 3, 9: 4, 11: 5, 13: 6, 15: 7, 17: 8}
        self.kernel_sizes = []
        for idx in range(num_groups):
            ksize = 3 + 2 * idx
            # Fall back to 3 when the effective kernel would exceed the receptive limit
            self.kernel_sizes.append(ksize if ksize <= receptive_limit else 3)
        self.kernel_sizes.sort()
        # Split/Transform: depth-wise dilated convolutions with increasing dilation
        self.dilated_branches = nn.ModuleList()
        for idx in range(num_groups):
            dilation = receptive_ksize_map[self.kernel_sizes[idx]]
            self.dilated_branches.append(
                DilatedConv(base_ch, base_ch, kernelSize=3, stride=stride_val,
                            groups=base_ch, dilation=dilation)
            )
        # Merge: group point-wise convolution over the concatenated branches
        self.expansion_conv = GroupConv(channels_out, channels_out, 1, 1, groups=num_groups)
        self.post_concat_bn = BatchNorm(channels_out)
        self.final_activation = nn.PReLU(channels_out)
        self.downsample_avg = (downsample_mode == 'avg')

    def forward(self, input_tensor):
        # Reduce: project high-dimensional features to a low-dimensional space
        projected = self.projection(input_tensor)
        outputs = [self.dilated_branches[0](projected)]
        # Split --> Transform --> HFF for each branch
        for branch_idx in range(1, len(self.dilated_branches)):
            branch_out = self.dilated_branches[branch_idx](projected)
            # Hierarchical Feature Fusion removes gridding artifacts
            branch_out = branch_out + outputs[branch_idx - 1]
            outputs.append(branch_out)
        # Merge: concatenate all branch outputs, then BN and group point-wise conv
        merged = self.expansion_conv(self.post_concat_bn(torch.cat(outputs, 1)))
        # When used inside the strided downsampler, return before residual/activation
        if self.stride_val == 2 and self.downsample_avg:
            return merged
        # Apply the residual connection when dimensions match
        if merged.size() == input_tensor.size():
            merged = merged + input_tensor
        return self.final_activation(merged)
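A quick usage sketch under the assumptions above, with illustrative shapes: a non-strided EESP block preserves both spatial resolution and channel count, which enables the residual connection.

# Illustrative shapes; 240 channels split across K=4 branches of 60 channels each
block = EnhancedESP(channels_in=240, channels_out=240, num_groups=4)
out = block(torch.randn(1, 240, 56, 56))
print(out.shape)  # torch.Size([1, 240, 56, 56])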
Strided EESP Module
To enable effective multi-scale feature learning, the strided EESP module incorporates four key modifications. First, the DDConv layer receives a stride attribute for spatial reduction. Second, the right-side shortcut path includes average pooling for dimension matching. Third, feature fusion transitions from addition to concatenation, increasing feature dimensionality. Fourth, downsampled information from the original input image is incorporated, enriching feature representation. The process initially downsamples the image to match feature map dimensions, employs a standard 3×3 convolution for spatial representation learning, and uses a subsequent point-wise convolution for learning linear combinations across inputs, projecting into high-dimensional space.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Thin wrappers assumed by this sketch (conv + batch norm, with and without ReLU)
def ConvBN(in_ch, out_ch, kernel_size, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch))

def ConvBNReLU(in_ch, out_ch, kernel_size, stride=1):
    return nn.Sequential(ConvBN(in_ch, out_ch, kernel_size, stride), nn.ReLU(inplace=True))


class StridedDownSampler(nn.Module):
    '''
    Downsampling with two parallel branches: average pooling and strided EESP.
    Outputs are concatenated and passed through activation for the final result.
    '''
    def __init__(self, channels_in, channels_out, num_groups=4, receptive_limit=9,
                 reinforce_input=True, image_channels=3):
        super().__init__()
        new_out_ch = channels_out - channels_in
        self.strided_eesp = EnhancedESP(channels_in, new_out_ch, stride_val=2,
                                        num_groups=num_groups,
                                        receptive_limit=receptive_limit,
                                        downsample_mode='avg')
        self.pool_avg = nn.AvgPool2d(kernel_size=3, padding=1, stride=2)
        # Reinforcement path: a 3×3 conv learns spatial representations of the
        # downsampled input image (image_channels = 3 for RGB), and a point-wise
        # conv projects them to channels_out
        self.input_reinforce = None
        if reinforce_input:
            self.input_reinforce = nn.Sequential(
                ConvBNReLU(image_channels, image_channels, 3, 1),
                ConvBN(image_channels, channels_out, 1, 1))
        self.activation = nn.PReLU(channels_out)

    def forward(self, input_tensor, aux_input=None):
        avg_result = self.pool_avg(input_tensor)
        eesp_result = self.strided_eesp(input_tensor)
        combined = torch.cat([avg_result, eesp_result], 1)
        if aux_input is not None and self.input_reinforce is not None:
            # Downsample the raw input image until it matches the feature map size
            current_w = avg_result.size(2)
            while aux_input.size(2) > current_w:
                aux_input = F.avg_pool2d(aux_input, kernel_size=3, padding=1, stride=2)
            combined = combined + self.input_reinforce(aux_input)
        return self.activation(combined)
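A quick usage sketch with illustrative shapes: the downsampler halves the spatial resolution while growing the channel count, optionally reinforced by the raw RGB image.

down = StridedDownSampler(channels_in=32, channels_out=64)
feats = torch.randn(1, 32, 112, 112)   # feature maps at 1/2 input resolution
image = torch.randn(1, 3, 224, 224)    # raw input image used for reinforcement
out = down(feats, aux_input=image)
print(out.shape)  # torch.Size([1, 64, 56, 56])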