Since AlexNet's introduction in 2012, convolutional neural networks have become widely adopted in computer vision tasks. As performance requirements increased, researchers developed deeper architectures like VGG, GoogLeNet, ResNet, and DenseNet. However, these deeper networks introduced significant efficiency challenges:
- Storage Requirements: Deep networks contain millions of parameters, demanding substantial memory
- Computational Speed: Real-time applications require millisecond-level inference, which becomes challenging with complex models
Model compression techniques address these issues by reducing parameters in trained networks. Alternatively, efficient network design focuses on creating fundamentally more efficient convolutional operations.
MobileNetV1
Google introduced MobileNet in 2017 as a lightweight CNN for mobile devices. Its key innovation was depthwise separable convolution.
Standard convolution filters the input and mixes channels in a single step, with computational cost:
DK × DK × M × N × DF × DF
where DF is the spatial size of the output feature map, DK is the kernel size, M is the number of input channels, and N is the number of output channels.
Depthwise separable convolution splits this into two operations:
- Depthwise convolution: Channel-independent spatial convolution
- Pointwise convolution: 1×1 convolution for channel mixing
The computational cost becomes:
DK × DK × M × DF × DF + 1 × 1 × M × N × DF × DF
This cuts computation by a factor of 1/N + 1/DK²; with 3×3 kernels, depthwise separable convolution uses roughly 8-9× fewer operations than standard convolution.
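The arithmetic above can be checked directly in PyTorch by counting parameters, which follow the same ratio as multiply-adds for a fixed output size. A minimal sketch, assuming M = N = 128 channels and a 3×3 kernel; all sizes are illustrative:

```python
import torch
import torch.nn as nn

# Illustrative sizes: M input channels, N output channels, K x K kernel
M, N, K = 128, 128, 3

standard = nn.Conv2d(M, N, K, padding=1, bias=False)

separable = nn.Sequential(
    # Depthwise: one K x K filter per input channel (groups=M)
    nn.Conv2d(M, M, K, padding=1, groups=M, bias=False),
    # Pointwise: 1x1 convolution mixes information across channels
    nn.Conv2d(M, N, 1, bias=False),
)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

x = torch.randn(1, M, 14, 14)
ratio = n_params(standard) / n_params(separable)
print(f"parameter ratio: {ratio:.2f}x")  # about 8.4x with these sizes
```

Both layers produce the same output shape, so the savings come purely from factorizing filtering and channel mixing into separate steps.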
The network uses 28 layers with stride-based downsampling instead of pooling. Both depthwise and pointwise layers are followed by batch normalization and ReLU6 activation.
ReLU6 (ReLU capped at 6) works well on embedded devices because its bounded output range preserves precision in float16/int8 arithmetic. Two hyperparameters control model size:
- Width multiplier (α): Scales the number of channels in every layer
- Resolution multiplier (ρ): Scales the input image resolution
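Scaling the channel counts shrinks the dominant pointwise term quadratically, and scaling the resolution shrinks every term quadratically as well, so the overall cost falls roughly with the square of each multiplier. A quick numeric check against the separable cost formula, with illustrative sizes:

```python
# Cost of depthwise separable convolution, per the formula above
def separable_cost(DK, M, N, DF):
    return DK * DK * M * DF * DF + M * N * DF * DF

# Illustrative baseline sizes
DK, M, N, DF = 3, 128, 128, 14
base = separable_cost(DK, M, N, DF)

# Width and resolution multipliers (illustrative values)
alpha, rho = 0.5, 0.5
scaled = separable_cost(DK, int(alpha * M), int(alpha * N), int(rho * DF))

# Close to the 16x predicted by squaring each multiplier; the depthwise
# term scales a bit less, since only one of its channel factors shrinks
print(f"cost reduced {base / scaled:.1f}x")
```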
MobileNetV2
The 2018 update introduced inverted residuals and linear bottlenecks.
Linear Bottlenecks
Researchers observed that ReLU destroys information when applied in low-dimensional spaces: if low-dimensional data is embedded in a higher-dimensional space, passed through ReLU, and projected back, the input survives largely intact only when the intermediate dimensionality is high. Because a bottleneck's output is low-dimensional, the solution is to replace its final ReLU with a linear activation.
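This effect can be reproduced with a small numpy experiment in the spirit of the MobileNetV2 paper's illustration: project 2-D points into n dimensions with a random matrix, apply ReLU, and map back with the pseudo-inverse. The dimensions, seed, and point count here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((2, 200))  # 200 points in 2-D

def relu_roundtrip_error(n_dims):
    # Embed in n_dims dimensions, apply ReLU, project back
    T = rng.standard_normal((n_dims, 2))
    activated = np.maximum(T @ points, 0)
    recovered = np.linalg.pinv(T) @ activated
    return np.mean((recovered - points) ** 2)

low = relu_roundtrip_error(3)
high = relu_roundtrip_error(30)
print(f"error at 3 dims: {low:.3f}, at 30 dims: {high:.3f}")
```

The reconstruction error shrinks as the embedding dimension grows, which is why V2 applies ReLU only in the expanded space and keeps the narrow bottleneck linear.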
Inverted Residuals
Standard residual blocks use "compress-conv-expand" flow:
- 1×1 convolution reduces channels
- 3×3 convolution extracts features
- 1×1 convolution restores channels
Inverted residuals use "expand-conv-compress":
- 1×1 convolution expands channels (6× expansion factor)
- 3×3 depthwise convolution
- 1×1 convolution compresses channels
This design works because depthwise convolution cannot change the channel count: it extracts features channel by channel, so expanding first gives it a richer, higher-dimensional space to work in.
The combined block structure:
- Stride=1: Expansion → depthwise conv → compression → residual connection
- Stride=2: Same but without residual connection
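The stride-2 case drops the shortcut for a purely mechanical reason: the depthwise stride halves the spatial size, so input and output can no longer be added. A quick shape check (channel count and sizes are illustrative, and only the depthwise stage is shown):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 24, 56, 56)

# Depthwise convolutions at stride 1 and stride 2
dw_s1 = nn.Conv2d(24, 24, 3, stride=1, padding=1, groups=24, bias=False)
dw_s2 = nn.Conv2d(24, 24, 3, stride=2, padding=1, groups=24, bias=False)

print(dw_s1(x).shape)  # torch.Size([1, 24, 56, 56]) -> residual add works
print(dw_s2(x).shape)  # torch.Size([1, 24, 28, 28]) -> cannot add to x
```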
MobileNetV2 contains 54 layers, significantly deeper than V1 while maintaining efficiency.
MobileNetV3
The 2019 version combines elements from previous versions with neural architecture search (NAS). Key improvements include:
Network Architecture Optimizations
- Input/Output Optimization: The computationally expensive first and last layers were redesigned. Moving average pooling earlier in the final stage reduces latency by about 7 ms (roughly 11% of runtime) with almost no accuracy loss.
- Input Channel Reduction: The input layer was optimized to reduce channels from 32 to 16 while maintaining accuracy.
Activation Functions
- h-swish: A hardware-friendly replacement for swish that swaps the expensive sigmoid for a piecewise-linear ReLU6 expression
- h-swish is used only in the deeper layers, where feature maps are smaller and the cost of the nonlinearity is lower
Enhanced Blocks
- Squeeze-and-Excite Integration: Lightweight channel attention mechanism added to V2 blocks
- NAS-optimized architectures for different resource constraints (Large and Small variants)
Network Structure
The architecture comprises three sections:
- Initial section: Single convolutional layer with 3×3 kernels
- Middle section: Multiple convolutional blocks (MobileBlock layers)
- Final section: Two 1×1 convolutional layers replacing fully connected layers
Implementation Details
Initial Convolution:
initial_channels = adjust_channels(16 * width_multiplier)
self.initial_conv = nn.Sequential(
    nn.Conv2d(3, initial_channels, 3, stride=2, padding=1, bias=False),  # bias folded into BN
    nn.BatchNorm2d(initial_channels),
    h_swish(inplace=True)
)
h-swish and h-sigmoid Activations:
class h_sigmoid(nn.Module):
    # ReLU6-based approximation of sigmoid: relu6(x + 3) / 6
    def forward(self, x):
        return F.relu6(x + 3) / 6

class h_swish(nn.Module):
    # x * h_sigmoid(x), written as a module so it fits in nn.Sequential
    def __init__(self, inplace=True):
        super().__init__()
        self.inplace = inplace

    def forward(self, x):
        return x * F.relu6(x + 3, inplace=self.inplace) / 6
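To see how closely the ReLU6-based form tracks the original swish (x · sigmoid(x)), here is a quick numerical comparison; the evaluation grid is arbitrary:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-6, 6, 121)

h_swish_vals = x * F.relu6(x + 3) / 6
swish_vals = x * torch.sigmoid(x)

max_diff = (h_swish_vals - swish_vals).abs().max().item()
print(f"max |h-swish - swish| on [-6, 6]: {max_diff:.3f}")  # about 0.14, near x = +/-3
```

The curves diverge most around x = ±3, but the piecewise-linear version needs no exponentials, which is what makes it cheap on mobile hardware.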
MobileBlock Structure:
class MobileBlock(nn.Module):
    def __init__(self, input_ch, output_ch, kernel, stride, activation, use_se, expansion):
        super().__init__()
        # 1x1 expansion convolution (inverted residual: widen first)
        self.expand_conv = nn.Sequential(
            nn.Conv2d(input_ch, expansion, 1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(expansion),
            activation(inplace=True)
        )
        # Depthwise convolution: groups == channels, one filter per channel
        self.depth_conv = nn.Sequential(
            nn.Conv2d(expansion, expansion, kernel, stride=stride,
                      padding=kernel // 2, groups=expansion, bias=False),
            nn.BatchNorm2d(expansion),
            activation(inplace=True)
        )
        # Optional squeeze-and-excite channel attention
        self.se_block = SqueezeExcite(expansion) if use_se else None
        # 1x1 projection back down; no activation (linear bottleneck)
        self.project_conv = nn.Sequential(
            nn.Conv2d(expansion, output_ch, 1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(output_ch)
        )
        # Identity shortcut only when input and output shapes match
        self.use_residual = (stride == 1 and input_ch == output_ch)

    def forward(self, x):
        residual = x
        out = self.expand_conv(x)
        out = self.depth_conv(out)
        if self.se_block is not None:
            out = self.se_block(out)
        out = self.project_conv(out)
        if self.use_residual:
            return out + residual
        return out
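As a sanity check on shapes, the same pipeline can be written as a stripped-down nn.Sequential (expansion width 64, plain ReLU, no SE block; every size here is illustrative):

```python
import torch
import torch.nn as nn

input_ch, output_ch, expansion, stride = 16, 24, 64, 2

block = nn.Sequential(
    # Expansion 1x1
    nn.Conv2d(input_ch, expansion, 1, bias=False),
    nn.BatchNorm2d(expansion),
    nn.ReLU(inplace=True),
    # Depthwise 3x3, stride 2
    nn.Conv2d(expansion, expansion, 3, stride=stride, padding=1,
              groups=expansion, bias=False),
    nn.BatchNorm2d(expansion),
    nn.ReLU(inplace=True),
    # Linear projection 1x1
    nn.Conv2d(expansion, output_ch, 1, bias=False),
    nn.BatchNorm2d(output_ch),
)

y = block(torch.randn(2, input_ch, 32, 32))
print(y.shape)  # torch.Size([2, 24, 16, 16]): channels projected, spatial halved
```

Because the stride is 2 and the channel count changes, this configuration is exactly one where the residual connection would be skipped.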
Squeeze-and-Excite Module:
class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        reduced_channels = channels // reduction
        self.attention = nn.Sequential(
            nn.Linear(channels, reduced_channels),
            nn.ReLU(inplace=True),
            nn.Linear(reduced_channels, channels),
            h_sigmoid()
        )

    def forward(self, x):
        batch, channels, height, width = x.size()
        # Squeeze: global average pooling to one value per channel
        squeezed = F.adaptive_avg_pool2d(x, 1).view(batch, channels)
        # Excite: per-channel weights in (0, 1)
        weights = self.attention(squeezed).view(batch, channels, 1, 1)
        return x * weights
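The mechanism can also be seen in functional form: per-channel weights in [0, 1] rescale the feature map without changing its shape. The weight matrices below are random, purely for illustration:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 8, 14, 14)
w1 = torch.randn(2, 8)   # squeeze 8 channels down to 2 (reduction=4)
w2 = torch.randn(8, 2)   # expand back to 8

squeezed = F.adaptive_avg_pool2d(x, 1).flatten(1)   # (2, 8): one value per channel
hidden = F.relu(squeezed @ w1.t())                  # (2, 2)
weights = F.relu6(hidden @ w2.t() + 3) / 6          # h-sigmoid -> values in [0, 1]
out = x * weights.view(2, 8, 1, 1)

print(out.shape)  # torch.Size([2, 8, 14, 14]): same shape, channels reweighted
```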
Final Layers:
# Early average pooling for efficiency
output = self.final_conv1(features)
batch, channels, height, width = output.size()
output = F.adaptive_avg_pool2d(output, 1)
output = self.final_conv2(output)
output = output.view(batch, -1)
Channel Adjustment:
def adjust_channels(channels, divisor=8, minimum=None):
    if minimum is None:
        minimum = divisor
    # Round to the nearest multiple of divisor, never below minimum
    adjusted = max(minimum, int(channels + divisor / 2) // divisor * divisor)
    # Ensure adjustment doesn't reduce the count by more than 10%
    if adjusted < 0.9 * channels:
        adjusted += divisor
    return adjusted
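In use, the raw channel counts produced by a width multiplier land on hardware-friendly multiples of 8; the definition is repeated here so the snippet runs on its own:

```python
def adjust_channels(channels, divisor=8, minimum=None):
    # Round to the nearest multiple of divisor, never below minimum,
    # and never more than 10% below the requested count
    if minimum is None:
        minimum = divisor
    adjusted = max(minimum, int(channels + divisor / 2) // divisor * divisor)
    if adjusted < 0.9 * channels:
        adjusted += divisor
    return adjusted

print(adjust_channels(24 * 0.75))  # 18 -> 24 (the 10% guard rounds back up)
print(adjust_channels(100))        # 100 -> 104
print(adjust_channels(4))          # 4 -> 8 (clamped to the minimum)
```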