Deep Dive into Linux Epoll and I/O Multiplexing

I/O multiplexing is essential for handling high concurrency. Before epoll arrived in Linux 2.6, select and poll were the primary mechanisms, but they struggle at massive connection counts. epoll resolves these limitations through a fundamentally different architecture.

Limitations of Select and Poll

When managing hundreds of thousands of connections where only a small fraction are active at any given moment, select introduces severe inefficiencies. The select function requires passing the entire set of monitored file descriptors to the kernel on every call: conceptually, active_fds = select(all_monitored_fds). As the number of connections grows, the cost of copying and scanning these descriptor sets grows linearly, giving O(n) performance per call. Furthermore, select is constrained by FD_SETSIZE, typically hardcoded to 1024:

#define __FD_SETSIZE 1024

While poll removes the maximum descriptor limit, it still relies on linear scanning of all monitored descriptors, failing to solve the fundamental O(n) complexity bottleneck.

Epoll Architecture and Efficiency

epoll separates infrequent operations (registering descriptors) from frequent ones (waiting for events) using three core system calls:

  1. epoll_create(): Initializes the epoll instance.
  2. epoll_ctl(): Adds, modifies, or removes monitored descriptors (infrequent).
  3. epoll_wait(): Retrieves active descriptors (frequent).

Unlike select, epoll_wait does not require passing the entire descriptor set on each call. Its efficiency rests on two kernel mechanisms: a red-black tree for the monitored set and a doubly linked list for the ready set.

A Note on mmap: It is often claimed that epoll shares a memory region between kernel and user space via mmap to avoid copying. This is a myth: epoll_wait still copies ready events to user space (via copy_to_user). The real saving is that only the ready events are copied, rather than the entire monitored set as with select.

Red-Black Tree: Monitored sockets are stored in a red-black tree within the kernel. Adding or removing descriptors via epoll_ctl operates in O(log N) time.

Ready Doubly Linked List: When a network event occurs, the device driver invokes the callback function ep_poll_callback, which adds the corresponding descriptor to a ready list. epoll_wait simply checks this list and returns ready events in O(1) time.

Key kernel data structures (simplified from the kernel's fs/eventpoll.c) manage this state:

struct epitem {
    struct rb_node rbn;            // Node in the red-black tree
    struct list_head rdllink;      // Node in the ready list
    struct epoll_filefd ffd;       // File descriptor info
    struct eventpoll *ep;          // Pointer to the parent instance
    struct epoll_event event;      // Registered event mask and user data
};

struct eventpoll {
    spinlock_t lock;               // Protects the ready list
    struct mutex mtx;              // Serializes modifications
    wait_queue_head_t wq;          // Processes blocked in epoll_wait
    struct list_head rdllist;      // List of ready descriptors
    struct rb_root rbr;            // Root of the red-black tree
    struct epitem *ovflist;        // Overflow list used during event delivery
};

Comparison of I/O Multiplexing Models

| Feature | select | poll | epoll |
|---|---|---|---|
| Event set | Three separate bitmasks (read, write, exception); must be reset on each call | Single array with separate fields for requested and returned events | Kernel-managed event table; only ready events are returned |
| Indexing complexity | O(n) | O(n) | O(1) |
| Max descriptors | FD_SETSIZE (typically 1024) | Limited only by process/system fd limits | Limited only by process/system fd limits |
| Trigger mode | Level triggered (LT) | Level triggered (LT) | LT and edge triggered (ET) |
| Internal mechanism | Linear polling, O(n) | Linear polling, O(n) | Callback based, O(1) |

epoll excels in scenarios with many connections but few active ones. If most connections are highly active, the overhead of frequent callbacks may reduce its advantage.

Epoll API Overview

#include <sys/epoll.h>
int epoll_create(int size);

The size parameter is a historical hint; since Linux 2.6.8 it is ignored, but it must still be greater than zero.

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

Operations (op) include EPOLL_CTL_ADD, EPOLL_CTL_MOD, and EPOLL_CTL_DEL. The event structure is defined as:

struct epoll_event {
    uint32_t events;   // Epoll event mask
    epoll_data_t data; // User data
};

typedef union epoll_data {
    void *ptr;
    int fd;
    uint32_t u32;
    uint64_t u64;
} epoll_data_t;

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

Returns the number of ready descriptors, 0 if the timeout expired with no events, or -1 on error. A timeout of -1 blocks indefinitely, while 0 returns immediately.

EPOLLONESHOT

When using multiple worker threads, new data arriving on a socket that one thread is already servicing can wake a second thread, so two threads may end up reading the same descriptor concurrently. EPOLLONESHOT ensures that a descriptor reports an event only once until explicitly re-armed using epoll_ctl with EPOLL_CTL_MOD. This guarantees that a socket is processed by only one thread at a time, preventing race conditions.

Level Triggered (LT) vs Edge Triggered (ET)

LT Mode: As long as the buffer has data to read or space to write, epoll_wait will return the descriptor as ready.

ET Mode: Triggers only when the state changes (e.g., buffer transitions from empty to readable). It requires non-blocking file descriptors and exhaustive reading/writing in a loop.

Read Behavior in ET vs LT

Consider a program monitoring standard input:

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

int main() {
    int ep_inst = epoll_create(1);
    struct epoll_event cfg, occurred[5];
    cfg.data.fd = STDIN_FILENO;
    cfg.events = EPOLLIN | EPOLLET; // Edge Triggered
    epoll_ctl(ep_inst, EPOLL_CTL_ADD, STDIN_FILENO, &cfg);

    for (;;) {
        int n = epoll_wait(ep_inst, occurred, 5, -1);
        for (int i = 0; i < n; i++) {
            if (occurred[i].data.fd == STDIN_FILENO) {
                printf("Data arrived!\n");
            }
        }
    }
}

In ET mode, if input is provided but not fully consumed by a read call, subsequent calls to epoll_wait will block because the state did not change again. Changing to cfg.events = EPOLLIN (LT mode) causes epoll_wait to continuously return until the input buffer is emptied via read.

To properly handle ET mode reads, data must be drained completely:

// Inside the event loop (requires <errno.h>)
if (occurred[i].data.fd == STDIN_FILENO) {
    char buffer[512];
    for (;;) {
        ssize_t bytes = read(STDIN_FILENO, buffer, sizeof(buffer));
        if (bytes > 0) continue;   // process buffer, then keep draining
        if (bytes == 0) break;     // EOF
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                 // fully drained (non-blocking fd)
        break;                     // real error
    }
}

Alternatively, re-arming the descriptor manually in ET mode forces a re-evaluation:

epoll_ctl(ep_inst, EPOLL_CTL_MOD, STDIN_FILENO, &cfg);

Write Behavior in ET vs LT

Monitoring standard output demonstrates write buffering differences:

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

int main() {
    int ep_inst = epoll_create(1);
    struct epoll_event cfg, occurred[5];
    cfg.data.fd = STDOUT_FILENO;
    cfg.events = EPOLLOUT | EPOLLET; // Edge Triggered
    epoll_ctl(ep_inst, EPOLL_CTL_ADD, STDOUT_FILENO, &cfg);

    for (;;) {
        int n = epoll_wait(ep_inst, occurred, 5, -1);
        for (int i = 0; i < n; i++) {
            if (occurred[i].data.fd == STDOUT_FILENO) {
printf("Ready to write!"); // no newline: stays in the stdio buffer
            }
        }
    }
}

Standard output to a terminal is almost always writable. In ET mode, EPOLLOUT is reported once when the descriptor is registered (the initial writable edge) and not again unless the kernel output buffer first fills and then drains, so the program prints its message once and then blocks in epoll_wait. Note also that printf without a newline leaves the message in the line-buffered stdio buffer, so it may not appear immediately. In LT mode, the descriptor is writable on every call, so epoll_wait returns immediately each time, producing a busy loop.

Accept Handling in ET Mode

If multiple connections arrive simultaneously, ET mode triggers the listening socket only once. Failing to process all pending connections causes dropped connections. The solution is a non-blocking accept loop:

while (1) {
    // listen_fd must be non-blocking for this loop to terminate
    int client_fd = accept(listen_fd, NULL, NULL);
    if (client_fd == -1) {
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break; // all pending connections processed
        }
        // Handle error (e.g. ECONNABORTED can simply be skipped)
        break;
    }
    // Set client_fd non-blocking and register it with epoll
}

Non-Blocking Descriptors in ET Mode

Because ET mode requires exhaustive reading or writing until the operation would block (EAGAIN/EWOULDBLOCK), the file descriptor must be set to non-blocking. If a blocking descriptor is used, the final read or write operation will halt the entire thread, causing other descriptors to starve.

Tags: Linux I/O Multiplexing Epoll C networking

Posted on Thu, 07 May 2026 19:47:23 +0000 by fearfx