Linux Kernel Implementation of epoll

The epoll(2) mechanism in the Linux kernel provides an efficeint I/O event notification system. This analysis is based on kernel version 5.0.18 and explores its internal structures, system calls, and handling of ready events.

Core Data Structures

The primary data structure representing an epoll instance is struct eventpoll:

struct eventpoll {
    spinlock_t lock;
    struct mutex mtx;

    wait_queue_head_t wq;         /* Used by sys_epoll_wait() */
    wait_queue_head_t poll_wait;  /* For when this epoll fd is itself monitored */

    struct list_head rdllist;     /* List of ready file descriptors */
    struct epitem *ovflist;       /* Overflow list for concurrent ready events */

    struct rb_root_cached rbr;    /* Red-black tree of monitored fds */

    struct user_struct *user;
    struct file *file;

    int visited;
    struct list_head visited_list_link;
};

Each monitored file descriptor is represented by a struct epitem:

struct epitem {
    union {
        struct rb_node rbn;
        struct rcu_head rcu;
    };

    struct list_head rdllink;
    struct epitem *next;
    struct epoll_filefd ffd;

    int nwait;
    struct list_head pwqlist;

    struct eventpoll *ep;
    struct list_head fllink;

    struct epoll_event event;
};

The struct eppoll_entry links the monitored file's wait queue to the epitem:

struct eppoll_entry {
    struct list_head llink;
    struct epitem *base;
    wait_queue_entry_t wait;
    wait_queue_head_t *whead;
};

System Call Implementation

epoll_create(2)

Creates a new epoll instance by allocating a eventpoll structure and associating it with an anonymous inode file.

epoll_ctl(2)

Manages the set of monitored file descriptors through three operations:

  • EPOLL_CTL_ADD: Inserts a new file descriptor via ep_insert()
  • EPOLL_CTL_DEL: Removes a file descriptor via ep_remove()
  • EPOLL_CTL_MOD: Modifies monitoring parameters via ep_modify()

The ep_insert() function performs several critical steps:

  1. Allocates and initializes an epitem
  2. Registers a callback (ep_ptable_queue_proc) with the target file's poll mechanism
  3. Adds the item to the red-black tree and potentially to the ready list if already ready

epoll_wait(2)

Implemented primarily by ep_poll(), which:

  1. Checks for ready events using ep_events_available() (O(1) complexity)
  2. If no events are available, blocks on the wait queue until awakened
  3. Copies ready events to userspace via ep_send_events()

Ready Event Handling

The key to epoll's efficiency is its ready list management. When a monitored file becomes ready, its poll calback (ep_poll_callback) is invoked, which adds the corresponding epitem to the ready list.

The ep_scan_ready_list() function processes ready events by:

  1. Moving the current ready list to a temporary list
  2. Setting ovflist to NULL to handle concurrent additions during processing
  3. Executing the processing callback (ep_send_events_proc)
  4. Merging any events added during processing from ovflist back to the main ready list

In ep_send_events_proc(), the distinction between edge-triggered (EPOLLET) and level-triggered modes is implemented:

  • For level-triggered (default), items are re-added to the ready list after processing
  • For edge-triggered, items are not automatically re-added, requiring explicit re-monitoring

Epoll Nesting and Loop Detection

Since epoll file descriptors can monitor other epoll instances, complex dependency graphs can form. The kernel prevents infinite recursion through:

  1. Loop detection: During EPOLL_CTL_ADD, ep_loop_check() traverses the monitoring graph to detect cycles
  2. Nesting limits: The reverse_path_check() function enforces depth-based limits on wakeup chains

The ep_call_nested() helper manages recursive traversal with per-task context tracking to avoid false positives in multi-threaded scenarios.

Key Design Insights

  1. O(1) event retrieval: By maintaining a separate ready list, epoll_wait() avoids scanning all monitored descriptors
  2. Callback-driven readiness: Integration with each file's poll mechanism enables immediate notification of state changes
  3. Edge vs. level triggering: Implemented through post-processing ready list management rather than fundamentally different mechanisms
  4. Exclusive wakeups: The EPOLLEXCLUSIVE flag prevents thundering herd problems by limiting wakeups to one waiter per event

Tags: linux-kernel Epoll system-calls event-driven-io polling-mechanisms

Posted on Wed, 13 May 2026 21:03:41 +0000 by JackOfBlades