The epoll(2) mechanism in the Linux kernel provides an efficeint I/O event notification system. This analysis is based on kernel version 5.0.18 and explores its internal structures, system calls, and handling of ready events.
Core Data Structures
The primary data structure representing an epoll instance is struct eventpoll:
struct eventpoll {
spinlock_t lock;
struct mutex mtx;
wait_queue_head_t wq; /* Used by sys_epoll_wait() */
wait_queue_head_t poll_wait; /* For when this epoll fd is itself monitored */
struct list_head rdllist; /* List of ready file descriptors */
struct epitem *ovflist; /* Overflow list for concurrent ready events */
struct rb_root_cached rbr; /* Red-black tree of monitored fds */
struct user_struct *user;
struct file *file;
int visited;
struct list_head visited_list_link;
};
Each monitored file descriptor is represented by a struct epitem:
struct epitem {
union {
struct rb_node rbn;
struct rcu_head rcu;
};
struct list_head rdllink;
struct epitem *next;
struct epoll_filefd ffd;
int nwait;
struct list_head pwqlist;
struct eventpoll *ep;
struct list_head fllink;
struct epoll_event event;
};
The struct eppoll_entry links the monitored file's wait queue to the epitem:
struct eppoll_entry {
struct list_head llink;
struct epitem *base;
wait_queue_entry_t wait;
wait_queue_head_t *whead;
};
System Call Implementation
epoll_create(2)
Creates a new epoll instance by allocating a eventpoll structure and associating it with an anonymous inode file.
epoll_ctl(2)
Manages the set of monitored file descriptors through three operations:
EPOLL_CTL_ADD: Inserts a new file descriptor viaep_insert()EPOLL_CTL_DEL: Removes a file descriptor viaep_remove()EPOLL_CTL_MOD: Modifies monitoring parameters viaep_modify()
The ep_insert() function performs several critical steps:
- Allocates and initializes an
epitem - Registers a callback (
ep_ptable_queue_proc) with the target file's poll mechanism - Adds the item to the red-black tree and potentially to the ready list if already ready
epoll_wait(2)
Implemented primarily by ep_poll(), which:
- Checks for ready events using
ep_events_available()(O(1) complexity) - If no events are available, blocks on the wait queue until awakened
- Copies ready events to userspace via
ep_send_events()
Ready Event Handling
The key to epoll's efficiency is its ready list management. When a monitored file becomes ready, its poll calback (ep_poll_callback) is invoked, which adds the corresponding epitem to the ready list.
The ep_scan_ready_list() function processes ready events by:
- Moving the current ready list to a temporary list
- Setting
ovflistto NULL to handle concurrent additions during processing - Executing the processing callback (
ep_send_events_proc) - Merging any events added during processing from
ovflistback to the main ready list
In ep_send_events_proc(), the distinction between edge-triggered (EPOLLET) and level-triggered modes is implemented:
- For level-triggered (default), items are re-added to the ready list after processing
- For edge-triggered, items are not automatically re-added, requiring explicit re-monitoring
Epoll Nesting and Loop Detection
Since epoll file descriptors can monitor other epoll instances, complex dependency graphs can form. The kernel prevents infinite recursion through:
- Loop detection: During
EPOLL_CTL_ADD,ep_loop_check()traverses the monitoring graph to detect cycles - Nesting limits: The
reverse_path_check()function enforces depth-based limits on wakeup chains
The ep_call_nested() helper manages recursive traversal with per-task context tracking to avoid false positives in multi-threaded scenarios.
Key Design Insights
- O(1) event retrieval: By maintaining a separate ready list,
epoll_wait()avoids scanning all monitored descriptors - Callback-driven readiness: Integration with each file's poll mechanism enables immediate notification of state changes
- Edge vs. level triggering: Implemented through post-processing ready list management rather than fundamentally different mechanisms
- Exclusive wakeups: The
EPOLLEXCLUSIVEflag prevents thundering herd problems by limiting wakeups to one waiter per event