Scalable Concurrent Memory Allocator Architecture

Dynamic memory allocation via standard library routines introduces measurable overhead in high-throughput systems. Frequent transitions to kernel space, unpredictable block sizes, and continuous allocation/deallocation cycles generate both internal and external fragmentation. These inefficiencies degrade cache locality, increase latency, and burden the operating system’s virtual memory manager. A hierarchical memory pool mitigates these issues by pre-reserving large contiguous regions and managing sub-allocations locally, while recycling released blocks to avoid immediate OS relinquishment.

Fixed-Size Block Allocator

When allocation requests target a single uniform size, fragmentation within the pool is eliminated. The design relies on a pre-mapped memory buffer combined with a singly linked recycling chain. On first use, the allocator reserves a substantial memory region. When a request arrives, it checks the recycling chain first; if the chain is empty, it slices a block from the reserved buffer. Once the buffer is exhausted, a new region is requested from the kernel. Released objects bypass the OS and are prepended to the recycling chain.

To maintain cross-platform compatibility (32-bit vs 64-bit), the allocator overlays a pointer-sized field at the start of each released block. This ensures the initial bytes always store the next available address, regardless of the original object type.

template <typename T>
class UniformBlockAllocator {
private:
    char* pool_buffer = nullptr;
    size_t available_capacity = 0;
    void* recycling_chain = nullptr;

public:
    T* allocate() {
        T* target = nullptr;
        
        // Reuse previously released block
        if (recycling_chain != nullptr) {
            void* next_block = *static_cast<void**>(recycling_chain);
            target = static_cast<T*>(recycling_chain);
            recycling_chain = next_block;
        } else {
            // Each slot must be at least pointer-sized to hold the recycling link
            size_t block_stride = (sizeof(T) < sizeof(void*)) ? sizeof(void*) : sizeof(T);

            // Request a new kernel region if the pool is exhausted
            if (available_capacity < block_stride) {
                available_capacity = 1 << 17; // 128 KB
                pool_buffer = static_cast<char*>(syscall_mmap(available_capacity)); // mmap/VirtualAlloc wrapper
                if (!pool_buffer) throw std::bad_alloc();
            }

            target = reinterpret_cast<T*>(pool_buffer);
            pool_buffer += block_stride;
            available_capacity -= block_stride;
        }
        
        // Placement new for construction
        new (target) T();
        return target;
    }

    void deallocate(T* obj) {
        if (obj) {
            obj->~T();
            // Prepend to recycling chain
            void** header = reinterpret_cast<void**>(obj);
            *header = recycling_chain;
            recycling_chain = obj;
        }
    }
};
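
A minimal usage sketch, assuming a hypothetical Particle payload type; the comment about q == p holds because a released block is always the first candidate for reuse:

struct Particle { double x, y, z; };

void allocator_demo() {
    UniformBlockAllocator<Particle> pool;

    // The first request slices a fresh block out of the mapped buffer
    Particle* p = pool.allocate();
    p->x = 1.0;

    // Release bypasses the OS; the block lands on the recycling chain
    pool.deallocate(p);

    // The next request pops that same block, so q == p at this point
    Particle* q = pool.allocate();
    pool.deallocate(q);
}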

Multi-Level Concurrent Architecture

Scaling to multi-threaded workloads requires eliminating contention on a single allocation heap. The architecture partitions memory management into three distinct layers, each optimized for specific access patterns and contention levels:

  1. Thread-Local Cache (TLC): Per-thread storage enabling lock-free allocation of small objects (≤ 256KB).
  2. Central Cache (CC): A globally shared pool servicing thread-local caches. Utilizes fine-grained bucket locks to minimize serialization.
  3. Page Cache (PC): The top-level manager interacting with the OS. Handles memory in fixed-size pages, merging adjacent free spans to combat external fragmentation.
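
Before examining each layer, here is a rough sketch of the front-end that ties them together. The class names mirror the fragments later in the post; current(), page_index_to_address() and the PAGE_SIZE constant are assumptions used only for illustration:

// Front-end routing: small requests take the lock-free thread-local path,
// anything above the small-object threshold goes straight to the page cache.
void* concurrent_alloc(size_t bytes) {
    if (bytes <= SizeClassifiers::MAX_SMALL_OBJ) {
        // Fast path: no locks, the calling thread owns its cache
        return LocalThreadCache::current()->request_space(bytes);
    }
    // Large objects: whole page runs handed out under the page cache lock
    size_t pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
    MemoryRegion* span = SystemPageManager::get_instance().request_region(pages);
    return span ? page_index_to_address(span->page_start_idx) : nullptr;
}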

Thread-Local Cache Design

Each thread maintains an array of recycling chains indexed by aligned size class. Maintaining a separate chain for every possible byte size would require an impractically large table, so size classes employ progressive alignment rules: small objects align to 8 bytes, scaling up to 8KB alignment for the largest small-object requests. This caps the chain array at roughly 200 buckets (208 in the fragments below).

Thread-local storage guarantees isolation. When a thread requests memory, the allocator computes the aligned size, indexes the corresponding chain, and pops a block. If empty, it fetches a batch from the Central Cache. The batch size follows a slow-start feedback algorithm, scaling up to a cap of 512 objects to balance latency and memory footprint.

class SizeClassifiers {
public:
    static constexpr size_t MAX_SMALL_OBJ = 256 * 1024;
    
    static size_t align_step(size_t requested) {
        if (requested <= 128) return 8;
        if (requested <= 1024) return 16;
        if (requested <= 8192) return 128;
        if (requested <= 65536) return 1024;
        return 8192;
    }
    
    static size_t compute_aligned(size_t req, size_t step) {
        return ((req + step - 1) & ~(step - 1));
    }
    
    // Map an aligned size to its bucket index; each alignment tier
    // contributes a fixed number of buckets (16 + 56 + 56 + 56 + 24 = 208)
    static size_t bucket_index(size_t aligned) {
        if (aligned <= 128)   return aligned / 8 - 1;                   // 0..15
        if (aligned <= 1024)  return 16 + (aligned - 128) / 16 - 1;     // 16..71
        if (aligned <= 8192)  return 72 + (aligned - 1024) / 128 - 1;   // 72..127
        if (aligned <= 65536) return 128 + (aligned - 8192) / 1024 - 1; // 128..183
        return 184 + (aligned - 65536) / 8192 - 1;                      // 184..207
    }
};

struct ChainNode {
    void* next = nullptr;
};

class LocalThreadCache {
    static thread_local LocalThreadCache* current_tls;
    std::array<std::vector<void*>, 208> tiered_chains;
    size_t max_batch = 1;
    
    void* fetch_batch_from_central(size_t idx, size_t aligned_sz);
    
public:
    void* request_space(size_t bytes) {
        size_t step = SizeClassifiers::align_step(bytes);
        size_t aligned = SizeClassifiers::compute_aligned(bytes, step);
        size_t idx = SizeClassifiers::bucket_index(aligned);
        
        if (tiered_chains[idx].empty()) {
            return fetch_batch_from_central(idx, aligned);
        }
        void* block = tiered_chains[idx].back();
        tiered_chains[idx].pop_back();
        return block;
    }
    
    void release_space(void* ptr, size_t bytes) {
        size_t step = SizeClassifiers::align_step(bytes);
        size_t aligned = SizeClassifiers::compute_aligned(bytes, step);
        size_t idx = SizeClassifiers::bucket_index(aligned);
        
        tiered_chains[idx].push_back(ptr);
        if (tiered_chains[idx].size() >= max_batch) {
            // Trim and return excess to central layer
        }
    }
};
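
The fetch_batch_from_central declaration above is left empty; below is a sketch of the slow-start refill it performs, assuming the GlobalAllocationHub::fetch_block_range interface shown in the next section:

void* LocalThreadCache::fetch_batch_from_central(size_t idx, size_t aligned_sz) {
    // Slow start: request max_batch blocks and grow the batch by one on each
    // refill, capped at 512 objects to bound per-thread hoarding
    size_t count = max_batch;
    if (max_batch < 512) ++max_batch;

    void* tail = nullptr;
    void* head = GlobalAllocationHub::instance().fetch_block_range(count, aligned_sz, tail);
    if (!head) return nullptr; // central cache and page cache both exhausted

    // Hand the first block to the caller, stash the rest in the local chain
    void* cursor = static_cast<ChainNode*>(head)->next;
    while (cursor) {
        tiered_chains[idx].push_back(cursor);
        cursor = static_cast<ChainNode*>(cursor)->next;
    }
    return head;
}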

Central Cache & Span Management

The Central Cache bridges local caches and the OS. It manages MemoryRegion structures representing contiguous page-aligned blocks. Each region tracks its sliced object free list and an active usage counter.

Contention is isolated per size bucket, so threads requesting different sizes operate independently. When a local cache requests a refill, the Central Cache locks only the target bucket, retrieves an available region, slices it into object-sized chunks, and transfers them. If no suitable region exists, it requests pages from the Page Cache.

struct MemoryRegion {
    size_t page_start_idx = 0;
    size_t page_count = 0;
    size_t object_stride = 0;
    size_t allocated_items = 0;
    void* free_chain_head = nullptr;
    bool actively_used = false;
    
    MemoryRegion* prev = nullptr;
    MemoryRegion* next = nullptr;
};

class RegionChain {
    MemoryRegion* sentinel;
    std::mutex bucket_mutex;
public:
    RegionChain() { sentinel = new MemoryRegion(); sentinel->next = sentinel->prev = sentinel; }
    void attach(MemoryRegion* reg) { /* insert at the front of the circular list */ }
    void detach(MemoryRegion* reg) { /* unlink an arbitrary region (used by the page cache) */ }
    MemoryRegion* detach_front() { /* unlink and return the first region, or nullptr if empty */ }
    bool empty() const { return sentinel->next == sentinel; }
    std::mutex& lock() { return bucket_mutex; }
};

class GlobalAllocationHub {
    RegionChain buckets[208];

public:
    static GlobalAllocationHub& instance() { static GlobalAllocationHub inst; return inst; }

    void* fetch_block_range(size_t& out_count, size_t size, void*& out_end) {
        // 'size' is an aligned size class, so it maps directly to a bucket
        size_t idx = SizeClassifiers::bucket_index(size);
        std::unique_lock<std::mutex> guard(buckets[idx].lock());
        
        MemoryRegion* span = buckets[idx].detach_front();
        if (span && !span->free_chain_head) {
            buckets[idx].attach(span); // fully allocated span stays tracked
            span = nullptr;
        }
        if (!span) {
            // Drop the bucket lock while the page cache's global lock is taken
            guard.unlock();
            size_t pages = (size * out_count + PAGE_SIZE - 1) / PAGE_SIZE;
            span = SystemPageManager::get_instance().request_region(pages);
            if (!span) return nullptr;
            // Slice the fresh pages into 'size'-byte objects to build
            // span->free_chain_head (a slicing sketch follows this fragment)
            guard.lock();
        }
        
        // Transfer up to out_count objects from the span's free chain
        void* start = span->free_chain_head;
        void* current = start;
        size_t fetched = 1;
        
        while (fetched < out_count && static_cast<ChainNode*>(current)->next) {
            current = static_cast<ChainNode*>(current)->next;
            fetched++;
        }
        
        span->free_chain_head = static_cast<ChainNode*>(current)->next;
        static_cast<ChainNode*>(current)->next = nullptr;
        span->allocated_items += fetched;
        buckets[idx].attach(span);
        
        out_end = current;
        out_count = fetched;
        return start;
    }
};
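
The slicing step is only hinted at in fetch_block_range; one way to carve a fresh span's pages into a free chain is sketched below. page_index_to_address is a hypothetical helper that converts a page index back to a raw pointer, and no lock is needed because the span is not yet visible to other threads:

// Carve a page run into 'stride'-byte blocks threaded into the span's free chain
void slice_region(MemoryRegion* span, size_t stride) {
    char* base = static_cast<char*>(page_index_to_address(span->page_start_idx));
    size_t total = span->page_count * PAGE_SIZE;

    span->object_stride = stride;
    span->free_chain_head = base;

    char* cursor = base;
    // Link each block to the one after it; stop at the last block that fits
    while (cursor + 2 * stride <= base + total) {
        *reinterpret_cast<void**>(cursor) = cursor + stride;
        cursor += stride;
    }
    *reinterpret_cast<void**>(cursor) = nullptr; // terminate the chain
}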

Page Cache & Region Merging

The Page Cache interacts directly with kernel allocators (mmap, VirtualAlloc). It tracks regions by page count in indexed chains. When a fully freed region returns from the Central Cache, the Page Cache inspects the adjacent page indices. If the neighboring regions are also free, they are coalesced into a larger block, reducing external fragmentation. The merged region is then promoted to a higher-index bucket.

class SystemPageManager {
    RegionChain page_buckets[129];
    std::unordered_map<size_t, MemoryRegion*> page_lookup;
    std::mutex global_lock;
    
public:
    static SystemPageManager& get_instance() { static SystemPageManager mgr; return mgr; }

    MemoryRegion* request_region(size_t pages_req) {
        std::unique_lock<std::mutex> guard(global_lock);
        if (pages_req <= 128) {
            // Search the exact bucket, then larger buckets, splitting if necessary
            // (a split sketch follows this fragment)
        }
        // Fallback: map a fresh 128-page run from the OS
        void* addr = kernel_allocate(128 * PAGE_SIZE);
        MemoryRegion* fresh = allocate_span_control();
        fresh->page_start_idx = reinterpret_cast<uintptr_t>(addr) >> PAGE_SHIFT;
        fresh->page_count = 128;
        page_buckets[128].attach(fresh);
        // Index the span by its first and last page so neighbors can be
        // located during coalescing
        page_lookup[fresh->page_start_idx] = fresh;
        page_lookup[fresh->page_start_idx + fresh->page_count - 1] = fresh;
        // std::mutex is not recursive: release it before retrying the search
        guard.unlock();
        return request_region(pages_req); // retry; the bucket search can now split the fresh span
    }
    
    void return_region(MemoryRegion* reg) {
        std::unique_lock<std::mutex> guard(global_lock);
        reg->actively_used = false;
        // Merge with the previous span: its last page sits just below ours
        auto prev_it = page_lookup.find(reg->page_start_idx - 1);
        if (prev_it != page_lookup.end() && !prev_it->second->actively_used) {
            MemoryRegion* prev = prev_it->second;
            reg->page_start_idx = prev->page_start_idx;
            reg->page_count += prev->page_count;
            page_buckets[prev->page_count].detach(prev);
            page_lookup.erase(prev_it);
            dealloc_span_control(prev);
        }
        // Merge with next (mirror-image logic, omitted); spans that grow
        // beyond 128 pages would need an overflow bucket
        page_buckets[reg->page_count].attach(reg);
        page_lookup[reg->page_start_idx] = reg;
        page_lookup[reg->page_start_idx + reg->page_count - 1] = reg;
    }
};
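
The bucket search in request_region is elided above; when a larger span is found, it has to be split. A sketch of that step, using a hypothetical split_region member and assuming global_lock is already held:

// Keep the first pages_req pages for the caller and return the remainder
// to the bucket that matches its new length
MemoryRegion* SystemPageManager::split_region(MemoryRegion* span, size_t pages_req) {
    if (span->page_count == pages_req) return span;

    MemoryRegion* remainder = allocate_span_control();
    remainder->page_start_idx = span->page_start_idx + pages_req;
    remainder->page_count = span->page_count - pages_req;
    span->page_count = pages_req;

    // Re-index both pieces by first and last page so coalescing still works
    page_lookup[remainder->page_start_idx] = remainder;
    page_lookup[remainder->page_start_idx + remainder->page_count - 1] = remainder;
    page_lookup[span->page_start_idx + span->page_count - 1] = span;

    page_buckets[remainder->page_count].attach(remainder);
    return span;
}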

Lock-Free Mapping via Radix Trees

Profiling initial implementations reveals that translating a pointer back to its managing MemoryRegion creates a severe bottleneck. Using std::unordered_map necessitates a global mutex because dynamic rehashing invalidates iterators and causes data races. This serialization heavily degrades concurrent throughput.

Replacing the hash map with a static radix tree (direct addressing) eliminates lock contention on the lookup path. The array index corresponds directly to the page identifier. For 32-bit systems, a single-level array requires only a few megabytes. On 64-bit architectures, a flat table spanning the full address space would itself be prohibitively large, so a multi-level radix tree is used instead, allocating second-level tables lazily as address ranges come into use.

template <size_t SHIFT_BITS>
class DirectPageIndexer {
    void** index_table;
public:
    DirectPageIndexer() {
        size_t entries = 1ULL << SHIFT_BITS;
        index_table = static_cast<void**>(kernel_allocate(entries * sizeof(void*)));
        std::memset(index_table, 0, entries * sizeof(void*));
    }
    
    void* resolve(size_t page_key) const {
        if ((page_key >> SHIFT_BITS) > 0) return nullptr;
        return index_table[page_key];
    }
    
    void bind(size_t page_key, void* target) {
        assert((page_key >> SHIFT_BITS) == 0);
        index_table[page_key] = target;
    }
};
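
The 64-bit multi-level variant can be sketched as a two-level tree. The split between root and leaf bits is an assumption, and kernel_allocate is the same placeholder used in the fragments above:

template <size_t ROOT_BITS, size_t LEAF_BITS>
class TwoLevelPageIndexer {
    struct Leaf { void* slots[1ULL << LEAF_BITS] = {}; };
    Leaf* roots[1ULL << ROOT_BITS] = {};
    
public:
    void* resolve(size_t page_key) const {
        size_t r = page_key >> LEAF_BITS;
        if (r >= (1ULL << ROOT_BITS) || roots[r] == nullptr) return nullptr;
        return roots[r]->slots[page_key & ((1ULL << LEAF_BITS) - 1)];
    }
    
    // Leaf tables are created lazily, only for page ranges actually in use;
    // bind() is assumed to run under the Page Cache lock
    bool bind(size_t page_key, void* target) {
        size_t r = page_key >> LEAF_BITS;
        if (r >= (1ULL << ROOT_BITS)) return false;
        if (roots[r] == nullptr) {
            void* mem = kernel_allocate(sizeof(Leaf));
            if (!mem) return false;
            roots[r] = new (mem) Leaf();
        }
        roots[r]->slots[page_key & ((1ULL << LEAF_BITS) - 1)] = target;
        return true;
    }
};

With ROOT_BITS = 19 and LEAF_BITS = 19, for instance, the root array costs about 4 MB of pointers while leaf tables only materialize for address ranges the allocator actually touches.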

Integrating the radix tree transforms pointer mapping into a constant-time, lock-free array lookup. This optimization typically yields a 2x to 3x improvement over standard allocators in high-concurrency benchmarks, as threads bypass mutex contention entirely during object recycling and region lookup.

Tags: memory-management concurrency c-plus-plus system-programming performance-optimization

Posted on Fri, 08 May 2026 08:17:50 +0000 by sglane