Foundations of Operations Management and the IT Operations Framework

Operations management encompasses the strategies and practices used to ensure the reliability, efficiency, and continuous evolution of IT services. This article explores the core components, including ITIL service structures, incident and problem management, standardization, automation, high availability concepts, and team organization.

ITIL Service Desk Structure

A typical service desk is organized into three tiers:

First line: Handles service requests and initial triage.
Second line: Provides architectural support and deeper technical analysis.
Third line: Addresses application-specific issues and advanced troubleshooting.

Core Management Practices

Change Management

Change management involves structured notifications and the use of detailed checklists to minimize risk during modifications.

Configuration Management

Configuration management maintains a reliable record of all configuration items (CIs) and their relationships within the IT infrastructure.
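
As a rough illustration of why CI relationships matter, the sketch below models a toy in-memory CMDB and answers the question "what is affected if this item fails?". The class name, method names, and sample CIs are all hypothetical; a real CMDB would be a persistent system, not a Python dictionary.

```python
from collections import defaultdict, deque


class Cmdb:
    """Toy in-memory CMDB: configuration items plus 'depends on' links."""

    def __init__(self):
        self.items = {}                      # CI name -> attributes
        self.dependents = defaultdict(set)   # CI name -> CIs that depend on it

    def add_ci(self, name, **attrs):
        self.items[name] = attrs

    def add_dependency(self, ci, depends_on):
        # Both ends must already be registered so the record stays consistent.
        for name in (ci, depends_on):
            if name not in self.items:
                raise KeyError(f"unknown CI: {name}")
        self.dependents[depends_on].add(ci)

    def impacted_by(self, name):
        """Return every CI that directly or transitively depends on `name`."""
        seen, queue = set(), deque([name])
        while queue:
            current = queue.popleft()
            for dependent in self.dependents[current]:
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        return sorted(seen)


# Hypothetical sample data.
cmdb = Cmdb()
cmdb.add_ci("db-01", type="database")
cmdb.add_ci("order-api", type="service")
cmdb.add_ci("web-frontend", type="service")
cmdb.add_dependency("order-api", depends_on="db-01")
cmdb.add_dependency("web-frontend", depends_on="order-api")

print(cmdb.impacted_by("db-01"))  # ['order-api', 'web-frontend']
```

Storing the reverse edge (who depends on whom) keeps the impact query a simple breadth-first walk.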

Event Management

Event management monitors and manages events throughout the IT environment to confirm normal operation and detect exceptions.

Fault Management: Classification and Handling Process

Effective fault management relies on a clear process tied to business continuity and disaster recovery planning (BCM, BCP, DRP). Key steps include:

Categorizing faults by severity and impact.
Defining escalation paths.
Implementing resolution steps and post-incident reviews.
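
The first two steps can be sketched as a small lookup. The three-point scales, the P1 to P4 labels, and the escalation paths below are assumptions made for illustration; each organization would define its own matrix.

```python
from dataclasses import dataclass

# Hypothetical escalation paths keyed by priority.
ESCALATION_PATHS = {
    "P1": ["on-call L2", "L3 specialist", "operations manager"],
    "P2": ["on-call L2", "L3 specialist"],
    "P3": ["L1 service desk", "on-call L2"],
    "P4": ["L1 service desk"],
}


@dataclass(frozen=True)
class Fault:
    summary: str
    severity: int  # 1 = cosmetic, 2 = degraded, 3 = service down
    impact: int    # 1 = one user, 2 = one team or region, 3 = all users


def classify(fault: Fault) -> str:
    """Map severity and impact onto a priority label."""
    for name, value in (("severity", fault.severity), ("impact", fault.impact)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2, or 3, got {value}")
    score = fault.severity + fault.impact  # 2 (minor) to 6 (critical)
    return {6: "P1", 5: "P2", 4: "P3"}.get(score, "P4")


def escalation_path(fault: Fault) -> list:
    return ESCALATION_PATHS[classify(fault)]


outage = Fault("checkout returns 500 for everyone", severity=3, impact=3)
print(classify(outage), escalation_path(outage))
```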

Problem Management

Problem management goes beyond individual incidents to identify root causes and prevent recurrence.

Implementing Standardization and Automation

Moving from theory to practice requires a focus on standardization and automation. A standard operational checklist includes:

CMDB asset inventory and backup procedures.
High availability strategies: rate limiting, graceful degradation, circuit breakers, chaos engineering, and rollback plans.
Monitoring and observability: alerts, log aggregation, distributed tracing, and capacity planning.
Change delivery: CI/CD pipelines, canary releases, and gradual rollouts.
System installation: standardized setup processes and middleware management.
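
Of the high availability strategies in this checklist, the circuit breaker is one of the easiest to show in code. The following is a minimal sketch, not a production implementation: the class name, thresholds, and the injectable `clock` parameter are assumptions chosen to keep the example small and testable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after   # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a fragile dependency, for example `breaker.call(fetch_inventory, sku)` with a hypothetical `fetch_inventory`, and treat `CircuitOpenError` as the signal to fall back to a degraded response.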

Operational Maturity Model

The target operational evolution follows a staged path:

Standardization (establishing conventions)
Automation
Intelligence (data-driven decisions)
Visualization and data-driven operations (SRE, SLAs)

Management Domains

Service Catalog & Request Fulfillment
Change and Configuration Management
Event, Fault, and Problem Management

Team Structure

L1: Service desk and straightforward request fulfillment.
L2: Generalist engineers handling complex issues.
L3: Specialists focused on specific applications or domains.

Operational Goals

The primary goal is to support continuous business iteration and growth while maintaining stability, efficiency, and cost-effectiveness. Key soft skills include communication, coordination, information synchronization, and a results-oriented mindset.

Problem-Solving Approach

Break down complex issues by systematically identifying what the problem is, decomposing it, and tackling each part.

Core Objectives of Operations

Supporting business functions
Ensuring high availability
Designing for scalability
Reducing costs while increasing efficiency
Ensuring high performance

Operational Components: People, Tasks, and Tools

Effective operations management involves managing people, tasks, relationships, structures, processes, controls, acceptance criteria, and milestones.

Essential Operational Tasks

CMDB & Asset Configuration Management
Unified Authentication: Internal services, SSO, OpenLDAP.
Email Systems
Network Infrastructure: Intranet, VPN, jump hosts, bastion hosts.
DNS Management
Desktop Support: Printers, projectors, meeting rooms, office software.
Documentation Wiki
Project & Bug Tracking: JIRA.
Service Tree Mapping
Unified Monitoring & Logging Platforms
Ticketing Platforms
Office Automation (OA) Systems
Backup & Recovery
Hardware Procurement
Code Hosting
Firewall Management
Middleware & Components: Nginx, Redis, databases, load balancers.
Release & DevOps Platforms: CI/CD.
Cloud Platforms: IaaS, PaaS, SaaS, containers, Kubernetes.

Advanced Platforms and Tools

Beyond the essentials, a mature operation typically integrates:

Push notification platform (SMS, email, Lark, DingTalk, WeChat)
Automated testing platform
Performance testing (stress test) platform
Chaos engineering platform
API Gateway
Database Management System (DBMS)
Configuration management platform
Data backbone: MQ, data links, Elasticsearch, HBase, Hadoop
Scheduled task system
Service registry
Artifact repositories: Maven, npm, Flutter, container image registries
Data warehousing, BI, and reporting platforms
Product prototyping system
Secrets management (e.g., KeePass)

Understanding High Availability (The "Nines")

System reliability is often measured in "nines" of availability, representing the percentage of uptime over a year.

3 nines (99.9%): Allows for up to 8.76 hours of downtime annually.
4 nines (99.99%): Allows for 52.6 minutes of downtime annually.
5 nines (99.999%): Allows for 5.26 minutes of downtime annually.

A broader comparison of availability levels:

1 nine (90%): 36.5 days of downtime.
2 nines (99%): 3.65 days of downtime.
6 nines (99.9999%): About 31.5 seconds of downtime.

While six nines is technically achievable, the steep cost increase from five to six nines makes it impractical for most business cases.

Availability	Nines	Approx. Annual Downtime (Minutes)	Typical Application
0.999	3	500	PCs / Servers
0.9999	4	50	Enterprise Devices
0.99999	5	5	Telecom Equipment
0.999999	6	0.5	High-End Telecom
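
The downtime figures for each level follow from simple arithmetic, which the sketch below reproduces. It assumes a 365-day year, and the function name is illustrative. Computed this way, three nines comes to roughly 525.6 minutes (8.76 hours) per year, so the minute values in the table are best read as order-of-magnitude approximations.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a non-leap year


def annual_downtime_seconds(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return SECONDS_PER_YEAR * unavailability


for n in range(1, 7):
    seconds = annual_downtime_seconds(n)
    print(f"{n} nines: {seconds / 60:.2f} minutes per year")
```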

A Comprehensive Operational Model

A holistic model for IT operations covers business, product, and technical architecture.

Static Foundation

Technology Architecture: Static analysis and design.
Technical Environment: Asset management, credentials management, CMDB, and resource topology.

Dynamic Operations

Installation and Standardization: Best practices for metrics (counters, timers, capacity, stress testing).
Monitoring and Observability
High Availability Strategies: Failover, replication, rate limiting, degradation, circuit breaking, chaos engineering, rollback.
Backup and Recovery Drills
Fault Handling
Release Management: DevOps pipelines, canary releases, low-traffic testing.
Routine Inspections
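
For the canary releases mentioned under Release Management, one common approach is to route a fixed percentage of users to the new version by hashing a stable identifier. The sketch below assumes that approach; the function name and the choice of SHA-256 are illustrative, and real rollouts are usually handled by the load balancer or deployment platform.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the canary percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the user onto a stable bucket from 0 to 99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


users = [f"user-{i}" for i in range(1000)]
share = sum(in_canary(u, 10) for u in users) / len(users)
print(f"share of users in a 10% canary: {share:.1%}")
```

Because the bucket depends only on the identifier, a user who is in the canary at 10% stays in it as the rollout widens to 50% and then 100%.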

Process Management

Service Requests: Standardized workflows, SSO/LDAP, JIRA, Wiki, network security, VPN, bastion hosts.
Change Management: Ticketing, announcement, and scope review.
Configuration Management
Event, Fault, and Problem Management

Team Tiers

L1: Handles simple tasks with direct, immediate results.
L2: Manages comprehensive, cross-domain issues.
L3: Focuses on deep, domain-specific solutions.

Maturation Trajectory

Standardization → Process formalization → Automation → Platform building → Intelligence → Visualization → Data-driven operations (intelligent monitoring and analytics)

Key Artifacts

Documentation of service relationships and call chains, mapping business functions to technical systems.
Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs).
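
To make the relationship between an SLI and an SLO concrete, the sketch below computes a request-based availability SLI and the remaining "error budget", a term from SRE practice that the list above does not name. The function name, field names, and sample numbers are assumptions for illustration.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Compare a measured availability SLI against its SLO."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")
    sli = (total_requests - failed_requests) / total_requests
    allowed_failures = (1 - slo) * total_requests
    return {
        "sli": sli,
        "slo_met": sli >= slo,
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
    }


# Hypothetical month: 1,000,000 requests, 400 failures, 99.9% SLO.
report = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

An SLA would then be the contractual commitment built on top of such an objective, typically set looser than the internal SLO.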

Project Orientation and Assessment

When taking over a system, a thorough assessment is the first step to ensure maintainability. This typically includes:

Project overview
Credentials and account table
Resource and configuration list
Various architecture and flow diagrams
Deployment and maintenance documentation
Monitoring strategy summary
Emergency operations manual

1. Project Overview

Understand the business scope (what the project does) and identify the project lead and relevant personnel for smooth handover.

2. Credentials Table

A detailed record of all server, platform, and system usernames and passwords. An associated permissions log should track who has access to what.

3. Resource and Configuration List

This should encompass multiple views, such as:

Server inventory
Project domain names (all environments, e.g., production, staging)
Project services
Third-party service resources

4. Architecture and Flow Diagrams

Create and verify diagrams like the network topology and system architecture. This shows the number of servers, the logical flow (e.g., from DNS and DDoS protection to load balancers and downstream services), and how different services interact with databases and each other.

5. Deployment and Maintenance Documentation

A detailed guide with step-by-step installation instructions and clear configuration points (like database addresses). The document counts as validated only once a colleague can use it to restore or replicate the project independently.

6. Monitoring Strategy Summary

This should cover both base infrastructure monitoring (CPU, memory, bandwidth) and business-specific monitoring that directly exposes service health. Laying the summary out as a table, one row per service, makes monitoring gaps easy to spot and fill.
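
A monitoring summary can also be checked mechanically. The sketch below assumes the summary is kept as a mapping from service name to the checks in place; the data layout, service names, and function name are hypothetical.

```python
# Base infrastructure checks every service is expected to have.
REQUIRED_INFRA = {"cpu", "memory", "bandwidth"}


def monitoring_gaps(coverage: dict) -> dict:
    """Return, per service, the checks that are still missing."""
    gaps = {}
    for service, checks in coverage.items():
        missing = sorted(REQUIRED_INFRA - checks.get("infra", set()))
        if not checks.get("business"):
            missing.append("business health check")
        if missing:
            gaps[service] = missing
    return gaps


# Hypothetical summary for two services.
coverage = {
    "order-api": {
        "infra": {"cpu", "memory", "bandwidth"},
        "business": {"order success rate"},
    },
    "report-job": {"infra": {"cpu"}, "business": set()},
}

print(monitoring_gaps(coverage))
```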

7. Emergency Operations Manual

A playbook for known and frequently occurring issues, detailing their immediate solutions and workarounds.

Building Standards and Processes

After mapping the system, the next step is to enforce standardization. Key conventions include:

Naming conventions: For services, based on project environment or purpose.
Port specifications: Allocated by purpose and environment.
IP address planning: Segmented by project, environment, and region.
Directory structures: Fixed paths for service deployment, log output, backups, toolkits, data, and scripts.
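
Conventions are easier to enforce when they can be checked by a script. The sketch below assumes a hypothetical naming scheme of `<project>-<env>-<role>-<NN>` and a hypothetical port plan with one base port per environment; neither comes from a standard, they only show the idea.

```python
import re

# Hypothetical convention: <project>-<env>-<role>-<two-digit index>
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z][a-z0-9]*)-"
    r"(?P<env>prod|staging|dev)-"
    r"(?P<role>[a-z]+)-"
    r"(?P<index>\d{2})$"
)

# Hypothetical port plan: each environment gets its own block of 1000 ports.
PORT_BASE = {"prod": 8000, "staging": 18000, "dev": 28000}


def parse_name(name: str) -> dict:
    """Split a service name into its parts, or reject it."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return match.groupdict()


def port_for(env: str, slot: int) -> int:
    """Allocate a port from the block reserved for the environment."""
    if not 0 <= slot < 1000:
        raise ValueError("slot must be between 0 and 999")
    return PORT_BASE[env] + slot


print(parse_name("shop-prod-api-01"))
print(port_for("staging", 80))
```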

Operational Processes

Repeatable tasks must be formalized into processes:

Resource budget planning
Server purchasing and provisioning
Service deployment, launch, and maintenance
Account and permission provisioning
Regular backup restoration and verification drills
Periodic resource usage review
Incident report management

The goal is to standardize all resources and turn every repetitive action into a defined process. This prevents oversights and forms the foundation for automation.
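
As an example of turning one of these processes into something repeatable, the sketch below checks a backup restoration drill by comparing file checksums between the original data and the restored copy. The function names and the directory-to-directory comparison are assumptions; real drills often restore to a separate host and compare against stored checksums.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original_dir, restored_dir) -> list:
    """Return a list of files that are missing or differ after a restore."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file():
            problems.append(f"missing: {relative}")
        elif sha256_of(source) != sha256_of(restored):
            problems.append(f"differs: {relative}")
    return problems
```

An empty result means every original file was restored byte for byte; anything else is a finding for the drill report.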

Change Management in Practice

A common pitfall is treating operational changes with less rigor than software development, where design discussions are the norm. For example, modifying a security group should follow a strict process, not an informal notification. Rules must be established to make all changes auditable and reversible.
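
What "auditable and reversible" can mean in practice is sketched below: every change records who made it, why, and a snapshot of the state before and after, so the previous state can be restored. The class, its fields, and the security group example are hypothetical; a real setup would sit behind a ticketing system and the cloud provider's API.

```python
import copy
from datetime import datetime, timezone


class ChangeLog:
    """Applies changes to a dict of settings and keeps an audit trail."""

    def __init__(self, state: dict):
        self.state = state
        self.history = []  # audit entries, oldest first

    def _record(self, author, reason, before):
        self.history.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "author": author,
            "reason": reason,
            "before": before,
            "after": copy.deepcopy(self.state),
        })

    def apply(self, author, reason, mutate):
        before = copy.deepcopy(self.state)
        try:
            mutate(self.state)
        except Exception:
            # A failed change must not leave the state half modified.
            self.state.clear()
            self.state.update(before)
            raise
        self._record(author, reason, before)

    def rollback_last(self, author):
        if not self.history:
            raise ValueError("nothing to roll back")
        last = self.history[-1]
        before = copy.deepcopy(self.state)
        self.state.clear()
        self.state.update(copy.deepcopy(last["before"]))
        # The rollback itself is recorded as a change.
        self._record(author, f"rollback of: {last['reason']}", before)


# Hypothetical security group with one open port.
security_group = {"ingress_ports": [22]}
log = ChangeLog(security_group)
log.apply("alice", "open HTTPS for the new gateway",
          lambda sg: sg["ingress_ports"].append(443))
log.rollback_last("bob")
print(security_group, len(log.history))
```

Calling `rollback_last` twice in a row undoes the rollback, since the most recent entry is then the rollback itself; a fuller design would track change identifiers instead.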

Managing the Three Pillars: People, Tasks, and Assets

People: Define roles, career paths, and skill development. Foster a collaborative team with high technical skill and professional ethics through training and performance reviews.
Tasks: Execute daily operations to protect the production environment. Continuously explore new concepts and optimize system architecture. This includes process management, capacity planning, emergency response, monitoring, and security.
Assets: Manage all physical and logical resources (data centers, servers, networks, software). The goal is clear configuration and lifecycle management: knowing where an asset comes from, where it is, and where it is going.
Processes and Standards: The glue that binds people, tasks, and assets into a smooth, efficient, and stable operational stream.

Core Competency Model for Operations Teams

Standards Implementation: Grounding operations in frameworks like ITIL, ISO/IEC 20000, ITSS, and compliance mandates (e.g., China's Multi-Level Protection Scheme).
Fundamental Guarantees: Configuration management, monitoring, app release, resource scaling, event/incident handling.
Basic Technical Skills: Proficiency in networks, servers, OS, databases, middleware, JVM tuning.
Business Service: SLA management, service desk, business consulting, experience databases.
Availability Management: Routine inspections, business continuity, high-availability architecture, spare parts redundancy.
Risk and Security: Operational auditing, regulatory risk, vulnerability and attack management.
Incident and Problem Management
Continuous Delivery: Application changes, infrastructure delivery, office services.
Proactive Optimization: Architecture, performance, and user experience enhancements.
Emergency Drills: Testing high-availability setups, incident response plans, and team readiness.
Business Support: Data maintenance, extraction, and parameter management.
Operational Analysis: Capacity, performance, and availability analytics.
Operational Enablement: Identifying and solving business pain points, enhancing customer experience.
Cost Control: Optimizing spend on personnel, hardware, bandwidth, and software.
Platform Engineering: Building internal automation tools and cultivating a DevOps culture.

Team Communication and Continuous Improvement

Structured communication is vital for an operations team:

Daily Stand-up: 10 minutes.
Evening Sync-up: 30 minutes.
Weekly Review

The Deming Cycle (PDCA) is a widely used model for achieving sustainable improvement in an IT operations system:

Plan: Define objectives and processes.
Do: Implement the plan.
Check: Monitor and measure results.
Act: Take action to standardize or improve.

Tags: IT Operations ITIL devops SRE High Availability

Posted on Wed, 13 May 2026 17:03:31 +0000 by kante