Foundations of Operations Management and the IT Operations Framework

Operations management encompasses the strategies and practices used to ensure the reliability, efficiency, and continuous evolution of IT services. This article explores the core components, including ITIL service structures, incident and problem management, standardization, automation, high availability concepts, and team organization.

ITIL Service Desk Structure

A typical service desk is organized into three tiers:

  • First line: Handles service requests and initial triage.
  • Second line: Provides architectural support and deeper technical analysis.
  • Third line: Addresses application-specific issues and advanced troubleshooting.

Core Management Practices

Change Management

Change management involves structured notifications and the use of detailed checklists to minimize risk during modifications.

Configuration Management

Configuration management maintains a reliable record of all configuration items (CIs) and their relationships within the IT infrastructure.
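One reason to record relationships between CIs is impact analysis: given a failing item, which others depend on it? The sketch below is a minimal in-memory illustration, not a real CMDB product; the class name `Cmdb`, its methods, and the item names are all hypothetical.

```python
from collections import defaultdict, deque


class Cmdb:
    """Tiny in-memory CMDB: configuration items plus 'depends on' relationships."""

    def __init__(self):
        self.items = {}                      # CI name -> attributes
        self.dependents = defaultdict(set)   # CI name -> CIs that depend on it

    def add_item(self, name, **attributes):
        self.items[name] = attributes

    def add_dependency(self, item, depends_on):
        """Record that `item` depends on `depends_on`; both must already exist."""
        for name in (item, depends_on):
            if name not in self.items:
                raise KeyError(f"unknown configuration item: {name}")
        self.dependents[depends_on].add(item)

    def impacted_by(self, name):
        """Return every CI that directly or indirectly depends on `name`."""
        seen, queue = set(), deque([name])
        while queue:
            current = queue.popleft()
            for dependent in self.dependents[current]:
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        return sorted(seen)


# Hypothetical example data.
cmdb = Cmdb()
cmdb.add_item("db-01", kind="database")
cmdb.add_item("order-api", kind="service")
cmdb.add_item("web-frontend", kind="service")
cmdb.add_dependency("order-api", depends_on="db-01")
cmdb.add_dependency("web-frontend", depends_on="order-api")
print(cmdb.impacted_by("db-01"))
```

A breadth-first walk over the reverse dependency edges is enough here; a production CMDB would also track relationship types and change history.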

Event Management

Event management monitors and manages events throughout the IT environment to ensure normal operations and detect exceptions.

Fault Management: Classification and Handling Process

Effective fault management relies on a clear process tied to business continuity and disaster recovery planning (BCM, BCP, DRP). Key steps include:

  • Categorizing faults by severity and impact.
  • Defining escalation paths.
  • Implementing resolution steps and post-incident reviews.
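The first two steps can be made concrete with a small lookup table. The following is only a sketch: the impact/urgency matrix, the P1 to P4 labels, and the escalation thresholds are illustrative assumptions, not values taken from ITIL or from this article.

```python
# Hypothetical severity matrix: (impact, urgency) -> severity level.
SEVERITY = {
    ("high", "high"): "P1",
    ("high", "low"): "P2",
    ("low", "high"): "P3",
    ("low", "low"): "P4",
}

# Hypothetical escalation paths: (minutes unresolved, tier engaged from then on).
ESCALATION = {
    "P1": [(0, "L1"), (15, "L2"), (30, "L3")],
    "P2": [(0, "L1"), (60, "L2"), (240, "L3")],
    "P3": [(0, "L1"), (480, "L2")],
    "P4": [(0, "L1")],
}


def classify(impact, urgency):
    """Map an impact/urgency pair to a severity level."""
    try:
        return SEVERITY[(impact, urgency)]
    except KeyError:
        raise ValueError(f"unknown impact/urgency: {impact!r}, {urgency!r}") from None


def tiers_engaged(severity, minutes_open):
    """Return the support tiers that should be involved after `minutes_open`."""
    return [tier for threshold, tier in ESCALATION[severity] if minutes_open >= threshold]


severity = classify("high", "high")
print(severity, tiers_engaged(severity, minutes_open=20))
```

Keeping the matrix and thresholds as data rather than code makes them easy to review in a post-incident meeting and to adjust without touching the logic.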

Problem Management

Problem management goes beyond individual incidents to identify root causes and prevent recurrence.

Implementing Standardization and Automation

Moving from theory to practice requires a focus on standardization and automation. A standard operational checklist includes:

  • CMDB asset inventory and backup procedures.
  • High availability strategies: rate limiting, graceful degradation, circuit breakers, chaos engineering, and rollback plans.
  • Monitoring and observability: alerts, log aggregation, distributed tracing, and capacity planning.
  • Change delivery: CI/CD pipelines, canary releases, and gradual rollouts.
  • System installation: standardized setup processes and middleware management.
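Of the high availability strategies in the checklist above, the circuit breaker is easy to illustrate in a few lines. This is a minimal sketch under simplifying assumptions (single-threaded, one trial call after the timeout); the class and parameter names are hypothetical and do not come from any particular library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open state, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a flaky downstream dependency in `breaker.call(...)` lets the caller fail fast while the dependency recovers, instead of piling up slow, doomed requests.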

Operational Maturity Model

The target operational evolution follows a staged path:

  1. Standardization (establishing conventions)
  2. Automation
  3. Intelligence (data-driven decisions)
  4. Visualization and data-driven operations (SRE, SLAs)

Management Domains

  • Service Catalog & Request Fulfillment
  • Change and Configuration Management
  • Event, Fault, and Problem Management

Team Structure

  • L1: Service desk and straightforward request fulfillment.
  • L2: Generalist engineers handling complex issues.
  • L3: Specialists focused on specific applications or domains.

Operational Goals

The primary goal is to support continuous business iteration and growth while maintaining stability, efficiency, and cost-effectiveness. Key soft skills include communication, coordination, information synchronization, and a results-oriented mindset.

Problem-Solving Approach

Break down complex issues by systematically identifying what the problem is, decomposing it, and tackling each part.

Core Objectives of Operations

  • Supporting business functions
  • Ensuring high availability
  • Designing for scalability
  • Reducing costs while increasing efficiency
  • Ensuring high performance

Operational Components: People, Tasks, and Tools

Effective operations management involves managing people, tasks, relationships, structures, processes, controls, acceptance criteria, and milestones.

Essential Operational Tasks

  1. CMDB & Asset Configuration Management
  2. Unified Authentication: Internal services, SSO, OpenLDAP.
  3. Email Systems
  4. Network Infrastructure: Intranet, VPN, jump hosts, bastion hosts.
  5. DNS Management
  6. Desktop Support: Printers, projectors, meeting rooms, office software.
  7. Documentation Wiki
  8. Project & Bug Tracking: JIRA.
  9. Service Tree Mapping
  10. Unified Monitoring & Logging Platforms
  11. Ticketing Platforms
  12. Office Automation (OA) Systems
  13. Backup & Recovery
  14. Hardware Procurement
  15. Code Hosting
  16. Firewall Management
  17. Middleware & Components: Nginx, Redis, databases, load balancers.
  18. Release & DevOps Platforms: CI/CD.
  19. Cloud Platforms: IaaS, PaaS, SaaS, containers, Kubernetes.

Advanced Platforms and Tools

Beyond the essentials, a mature operation typically integrates:

  • Push notification platform (SMS, email, Lark, DingTalk, WeChat)
  • Automated testing platform
  • Performance testing (stress test) platform
  • Chaos engineering platform
  • API Gateway
  • Database Management System (DBMS)
  • Configuration management platform
  • Data backbone: MQ, data links, Elasticsearch, HBase, Hadoop
  • Scheduled task system
  • Service registry
  • Artifact repositories: Maven, npm, Flutter, container image registries
  • Data warehousing, BI, and reporting platforms
  • Product prototyping system
  • Secrets management (e.g., KeePass)

Understanding High Availability (The "Nines")

System reliability is often measured in "nines" of availability, representing the percentage of uptime over a year.

  • 3 nines (99.9%): Allows for up to 8.76 hours of downtime annually.
  • 4 nines (99.99%): Allows for 52.6 minutes of downtime annually.
  • 5 nines (99.999%): Allows for 5.26 minutes of downtime annually.

A broader comparison of availability levels:

  • 1 nine (90%): 36.5 days of downtime.
  • 2 nines (99%): 3.65 days of downtime.
  • 6 nines (99.9999%): About 31.5 seconds of downtime.

While six nines is technically achievable, the exponential cost increase from five to six nines often makes it impractical for most business cases.

Availability  Nines  Annual Downtime (approx. minutes)  Typical Application
0.999         3      500                                PCs / Servers
0.9999        4      50                                 Enterprise Devices
0.99999       5      5                                  Telecom Equipment
0.999999      6      0.5                                High-End Telecom
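The downtime figures in this section follow directly from the availability percentage. As a quick check, the sketch below computes the annual downtime budget for N nines, assuming a 365-day year; the function name is hypothetical.

```python
def annual_downtime_seconds(nines, days_per_year=365):
    """Allowed downtime per year, in seconds, for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return unavailability * days_per_year * 24 * 3600


for n in range(1, 7):
    seconds = annual_downtime_seconds(n)
    print(f"{n} nines: {seconds / 3600:.2f} hours = {seconds / 60:.2f} minutes per year")
```

For three nines this gives 0.001 × 8,760 hours = 8.76 hours, matching the figure quoted above; each additional nine divides the budget by ten.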

A Comprehensive Operational Model

A holistic model for IT operations covers business, product, and technical architecture.

Static Foundation

  • Technology Architecture: Static analysis and design.
  • Technical Environment: Asset management, credentials management, CMDB, and resource topology.

Dynamic Operations

  • Installation and Standardization: Best practices for metrics (counters, timers, capacity, stress testing).
  • Monitoring and Observability
  • High Availability Strategies: Failover, replication, rate limiting, degradation, circuit breaking, chaos engineering, rollback.
  • Backup and Recovery Drills
  • Fault Handling
  • Release Management: DevOps pipelines, canary releases, low-traffic testing.
  • Routine Inspections
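Rate limiting, listed among the high availability strategies above, is often implemented with a token bucket. The following is a minimal single-threaded sketch, not a production limiter; the class name and parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.clock = clock                # injectable for testing
        self.tokens = float(capacity)     # start with a full bucket
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


limiter = TokenBucket(rate=5, capacity=10)
print(limiter.allow())
```

Requests rejected by the limiter are a natural trigger for graceful degradation, for example serving a cached response instead of failing outright.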

Process Management

  • Service Requests: Standardized workflows, SSO/LDAP, JIRA, Wiki, network security, VPN, bastion hosts.
  • Change Management: Ticketing, announcement, and scope review.
  • Configuration Management
  • Event, Fault, and Problem Management

Team Tiers

  • L1: Handles simple tasks with direct, immediate results.
  • L2: Manages comprehensive, cross-domain issues.
  • L3: Focuses on deep, domain-specific solutions.

Maturation Trajectory

Standardization → Process formalization → Automation → Platform building → Intelligence → Visualization → Data-driven operations (intelligent monitoring and analytics)

Key Artifacts

  • Documentation of service relationships and call chains, mapping business functions to technical systems.
  • Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs).
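To show how an SLI relates to an SLO, the sketch below computes a request-based availability SLI and the share of the resulting error budget that has been consumed. The function name, the returned keys, and the example numbers are illustrative assumptions.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize how much of the error budget implied by an SLO has been used."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    sli = 1 - failed_requests / total_requests  # measured success ratio
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": sli >= slo_target,
    }


# Hypothetical month: 1,000,000 requests, 400 failures, SLO of 99.9%.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

In this example the SLO permits roughly 1,000 failed requests, so 400 failures consume about 40% of the budget. An SLA would typically be set looser than the internal SLO to leave a safety margin.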

Project Orientation and Assessment

When taking over a system, a thorough assessment is the first step to ensure maintainability. This typically includes:

  • Project overview
  • Credentials and account table
  • Resource and configuration list
  • Various architecture and flow diagrams
  • Deployment and maintenance documentation
  • Monitoring strategy summary
  • Emergency operations manual

1. Project Overview

Understand the business scope (what the project does) and identify the project lead and relevant personnel for smooth handover.

2. Credentials Table

A detailed record of all server, platform, and system usernames and passwords. An associated permissions log should track who has access to what.

3. Resource and Configuration List

This should encompass multiple views, such as:

  • Server inventory
  • Project domain names (all environments, e.g., production, staging)
  • Project services
  • Third-party service resources

4. Architecture and Flow Diagrams

Create and verify diagrams such as the network topology and system architecture. Together they should show the number of servers, the logical flow (e.g., from DNS and DDoS protection to load balancers and downstream services), and how services interact with databases and with each other.

5. Deployment and Maintenance Documentation

A detailed guide with step-by-step installation instructions and clear configuration points (like database addresses). The document counts as validated only once a colleague can use it to independently restore or replicate the project.

6. Monitoring Strategy Summary

This should cover both base infrastructure monitoring (CPU, memory, bandwidth) and business-specific monitoring that directly exposes service health. Laying both out in a single summary table makes monitoring gaps easier to spot and fill.
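Finding gaps in such a summary can be automated once the coverage is written down as data. The sketch below assumes a hypothetical set of required checks and hypothetical service names; it simply reports which checks each service is missing.

```python
# Hypothetical baseline: infrastructure checks plus one business-level health check.
REQUIRED_CHECKS = {"cpu", "memory", "bandwidth", "business_health"}


def monitoring_gaps(coverage):
    """Map each service to the sorted list of required checks it is missing."""
    gaps = {}
    for service, checks in coverage.items():
        missing = REQUIRED_CHECKS - set(checks)
        if missing:
            gaps[service] = sorted(missing)
    return gaps


# Hypothetical coverage table gathered during the takeover assessment.
coverage = {
    "order-api": ["cpu", "memory", "bandwidth", "business_health"],
    "report-job": ["cpu", "memory"],
}
print(monitoring_gaps(coverage))
```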

7. Emergency Operations Manual

A playbook for known and frequently occurring issues, detailing their immediate solutions and workarounds.

Building Standards and Processes

After mapping the system, the next step is to enforce standardization. Key conventions include:

  • Naming conventions: For services, based on project environment or purpose.
  • Port specifications: Allocated by purpose and environment.
  • IP address planning: Segmented by project, environment, and region.
  • Directory structures: Fixed paths for service deployment, log output, backups, toolkits, data, and scripts.
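Conventions like these are easiest to enforce when they can be checked by a script. The sketch below assumes one possible naming scheme, `<project>-<env>-<role>-<index>`, and a fixed directory layout under `/data`; both are hypothetical examples, not a standard prescribed by this article.

```python
import re

# Hypothetical convention: <project>-<env>-<role>-<two-digit index>, e.g. shop-prod-web-01.
NAME_PATTERN = re.compile(
    r"(?P<project>[a-z][a-z0-9]*)-"
    r"(?P<env>dev|staging|prod)-"
    r"(?P<role>[a-z][a-z0-9]*)-"
    r"(?P<index>[0-9]{2})"
)

# Hypothetical fixed directory layout, one path per purpose.
DIRECTORY_LAYOUT = {
    "deploy": "/data/apps/{name}",
    "logs": "/data/logs/{name}",
    "backups": "/data/backups/{name}",
}


def parse_service_name(name):
    """Split a conforming service name into its parts, or raise ValueError."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return match.groupdict()


def standard_paths(name):
    """Return the fixed paths for a service, rejecting nonconforming names first."""
    parse_service_name(name)
    return {purpose: template.format(name=name) for purpose, template in DIRECTORY_LAYOUT.items()}


print(parse_service_name("shop-prod-web-01"))
print(standard_paths("shop-prod-web-01")["logs"])
```

A check like this can run in a provisioning pipeline so that nonconforming names are rejected before a server or service is ever created.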

Operational Processes

Repeatable tasks must be formalized into processes:

  • Resource budget planning
  • Server purchasing and provisioning
  • Service deployment, launch, and maintenance
  • Account and permission provisioning
  • Regular backup restoration and verification drills
  • Periodic resource usage review
  • Incident report management

The goal is to standardize every resource and turn every repetitive action into a defined process. This prevents oversights and forms the foundation for automation.
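As one example of turning a repeatable task into a checkable process, a backup restoration drill can end with an automated comparison of the restored data against the original. The sketch below compares SHA-256 checksums of two files; the helper names and the temporary example files are hypothetical, and a real drill would restore from the actual backup system first.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """Return True when the restored file is byte-identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)


# Stand-in for a real drill: two temporary files with identical content.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "db.dump")
    restored = os.path.join(workdir, "db.restored")
    for path in (original, restored):
        with open(path, "wb") as handle:
            handle.write(b"example backup payload")
    drill_passed = restore_matches(original, restored)

print(drill_passed)
```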

Change Management in Practice

A common pitfall is that operational changes receive far less rigor than software development, where design discussions are the norm. For example, modifying a security group should follow a strict process, not an informal notification. Rules must be established to make all changes auditable and reversible.
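"Auditable and reversible" can be read as two concrete requirements: every change records who made it and why, and every change stores enough of the previous state to undo it. The sketch below illustrates this on a plain dictionary standing in for a security group configuration; the class, its fields, and the example rule strings are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChangeLog:
    """Apply changes to a config dict while keeping an audit trail for rollback."""

    entries: list = field(default_factory=list)

    def apply(self, state, key, new_value, author, reason):
        """Change `state[key]` and record the previous value so it can be undone."""
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "author": author,
            "reason": reason,
            "key": key,
            "had_key": key in state,
            "old": state.get(key),
            "new": new_value,
            "reverted": False,
        })
        state[key] = new_value

    def rollback_last(self, state):
        """Undo the most recent change that has not been reverted yet."""
        for entry in reversed(self.entries):
            if not entry["reverted"]:
                if entry["had_key"]:
                    state[entry["key"]] = entry["old"]
                else:
                    state.pop(entry["key"], None)
                entry["reverted"] = True  # keep the entry: the audit trail is never deleted
                return entry
        raise LookupError("no change left to roll back")


# Hypothetical security group rules.
security_group = {"sg-web:inbound": ["443/tcp"]}
log = ChangeLog()
log.apply(security_group, "sg-web:inbound", ["443/tcp", "22/tcp"],
          author="alice", reason="temporary SSH access for debugging")
log.rollback_last(security_group)
print(security_group)
```

Rolled-back entries are marked rather than removed, so the log still shows that the change happened and that it was reverted.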

Managing the Three Pillars: People, Tasks, and Assets

  1. People: Define roles, career paths, and skill development. Foster a collaborative team with high technical skill and professional ethics through training and performance reviews.
  2. Tasks: Execute daily operations to protect the production environment. Continuously explore new concepts and optimize system architecture. This includes process management, capacity planning, emergency response, monitoring, and security.
  3. Assets: Manage all physical and logical resources (data centers, servers, networks, software). The goal is clear configuration and lifecycle management: knowing where an asset comes from, where it is, and where it is going.

Processes and standards are the glue that binds these three pillars (people, tasks, and assets) into a smooth, efficient, and stable operational stream.

Core Competency Model for Operations Teams

  • Standards Implementation: Grounding operations in frameworks like ITIL, ISO20000, ITSS, and compliance mandates (e.g., China's Multi-Level Protection Scheme).
  • Fundamental Guarantees: Configuration management, monitoring, app release, resource scaling, event/incident handling.
  • Basic Technical Skills: Proficiency in networks, servers, OS, databases, middleware, JVM tuning.
  • Business Service: SLA management, service desk, business consulting, experience databases.
  • Availability Management: Routine inspections, business continuity, high-availability architecture, spare parts redundancy.
  • Risk and Security: Operational auditing, regulatory risk, vulnerability and attack management.
  • Incident and Problem Management
  • Continuous Delivery: Application changes, infrastructure delivery, office services.
  • Proactive Optimization: Architecture, performance, and user experience enhancements.
  • Emergency Drills: Testing high-availability setups, incident response plans, and team readiness.
  • Business Support: Data maintenance, extraction, and parameter management.
  • Operational Analysis: Capacity, performance, and availability analytics.
  • Operational Enablement: Identifying and solving business pain points, enhancing customer experience.
  • Cost Control: Optimizing spend on personnel, hardware, bandwidth, and software.
  • Platform Engineering: Building internal automation tools and cultivating a DevOps culture.

Team Communication and Continuous Improvement

Structured communication is vital for an operations team:

  • Daily Stand-up: 10 minutes.
  • Evening Sync-up: 30 minutes.
  • Weekly Review

The Deming Cycle (PDCA) offers a practical model for achieving sustainable improvement in an IT operations system:

  • Plan: Define objectives and processes.
  • Do: Implement the plan.
  • Check: Monitor and measure results.
  • Act: Take action to standardize or improve.

Tags: IT Operations ITIL devops SRE High Availability

Posted on Wed, 13 May 2026 17:03:31 +0000 by kante