AIOps Practices
AIOps applies artificial intelligence to IT operations data — events, logs, metrics, and traces — to detect problems faster, correlate signals intelligently, and automate remediation at a scale no human team can match alone.
The AIOps data pipeline
AIOps operates on observability data produced by your infrastructure, applications, and services:
Raw signals (metrics, logs, events, traces)
→ Ingestion & normalisation
→ Anomaly detection
→ Event correlation
→ Root cause analysis
→ Automated remediation OR human alert (ranked, enriched)

| Layer | Description |
|---|---|
| Data ingestion | Collect metrics, logs, events, traces from monitoring tools, CMDB, and change management |
| Normalisation | Standardise formats, deduplicate, and enrich with CMDB context (which CI, which service, which team) |
| Anomaly detection | Identify deviations from baseline using statistical and ML models |
| Correlation | Group related alerts into a single incident topology — reduce 1,000 alerts to 3 actionable incidents |
| Root cause analysis | Trace the incident to its origin using topology graphs, historical patterns, and change correlation |
| Remediation | Trigger automated runbooks for known error patterns; route novel incidents to the right human team |
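
The normalisation layer can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `CMDB` dictionary, the raw field names (`host`, `ci`, `check`, `alert_name`), and the `Event` schema are all hypothetical stand-ins for whatever your monitoring tools and CMDB actually expose.

```python
from dataclasses import dataclass

# Hypothetical CMDB extract: CI name -> owning service and team.
CMDB = {
    "web-01": {"service": "checkout", "team": "payments-sre"},
    "db-01": {"service": "orders", "team": "data-platform"},
}

@dataclass(frozen=True)
class Event:
    ci: str
    check: str
    severity: str
    service: str
    team: str

def normalise(raw_events):
    """Map raw events to one schema, drop duplicates, and add CMDB context."""
    seen = set()
    result = []
    for raw in raw_events:
        # Different tools name the same field differently (assumed field names).
        ci = (raw.get("host") or raw.get("ci") or "").lower()
        check = (raw.get("check") or raw.get("alert_name") or "unknown").lower()
        severity = str(raw.get("severity", "info")).lower()
        key = (ci, check, severity)
        if key in seen:  # identical signal already recorded
            continue
        seen.add(key)
        context = CMDB.get(ci, {"service": "unknown", "team": "unassigned"})
        result.append(Event(ci, check, severity, context["service"], context["team"]))
    return result

raw = [
    {"host": "WEB-01", "check": "cpu_high", "severity": "CRITICAL"},
    {"ci": "web-01", "alert_name": "CPU_HIGH", "severity": "critical"},
    {"host": "db-01", "check": "disk_full", "severity": "warning"},
]
events = normalise(raw)
```

The first two raw events describe the same signal reported by two tools, so only one normalised event survives, already tagged with its service and owning team.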
Core AIOps capabilities
1. Anomaly detection
Traditional monitoring uses static thresholds. AIOps uses dynamic baselines:
- Metric anomaly detection: detects when a KPI deviates from its normal range, accounting for time-of-day patterns, weekly seasonality, and trend
- Log anomaly detection: identifies unusual log patterns or new error signatures without requiring explicit rules
- KPI forecasting: predicts future metric values and alerts before a threshold is breached
Why static thresholds fail: 90% CPU utilisation at 2 AM on a batch processing server is normal. The same reading at 2 PM on a web server is critical. Static rules cannot capture that context.
Historical metric data → Seasonal decomposition → Dynamic baseline
→ Real-time deviation score → Alert if above threshold

Tools: Dynatrace Davis AI, Datadog Watchdog, Splunk ITSI, New Relic NRQL Anomaly Detection, AWS DevOps Guru.
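
To make the dynamic-baseline idea concrete, here is a deliberately simplified sketch. It swaps full seasonal decomposition for a per-hour-of-day mean and standard deviation, and scores new readings as a z-score against that hour's baseline. The sample data, the `THRESHOLD` value, and the function names are assumptions for illustration; commercial tools use more sophisticated models.

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baseline(history):
    """history: iterable of (hour_of_day, value). Returns {hour: (mean, std)}."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    # stdev needs at least two samples per bucket
    return {h: (mean(v), stdev(v)) for h, v in buckets.items() if len(v) >= 2}

def deviation_score(baseline, hour, value):
    """Absolute z-score of a reading against the baseline for that hour."""
    mu, sigma = baseline[hour]
    return abs(value - mu) / max(sigma, 1e-9)

# Synthetic CPU history: busy batch window at 02:00, quiet at 14:00.
history = [(2, v) for v in (88, 90, 92, 91, 89)] + [(14, v) for v in (30, 32, 28, 31, 29)]
baseline = build_baseline(history)

THRESHOLD = 3.0  # assumed alerting threshold, in standard deviations
night_score = deviation_score(baseline, 2, 90)   # 90% CPU at 02:00
day_score = deviation_score(baseline, 14, 90)    # 90% CPU at 14:00
alerts = [hour for hour, score in ((2, night_score), (14, day_score)) if score > THRESHOLD]
```

The same 90% reading scores near zero in the batch window and far above the threshold in the afternoon, which is exactly the context a static rule cannot express.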
2. Event correlation and noise reduction
Large enterprises generate millions of monitoring events per day. AIOps correlation:
- Groups related alerts from different monitoring tools into a single incident
- Suppresses noise — derivative alerts triggered by a single root cause
- Enriches each incident with CMDB data (owning team, service tier, business impact)
- Ranks incidents by business impact, not just technical severity
The result: an SRE sees a handful of high-fidelity incidents instead of thousands of raw alerts.
| Without AIOps | With AIOps |
|---|---|
| 5,000 raw alerts per hour | 12 correlated incidents per hour |
| Alert fatigue → missed P1s | High-SNR queue → every item is actionable |
| No context on business impact | Each incident linked to affected service and owning team |
| Manual correlation takes 30 minutes | Correlation runs in seconds |
Tools: BigPanda, Moogsoft, ServiceNow Event Management, PagerDuty AIOps, IBM Instana.
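
A toy version of correlation shows the mechanics. This sketch groups alerts that hit the same business service within a sliding time window; the `CI_TO_SERVICE` mapping, the 300-second window, and the `Alert`/`Incident` types are hypothetical, and real products also use topology and learned co-occurrence rather than a single rule.

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    ts: int        # seconds since an arbitrary start
    ci: str
    message: str

@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)

# Hypothetical CMDB mapping from CI to the business service it supports.
CI_TO_SERVICE = {"web-01": "checkout", "web-02": "checkout",
                 "db-01": "checkout", "mq-01": "billing"}
WINDOW_SECONDS = 300  # assumed correlation window

def correlate(alerts):
    """Group alerts for the same service that arrive within the window."""
    incidents = []
    open_by_service = {}  # service -> (incident, timestamp of its latest alert)
    for alert in sorted(alerts, key=lambda a: a.ts):
        service = CI_TO_SERVICE.get(alert.ci, "unknown")
        current = open_by_service.get(service)
        if current and alert.ts - current[1] <= WINDOW_SECONDS:
            incident = current[0]
        else:
            incident = Incident(service)
            incidents.append(incident)
        incident.alerts.append(alert)
        open_by_service[service] = (incident, alert.ts)
    return incidents

incidents = correlate([
    Alert(0, "db-01", "connection refused"),
    Alert(20, "web-01", "HTTP 5xx spike"),
    Alert(45, "web-02", "HTTP 5xx spike"),
    Alert(60, "mq-01", "queue depth high"),
    Alert(1000, "web-01", "HTTP 5xx spike"),
])
```

Five alerts collapse into three incidents: the database and web alerts become one checkout incident, while the later web alert falls outside the window and opens a new one.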
3. Root cause analysis (RCA)
Identifying the root cause of an incident in a microservices environment is hard. AIOps automates much of this work through:
- Topology-aware correlation: maps the relationship between infrastructure CIs, services, and applications (using the CMDB or auto-discovered dependency maps)
- Change correlation: checks whether a recent deployment, configuration change, or infrastructure modification coincides with the incident start
- Pattern matching: compares current incident signatures against historical incidents with known causes and resolutions
- Blast radius mapping: shows which other services are likely to be affected
Incident detected → Query CMDB topology → Identify impacted CIs
→ Check recent changes → Match historical patterns
→ Ranked list of probable root causes with confidence scores

AIOps RCA does not eliminate human judgement; it focuses the expert's attention on the most probable cause rather than requiring them to search blind.
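
The RCA flow above can be approximated with a simple scoring heuristic. Everything here is an assumption made for illustration: the dependency map, the change records, the historical counts, and the weights (3.0 for a recent change, 0.5 per past confirmed cause). The normalised scores are relative weights, not calibrated probabilities.

```python
# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "checkout-ui": ["checkout-api"],
    "checkout-api": ["orders-db", "payments-gw"],
    "orders-db": [],
    "payments-gw": [],
}
# Hypothetical change records: (CI, minutes before the incident started).
RECENT_CHANGES = [("orders-db", 12), ("payments-gw", 900)]
# Hypothetical counts of past incidents where the CI was the confirmed cause.
HISTORICAL_CAUSES = {"orders-db": 4, "checkout-api": 1}

def upstream(ci):
    """Return ci plus everything it transitively depends on."""
    seen, stack = [], [ci]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(DEPENDS_ON.get(node, []))
    return seen

def rank_root_causes(impacted_ci, change_window_minutes=60):
    """Score each upstream CI and return (ci, relative weight), highest first."""
    scores = {}
    for ci in upstream(impacted_ci):
        score = 1.0  # every upstream CI is a candidate
        if any(c == ci and age <= change_window_minutes for c, age in RECENT_CHANGES):
            score += 3.0  # a change just before the incident is strong evidence
        score += 0.5 * HISTORICAL_CAUSES.get(ci, 0)
        scores[ci] = score
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(ci, round(score / total, 2)) for ci, score in ranked]

ranking = rank_root_causes("checkout-ui")
```

With this data `orders-db` ranks first because it was changed 12 minutes before the incident and has caused similar incidents before; the change to `payments-gw` is too old to count.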
4. Predictive alerting and capacity forecasting
AIOps looks ahead, not just at what's happening now:
- Capacity forecasting: predicts when a storage volume, memory pool, or database connection pool will be exhausted
- Degradation prediction: detects early signals of service deterioration (rising error rates, increasing p99 latency) before they breach SLA thresholds
- Maintenance window optimisation: recommends the lowest-risk time window for planned changes based on historical traffic patterns
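
Capacity forecasting in its simplest form is a trend line extrapolated to the capacity limit. The sketch below fits an ordinary least-squares line to daily usage samples and estimates the days remaining; the sample figures and function names are illustrative assumptions, and a linear trend is only one of many possible models.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def days_until_exhausted(days, used_gb, capacity_gb):
    """Days from the last sample until the fitted trend reaches capacity."""
    slope, intercept = linear_fit(days, used_gb)
    if slope <= 0:
        return None  # usage is flat or shrinking; no exhaustion predicted
    full_on_day = (capacity_gb - intercept) / slope
    return max(full_on_day - days[-1], 0.0)

# Synthetic volume growing 10 GB per day towards a 600 GB limit.
days = [0, 1, 2, 3, 4]
used_gb = [500, 510, 520, 530, 540]
remaining = days_until_exhausted(days, used_gb, capacity_gb=600)
```

An alert can then fire when `remaining` drops below the lead time needed to provision more capacity, rather than when the volume is already full.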
5. Automated remediation
The highest-value AIOps capability — and the one that requires the most governance:
| Automation level | Description | ITIL 5 change type |
|---|---|---|
| Diagnostics | Automatically gather logs, traces, and CMDB data when an incident opens | No change — information only |
| Runbook automation | Execute predefined runbooks for known error patterns (restart service, clear cache, scale pod) | Standard change — pre-approved |
| Autonomous remediation | AI selects and executes the remediation without human approval | Standard change — requires strong governance |
| Human-in-the-loop | AI recommends an action; human approves before execution | Normal change — lightweight approval |
Autonomous remediation without proper governance can make incidents worse. Apply ITIL 5 change enablement principles: automate standard changes first, preserve human approval gates for anything with meaningful blast radius.
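
One way to encode that governance rule is a small gate in front of the automation engine. This is a sketch under stated assumptions: the runbook names come from the table above, while `MAX_AUTO_BLAST_RADIUS`, the `Decision` enum, and the `remediation_gate` function are hypothetical.

```python
from enum import Enum

class Decision(Enum):
    EXECUTE = "execute automatically"
    REQUEST_APPROVAL = "wait for human approval"
    ROUTE_TO_TEAM = "route to owning team"

# Hypothetical catalogue of pre-approved (standard change) runbooks.
PRE_APPROVED_RUNBOOKS = {"restart_service", "clear_cache", "scale_pod"}
MAX_AUTO_BLAST_RADIUS = 1  # assumed limit on the number of services affected

def remediation_gate(runbook, blast_radius):
    """Decide how a proposed remediation may proceed."""
    if runbook is None:
        return Decision.ROUTE_TO_TEAM  # novel incident, no known fix
    if runbook in PRE_APPROVED_RUNBOOKS and blast_radius <= MAX_AUTO_BLAST_RADIUS:
        return Decision.EXECUTE        # standard change with a small blast radius
    return Decision.REQUEST_APPROVAL   # anything else keeps a human in the loop

decisions = [
    remediation_gate("clear_cache", blast_radius=1),
    remediation_gate("restart_service", blast_radius=6),
    remediation_gate("failover_database", blast_radius=1),
    remediation_gate(None, blast_radius=2),
]
```

Keeping the gate separate from the runbooks themselves means the approval policy can be tightened or relaxed without touching the automation code.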
6. Change impact analysis
Before deploying a change, AIOps can assess its risk:
- Analyse historical incidents correlated with changes to similar CIs
- Map all services that depend on the CI being changed
- Score the change risk (probability of incident × business impact)
- Recommend scheduling the change in a low-risk window
- Flag changes that are likely to affect SLA-critical services
This directly supports ITIL 5's Change Enablement capability — using C4 (Cognition) to make change risk scoring data-driven rather than opinion-driven.
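
The risk formula above (probability of incident × business impact) can be sketched as follows. The 1 to 5 impact scale, the neutral prior of 0.5 for CIs with no history, and all function and parameter names are assumptions chosen for illustration.

```python
def incident_probability(past_outcomes):
    """Share of past changes to similar CIs that were followed by an incident."""
    if not past_outcomes:
        return 0.5  # assumed neutral prior when no history exists
    return sum(past_outcomes) / len(past_outcomes)

def business_impact(dependents, sla_critical):
    """Impact on an assumed 1-5 scale: more dependants and SLA exposure raise it."""
    touches_sla = any(service in sla_critical for service in dependents)
    return min(len(dependents), 3) + 1 + (1 if touches_sla else 0)

def change_risk(past_outcomes, dependents, sla_critical):
    """Combine probability and impact into a single risk score."""
    probability = incident_probability(past_outcomes)
    impact = business_impact(dependents, sla_critical)
    return {
        "probability": probability,
        "impact": impact,
        "score": probability * impact,
        "flag_sla": any(service in sla_critical for service in dependents),
    }

risk = change_risk(
    past_outcomes=[True, False, False, False],  # 1 of 4 similar changes caused an incident
    dependents=["checkout", "orders"],
    sla_critical={"checkout"},
)
```

Here the change scores 0.25 × 4 = 1.0 and is flagged because it touches an SLA-critical service, which is the kind of signal a change approver would want before picking a deployment window.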
AIOps and observability
AIOps requires a strong observability foundation. The three pillars:
| Pillar | Description | AIOps use |
|---|---|---|
| Metrics | Numerical time-series data (CPU, latency, error rate) | Anomaly detection, capacity forecasting |
| Logs | Structured or unstructured text records of events | Log anomaly detection, RCA pattern matching |
| Traces | End-to-end request journeys across distributed services | Latency attribution, dependency mapping |
Without sufficient observability data, AIOps models have nothing to learn from. Observability is the prerequisite, not the destination.
ITIL 5 connection: observability is formally defined in ITIL 5 as "the ability to understand the internal state of a complicated system by analysing its external outputs." It is a key input to the Operate and Support activities of the PSLM.
AIOps aligned to ITIL 5
| ITIL 5 element | AIOps contribution |
|---|---|
| Operate activity | Anomaly detection, predictive alerting, infrastructure monitoring |
| Support activity | Event correlation, RCA, automated runbook execution |
| Incident Management practice | AI-first triage, priority scoring, noise suppression |
| Problem Management practice | Pattern-based RCA, known error identification |
| Change Enablement practice | Change risk scoring, impact analysis, deployment windows |
| C2 — Curation | Alert filtering, noise reduction, signal prioritisation |
| C4 — Cognition | Root cause reasoning, change risk prediction, incident impact scoring |
| C6 — Coordination | Autonomous remediation workflows with approval gates |
Key AIOps metrics
| Metric | What it measures | Target |
|---|---|---|
| Alert-to-incident ratio | How effectively alerts are correlated into incidents | ≥ 50:1 (at least 50 alerts → 1 incident) |
| MTTD (Mean Time to Detect) | Time from issue start to alert firing | < 5 minutes for tier-1 services |
| MTTR (Mean Time to Restore) | Time from detection to service restoration | < 1 hour for tier-1 services |
| False positive rate | % of alerts that do not correspond to real issues | < 10% |
| Automation rate | % of incidents resolved without human intervention | ≥ 40% (standard changes) |
| Alert fatigue index | Ratio of actionable alerts to total alerts per operator per shift | > 60% actionable |
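
Several of these metrics fall out of basic incident records. The sketch below uses made-up numbers and a hypothetical record layout (timestamps in minutes from issue start) to show the arithmetic; it follows the table's definition of MTTR as time from detection to restoration.

```python
from statistics import mean

# Hypothetical incident records for one reporting period.
incident_log = [
    {"started": 0, "detected": 3, "restored": 40, "alerts": 120, "auto_resolved": True},
    {"started": 0, "detected": 6, "restored": 55, "alerts": 80, "auto_resolved": False},
    {"started": 0, "detected": 3, "restored": 25, "alerts": 100, "auto_resolved": False},
]
false_positive_alerts = 30  # alerts with no real underlying issue (assumed labelled)
total_alerts = sum(i["alerts"] for i in incident_log) + false_positive_alerts

# Alerts raised per incident opened
alerts_per_incident = total_alerts / len(incident_log)
# Mean Time to Detect: issue start -> alert firing
mttd_minutes = mean(i["detected"] - i["started"] for i in incident_log)
# Mean Time to Restore: detection -> service restoration
mttr_minutes = mean(i["restored"] - i["detected"] for i in incident_log)
false_positive_rate = false_positive_alerts / total_alerts
automation_rate = sum(i["auto_resolved"] for i in incident_log) / len(incident_log)
```

With this sample data the period shows 110 alerts per incident, an MTTD of 4 minutes, an MTTR of 36 minutes, a false positive rate of about 9%, and an automation rate of one in three.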
AIOps implementation roadmap
Phase 1 — Observe (0–3 months)
- Centralise observability data (metrics, logs, events) into a single platform
- Connect your CMDB to your event management tool
- Establish baseline metrics for all tier-1 services
Phase 2 — Correlate (3–6 months)
- Enable event correlation and noise reduction
- Start RCA correlation with change management data
- Train your team on triaging AIOps-correlated incidents
Phase 3 — Predict (6–12 months)
- Activate anomaly detection with dynamic baselines
- Enable capacity forecasting for critical infrastructure
- Integrate change risk scoring into your change enablement process
Phase 4 — Automate (12+ months)
- Implement runbook automation for the top 20 most common incident types
- Require human-in-the-loop approval for AI-recommended actions before graduating any of them to autonomous remediation
- Build a continuous improvement loop: feed every false positive back into model tuning