AIOps Practices

AIOps applies artificial intelligence to IT operations data — events, logs, metrics, and traces — to detect problems faster, correlate signals intelligently, and automate remediation at a scale no human team can match alone.

The AIOps data pipeline

AIOps operates on observability data produced by your infrastructure, applications, and services:

Raw signals (metrics, logs, events, traces)
    → Ingestion & normalisation
    → Anomaly detection
    → Event correlation
    → Root cause analysis
    → Automated remediation OR human alert (ranked, enriched)

Layer	Description
Data ingestion	Collect metrics, logs, events, traces from monitoring tools, CMDB, and change management
Normalisation	Standardise formats, deduplicate, and enrich with CMDB context (which CI, which service, which team)
Anomaly detection	Identify deviations from baseline using statistical and ML models
Correlation	Group related alerts into a single incident topology — reduce 1,000 alerts to 3 actionable incidents
Root cause analysis	Trace the incident to its origin using topology graphs, historical patterns, and change correlation
Remediation	Trigger automated runbooks for known error patterns; route novel incidents to the right human team
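The normalisation layer can be sketched as follows. This is a minimal illustration, assuming two hypothetical monitoring tools that name their fields differently and an in-memory `CMDB` dictionary standing in for a real CMDB lookup; the field names, hosts, and teams are invented for the example.

```python
from dataclasses import dataclass

# Hypothetical CMDB extract: host -> owning service and team (illustrative data only).
CMDB = {
    "web-01": {"service": "checkout", "team": "payments"},
    "db-01": {"service": "orders-db", "team": "data-platform"},
}


@dataclass(frozen=True)
class Event:
    source: str   # monitoring tool that emitted the event
    host: str     # configuration item (CI) the event refers to
    check: str    # normalised check name, e.g. "cpu_high"
    service: str  # enriched from the CMDB
    team: str     # enriched from the CMDB


def normalise(raw_events):
    """Map tool-specific payloads to one schema, enrich from the CMDB, drop duplicates."""
    seen = set()
    out = []
    for raw in raw_events:
        # Different tools use different field names for the same thing.
        host = (raw.get("host") or raw.get("hostname") or "").lower()
        check = (raw.get("check") or raw.get("alert_name") or "")
        check = check.strip().lower().replace(" ", "_")
        ci = CMDB.get(host, {"service": "unknown", "team": "unassigned"})
        key = (host, check)  # same CI + same check = duplicate
        if key in seen:
            continue
        seen.add(key)
        out.append(Event(raw.get("source", "unknown"), host, check, ci["service"], ci["team"]))
    return out


raw = [
    {"source": "toolA", "host": "WEB-01", "check": "CPU High"},
    {"source": "toolB", "hostname": "web-01", "alert_name": "cpu_high"},
    {"source": "toolA", "host": "db-01", "check": "Disk Full"},
]
events = normalise(raw)
```

Here the two CPU events from different tools collapse into one enriched event, so `events` holds two entries. A production pipeline would also deduplicate within a time window rather than for the whole run.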

Core AIOps capabilities

1. Anomaly detection

Traditional monitoring uses static thresholds. AIOps uses dynamic baselines:

Metric anomaly detection: detects when a KPI deviates from its normal range, accounting for time-of-day patterns, weekly seasonality, and trend
Log anomaly detection: identifies unusual log patterns or new error signatures without requiring explicit rules
KPI forecasting: predicts future metric values and alerts before a threshold is breached

Why static thresholds fail: 90% CPU utilisation at 2 AM on a batch-processing server is normal. The same reading at 2 PM on a web server is critical. A single static rule cannot capture that context.

Historical metric data → Seasonal decomposition → Dynamic baseline
    → Real-time deviation score → Alert if above threshold
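The flow above can be sketched in a few lines. This is a simplified illustration rather than any vendor's algorithm: a per-hour-of-day mean and standard deviation stand in for full seasonal decomposition, and the sample data and the `THRESHOLD` value are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baseline(history):
    """history: iterable of (hour_of_day, value). Returns {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    # stdev needs at least two samples per hour.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def deviation_score(baseline, hour, value):
    """Number of standard deviations between value and the baseline for that hour."""
    mu, sigma = baseline[hour]
    return abs(value - mu) / max(sigma, 1e-9)  # guard against a zero stdev


# Illustrative CPU history: busy around 02:00 (batch jobs), quiet around 14:00.
history = [(2, v) for v in (88, 90, 92, 91, 89)] + [(14, v) for v in (20, 22, 18, 21, 19)]
baseline = build_baseline(history)

THRESHOLD = 3.0  # assumed alerting threshold, in standard deviations
night = deviation_score(baseline, 2, 90)       # 90% CPU at 02:00
afternoon = deviation_score(baseline, 14, 90)  # 90% CPU at 14:00
```

The same 90% reading scores close to zero at 02:00 and far above `THRESHOLD` at 14:00, which is the behaviour a static threshold cannot reproduce.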

Tools: Dynatrace Davis AI, Datadog Watchdog, Splunk ITSI, New Relic NRQL Anomaly Detection, AWS DevOps Guru.
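Log anomaly detection without explicit rules can be approximated by reducing each log line to a template and flagging templates never seen before. The sketch below is a deliberately simple stand-in for the ML-based log clustering that commercial tools use; the masking rules and sample lines are assumptions.

```python
import re


def signature(line):
    """Reduce a log line to a template by masking variable tokens."""
    line = re.sub(r"\b[0-9a-f]{8,}\b", "<HEX>", line)   # long hex ids, hashes
    line = re.sub(r"\b\d+(\.\d+)*\b", "<NUM>", line)    # integers, decimals, IPs
    return line


def new_signatures(known, lines):
    """Return templates in `lines` that are not in `known`, in first-seen order."""
    seen = set(known)
    fresh = []
    for line in lines:
        sig = signature(line)
        if sig not in seen:
            seen.add(sig)
            fresh.append(sig)
    return fresh


# Templates learned from historical logs (illustrative).
known = {signature("GET /cart 200 in 35 ms"), signature("user 4411 logged in")}

incoming = [
    "GET /cart 200 in 41 ms",
    "user 9120 logged in",
    "payment gateway timeout after 3000 ms",
]
fresh = new_signatures(known, incoming)
```

Only the timeout line produces a template that was not in the known set. Note that this masking also hides status codes, so a real implementation would keep fields that carry meaning.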

2. Event correlation and noise reduction

Large enterprises generate millions of monitoring events per day. AIOps correlation:

Groups related alerts from different monitoring tools into a single incident
Suppresses noise — derivative alerts triggered by a single root cause
Enriches each incident with CMDB data (owning team, service tier, business impact)
Ranks incidents by business impact, not just technical severity

The result: an SRE sees about a dozen high-fidelity incidents per hour instead of 5,000 raw alerts.

Without AIOps	With AIOps
5,000 raw alerts per hour	12 correlated incidents per hour
Alert fatigue → missed P1s	High-SNR queue → every item is actionable
No context on business impact	Each incident linked to affected service and owning team
Manual correlation takes 30 minutes	Correlation runs in seconds
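The grouping step can be illustrated with a time-window heuristic: alerts for the same service that arrive within a short window of each other are folded into one incident. This is a minimal sketch, not how any listed product works; real correlation engines also use topology and learned patterns. The `WINDOW_SECONDS` value and alert fields are assumptions.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 300  # assumed correlation window


@dataclass
class Incident:
    service: str
    first_seen: int
    last_seen: int
    alerts: list = field(default_factory=list)


def correlate(alerts, window=WINDOW_SECONDS):
    """alerts: dicts with 'ts' (epoch seconds), 'service', 'name'. Returns incidents."""
    open_by_service = {}
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        inc = open_by_service.get(alert["service"])
        # Start a new incident if none is open or the last alert is too old.
        if inc is None or alert["ts"] - inc.last_seen > window:
            inc = Incident(alert["service"], alert["ts"], alert["ts"])
            open_by_service[alert["service"]] = inc
            incidents.append(inc)
        inc.last_seen = alert["ts"]
        inc.alerts.append(alert["name"])
    return incidents


alerts = [
    {"ts": 0, "service": "checkout", "name": "db_latency"},
    {"ts": 30, "service": "checkout", "name": "http_5xx"},
    {"ts": 45, "service": "search", "name": "queue_depth"},
    {"ts": 90, "service": "checkout", "name": "cpu_high"},
    {"ts": 1000, "service": "checkout", "name": "http_5xx"},
]
incidents = correlate(alerts)
```

Five alerts become three incidents: the first three `checkout` alerts are grouped, while the one at `ts=1000` falls outside the window and opens a new incident.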

Tools: BigPanda, Moogsoft, ServiceNow Event Management, PagerDuty AIOps, IBM Instana.

3. Root cause analysis (RCA)

Identifying the root cause of an incident in a microservices environment is hard. AIOps automates much of this work through:

Topology-aware correlation: maps the relationship between infrastructure CIs, services, and applications (using the CMDB or auto-discovered dependency maps)
Change correlation: checks whether a recent deployment, configuration change, or infrastructure modification coincides with the incident start
Pattern matching: compares current incident signatures against historical incidents with known causes and resolutions
Blast radius mapping: shows which other services are likely to be affected

Incident detected → Query CMDB topology → Identify impacted CIs
    → Check recent changes → Match historical patterns
    → Ranked list of probable root causes with confidence scores
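The ranking flow above can be sketched as a simple additive score. Everything here is illustrative: the `DEPENDS_ON` map is a hypothetical stand-in for CMDB topology, and the weights are arbitrary assumptions, so the resulting numbers are heuristic scores rather than calibrated confidence values.

```python
# Hypothetical dependency map: CI -> CIs it depends on.
DEPENDS_ON = {
    "checkout": ["payments-api", "orders-db"],
    "payments-api": ["orders-db"],
    "orders-db": [],
}


def upstream(ci, graph):
    """The CI itself plus everything it depends on, directly or transitively."""
    seen, stack = set(), [ci]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def rank_causes(impacted, incident_start, changes, past_causes, lookback=3600):
    """Score each candidate CI; the weights are illustrative, not tuned values."""
    scores = {}
    for ci in upstream(impacted, DEPENDS_ON):
        score = 0.1  # small prior for being on the dependency path
        if any(c["ci"] == ci and 0 <= incident_start - c["ts"] <= lookback for c in changes):
            score += 0.6  # a change landed shortly before the incident started
        if ci in past_causes:
            score += 0.3  # this CI caused similar incidents in the past
        scores[ci] = round(score, 2)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


changes = [{"ci": "orders-db", "ts": 9_400}]  # change 10 minutes before the incident
ranked = rank_causes("checkout", incident_start=10_000,
                     changes=changes, past_causes={"payments-api"})
```

With this input, `orders-db` ranks first because of the recent change, followed by `payments-api` on historical grounds.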

AIOps RCA does not eliminate human judgement — it focuses the expert's attention on the most probable cause rather than requiring them to search blind.

4. Predictive alerting and capacity forecasting

AIOps looks ahead, not just at what's happening now:

Capacity forecasting: predicts when a storage volume, memory pool, or database connection pool will be exhausted
Degradation prediction: detects early signals of service deterioration (rising error rates, increasing p99 latency) before they breach SLA thresholds
Maintenance window optimisation: recommends the lowest-risk time window for planned changes based on historical traffic patterns
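Capacity forecasting can be illustrated with the simplest possible model, a least-squares straight line through recent usage samples, extrapolated to the capacity limit. This is a sketch under the assumption of linear growth; production forecasters typically account for seasonality and uncertainty. The sample figures are invented.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def days_until_full(days, used_gb, capacity_gb):
    """Days from the last sample until the fitted trend reaches capacity."""
    slope, intercept = linear_fit(days, used_gb)
    if slope <= 0:
        return None  # usage is flat or shrinking: no exhaustion predicted
    full_day = (capacity_gb - intercept) / slope
    return max(0.0, full_day - days[-1])


# Illustrative storage volume growing by 10 GB per day.
days = [0, 1, 2, 3, 4]
used = [500, 510, 520, 530, 540]
remaining = days_until_full(days, used, capacity_gb=600)
```

For this volume the trend reaches 600 GB on day 10, so `remaining` is 6 days from the last sample, which is the lead time an alert would need to beat.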

5. Automated remediation

The highest-value AIOps capability — and the one that requires the most governance:

Automation level	Description	ITIL 5 change type
Diagnostics	Automatically gather logs, traces, and CMDB data when an incident opens	No change — information only
Runbook automation	Execute predefined runbooks for known error patterns (restart service, clear cache, scale pod)	Standard change — pre-approved
Autonomous remediation	AI selects and executes the remediation without human approval	Standard change — requires strong governance
Human-in-the-loop	AI recommends an action; human approves before execution	Normal change — lightweight approval

⚠️ Autonomous remediation without proper governance can make incidents worse. Apply ITIL 5 change enablement principles: automate standard changes first, and preserve human approval gates for anything with a meaningful blast radius.
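One way to encode that guidance is a small policy function that decides which approval path a proposed action takes. This is a hypothetical sketch: the runbook names echo the examples in the table, while `MAX_AUTO_BLAST_RADIUS` and the action identifiers are assumptions, not part of any ITIL or vendor specification.

```python
from enum import Enum


class Approval(Enum):
    NONE = "no change, information only"
    PRE_APPROVED = "standard change, pre-approved"
    HUMAN = "normal change, human approves before execution"


# Hypothetical catalogue of pre-approved runbooks.
STANDARD_RUNBOOKS = {"restart_service", "clear_cache", "scale_pod"}

# Assumed policy: run automatically only if at most one service is affected.
MAX_AUTO_BLAST_RADIUS = 1


def approval_for(action, blast_radius):
    """Return the approval path for an action affecting `blast_radius` services."""
    if action == "collect_diagnostics":
        return Approval.NONE
    if action in STANDARD_RUNBOOKS and blast_radius <= MAX_AUTO_BLAST_RADIUS:
        return Approval.PRE_APPROVED
    # Unknown actions and wide blast radius always go to a human.
    return Approval.HUMAN


decisions = {
    "diagnostics": approval_for("collect_diagnostics", blast_radius=0),
    "small_restart": approval_for("restart_service", blast_radius=1),
    "wide_restart": approval_for("restart_service", blast_radius=4),
    "novel_action": approval_for("failover_database", blast_radius=1),
}
```

The default branch is the important design choice: anything not explicitly pre-approved falls back to human approval rather than to automation.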

6. Change impact analysis

Before deploying a change, AIOps can assess its risk:

Analyse historical incidents correlated with changes to similar CIs
Map all services that depend on the CI being changed
Score the change risk (probability of incident × business impact)
Recommend scheduling the change in a low-risk window
Flag changes that are likely to affect SLA-critical services
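The dependency mapping and risk score in the list above can be sketched as follows. The `DEPENDENTS` map and `IMPACT` weights are hypothetical, and taking the highest impact among affected services is one assumed way to define business impact; the probability is simply the historical incident rate for similar changes.

```python
# Hypothetical reverse dependency map: CI -> services that depend on it directly.
DEPENDENTS = {
    "orders-db": ["payments-api", "reporting"],
    "payments-api": ["checkout"],
    "checkout": [],
    "reporting": [],
}

# Assumed business impact weights (1 = low, 5 = SLA-critical).
IMPACT = {"orders-db": 3, "payments-api": 4, "checkout": 5, "reporting": 1}


def affected_services(ci):
    """The CI plus everything that depends on it, directly or transitively."""
    seen, queue = {ci}, [ci]
    while queue:
        node = queue.pop(0)
        for dep in DEPENDENTS.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def change_risk(ci, past_changes, past_incidents):
    """Risk = historical incident probability x highest impact among affected services."""
    # With no history, assume the worst case rather than zero risk.
    probability = past_incidents / past_changes if past_changes else 1.0
    affected = affected_services(ci)
    impact = max(IMPACT.get(s, 1) for s in affected)
    return probability * impact, affected


# 4 of the last 40 similar changes caused an incident.
risk, affected = change_risk("orders-db", past_changes=40, past_incidents=4)
```

Because `checkout` sits downstream of `orders-db`, the change inherits the highest impact weight even though the database itself is rated lower.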

This directly supports ITIL 5's Change Enablement capability — using C4 (Cognition) to make change risk scoring data-driven rather than opinion-driven.

AIOps and observability

AIOps requires a strong observability foundation. The three pillars:

Pillar	Description	AIOps use
Metrics	Numerical time-series data (CPU, latency, error rate)	Anomaly detection, capacity forecasting
Logs	Structured or unstructured text records of events	Log anomaly detection, RCA pattern matching
Traces	End-to-end request journeys across distributed services	Latency attribution, dependency mapping

Without sufficient observability data, AIOps models have nothing to learn from. Observability is the prerequisite, not the destination.

ITIL 5 connection: observability is formally defined in ITIL 5 as "the ability to understand the internal state of a complicated system by analysing its external outputs." It is a key input to the Operate and Support activities of the PSLM.

AIOps aligned to ITIL 5

ITIL 5 element	AIOps contribution
Operate activity	Anomaly detection, predictive alerting, infrastructure monitoring
Support activity	Event correlation, RCA, automated runbook execution
Incident Management practice	AI-first triage, priority scoring, noise suppression
Problem Management practice	Pattern-based RCA, known error identification
Change Enablement practice	Change risk scoring, impact analysis, deployment windows
C2 — Curation	Alert filtering, noise reduction, signal prioritisation
C4 — Cognition	Root cause reasoning, change risk prediction, incident impact scoring
C6 — Coordination	Autonomous remediation workflows with approval gates

Key AIOps metrics

Metric	What it measures	Target
Alert-to-incident ratio	How effectively alerts are correlated into incidents	≥ 50:1 (at least 50 alerts → 1 incident)
MTTD (Mean Time to Detect)	Time from issue start to alert firing	< 5 minutes for tier-1 services
MTTR (Mean Time to Restore)	Time from detection to service restoration	< 1 hour for tier-1 services
False positive rate	% of alerts that do not correspond to real issues	< 10%
Automation rate	% of incidents resolved without human intervention	≥ 40% (standard changes)
Alert fatigue index	Ratio of actionable alerts to total alerts per operator per shift	> 60% actionable
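Several of these metrics can be computed directly from incident records. The sketch below assumes a hypothetical record layout with `started`, `detected`, and `restored` timestamps in minutes; the figures are invented to keep the arithmetic easy to follow.

```python
from statistics import mean

# Illustrative incident records; timestamps are minutes since an arbitrary origin.
incidents = [
    {"started": 0, "detected": 3, "restored": 33, "alerts": 120, "auto_resolved": True},
    {"started": 100, "detected": 107, "restored": 167, "alerts": 80, "auto_resolved": False},
]
total_alerts = 250          # all alerts fired in the period
false_positive_alerts = 50  # alerts that matched no real issue (250 - 120 - 80)

# MTTD: issue start to detection. MTTR: detection to restoration.
mttd = mean(i["detected"] - i["started"] for i in incidents)
mttr = mean(i["restored"] - i["detected"] for i in incidents)

# Average number of alerts folded into each incident.
alerts_per_incident = sum(i["alerts"] for i in incidents) / len(incidents)

false_positive_rate = false_positive_alerts / total_alerts
automation_rate = sum(i["auto_resolved"] for i in incidents) / len(incidents)
```

For this sample, MTTD is 5 minutes, MTTR is 45 minutes, each incident absorbs 100 alerts on average, the false positive rate is 20%, and half of the incidents were resolved automatically.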

AIOps implementation roadmap

Phase 1 — Observe (0–3 months)

Centralise observability data (metrics, logs, events) into a single platform
Connect your CMDB to your event management tool
Establish baseline metrics for all tier-1 services

Phase 2 — Correlate (3–6 months)

Enable event correlation and noise reduction
Start RCA correlation with change management data
Train your team on triaging AIOps-correlated incidents

Phase 3 — Predict (6–12 months)

Activate anomaly detection with dynamic baselines
Enable capacity forecasting for critical infrastructure
Integrate change risk scoring into your change enablement process

Phase 4 — Automate (12+ months)

Implement runbook automation for the top 20 most common incident types
Establish human-in-the-loop approval gates before expanding towards autonomous remediation
Build a continuous improvement loop: feed every false positive back into model tuning