AIOps Practices

AIOps applies artificial intelligence to IT operations data — events, logs, metrics, and traces — to detect problems faster, correlate signals intelligently, and automate remediation at a scale no human team can match alone.


The AIOps data pipeline

AIOps operates on observability data produced by your infrastructure, applications, and services:

Raw signals (metrics, logs, events, traces)
    → Ingestion & normalisation
    → Anomaly detection
    → Event correlation
    → Root cause analysis
    → Automated remediation OR human alert (ranked, enriched)

Layer | Description
Data ingestion | Collect metrics, logs, events, and traces from monitoring tools, the CMDB, and change management
Normalisation | Standardise formats, deduplicate, and enrich with CMDB context (which CI, which service, which team)
Anomaly detection | Identify deviations from baseline using statistical and ML models
Correlation | Group related alerts into a single incident topology, for example reducing 1,000 alerts to 3 actionable incidents
Root cause analysis | Trace the incident to its origin using topology graphs, historical patterns, and change correlation
Remediation | Trigger automated runbooks for known error patterns; route novel incidents to the right human team
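
To make the normalisation layer concrete, here is a minimal sketch of deduplication and CMDB enrichment. The `CMDB` dictionary, the `Event` fields, and the raw event shape are all hypothetical stand-ins for illustration; a real pipeline would query the CMDB through its own API and use a richer deduplication key.

```python
from dataclasses import dataclass

# Hypothetical CMDB lookup: CI name -> (service, owning team).
CMDB = {
    "db-01": ("payments", "team-data"),
    "web-03": ("storefront", "team-web"),
}

@dataclass(frozen=True)
class Event:
    source: str    # monitoring tool that emitted the event
    ci: str        # configuration item the event refers to
    check: str     # what was measured, e.g. "cpu" or "latency"
    severity: str

def normalise(raw_events):
    """Standardise field formats, drop duplicates, and attach CMDB context."""
    seen = set()
    enriched = []
    for raw in raw_events:
        event = Event(
            source=raw["source"],
            ci=raw["host"].strip().lower(),   # tools disagree on casing and whitespace
            check=raw["check"].lower(),
            severity=raw.get("severity", "warning").lower(),
        )
        key = (event.ci, event.check)         # assumption: same CI + check = same signal
        if key in seen:
            continue
        seen.add(key)
        service, team = CMDB.get(event.ci, ("unknown", "unassigned"))
        enriched.append({"event": event, "service": service, "team": team})
    return enriched

raw = [
    {"source": "prometheus", "host": "DB-01 ", "check": "CPU", "severity": "CRITICAL"},
    {"source": "zabbix", "host": "db-01", "check": "cpu"},
    {"source": "prometheus", "host": "web-03", "check": "latency"},
]
result = normalise(raw)
```

Here the two `db-01` CPU events from different tools collapse into one, and each surviving event carries its service and owning team for the later correlation step.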

Core AIOps capabilities

1. Anomaly detection

Traditional monitoring uses static thresholds. AIOps uses dynamic baselines:

  • Metric anomaly detection: detects when a KPI deviates from its normal range, accounting for time-of-day patterns, weekly seasonality, and trend
  • Log anomaly detection: identifies unusual log patterns or new error signatures without requiring explicit rules
  • KPI forecasting: predicts future metric values and alerts before a threshold is breached
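
As a rough illustration of log anomaly detection without explicit rules, the sketch below masks variable tokens (IP addresses, hex values, numbers) so that lines with the same shape share a template, then reports templates never seen in a baseline period. The masking patterns and function names are assumptions chosen for brevity; commercial tools use more sophisticated log parsing and clustering.

```python
import re

def to_template(line: str) -> str:
    """Replace variable tokens so lines with the same shape share one template."""
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", line)  # IPv4 addresses
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)          # hex literals
    return re.sub(r"\d+", "<NUM>", line)                         # remaining numbers

def new_signatures(baseline_lines, live_lines):
    """Return templates present in live logs but absent from the baseline."""
    known = {to_template(line) for line in baseline_lines}
    fresh = []
    for line in live_lines:
        template = to_template(line)
        if template not in known and template not in fresh:
            fresh.append(template)
    return fresh

baseline_logs = [
    "Timeout connecting to 10.0.0.5 after 30s",
    "Request 4821 served in 12 ms",
]
live_logs = [
    "Timeout connecting to 10.0.0.9 after 45s",
    "Request 9930 served in 7 ms",
    "OutOfMemoryError in worker 7",
]
unseen = new_signatures(baseline_logs, live_logs)
```

Only the `OutOfMemoryError` line yields a new template; the timeout and request lines match known shapes despite different IPs and numbers.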

Why static thresholds fail: 90% CPU utilisation at 2 AM on a batch processing server is normal. The same reading at 2 PM on a web server is critical. A static rule cannot capture that context.

Historical metric data → Seasonal decomposition → Dynamic baseline
    → Real-time deviation score → Alert if above threshold
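
The flow above can be sketched in a few lines. This is a deliberately simplified stand-in: instead of full seasonal decomposition it keeps a per-hour mean and standard deviation for one metric series and scores new readings as a z-score. The sample data, the function names, and the threshold of 3 are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baseline(history):
    """history: iterable of (hour_of_day, value). Returns {hour: (mean, stdev)}."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    # stdev needs at least two samples per bucket
    return {h: (mean(v), stdev(v)) for h, v in buckets.items() if len(v) >= 2}

def deviation_score(baseline, hour, value):
    """How many standard deviations the value sits from that hour's norm."""
    mu, sigma = baseline[hour]
    return abs(value - mu) / max(sigma, 1e-9)  # guard against zero variance

def is_anomalous(baseline, hour, value, threshold=3.0):
    return deviation_score(baseline, hour, value) > threshold

# CPU % for one host: busy during the nightly batch window, quiet in the afternoon.
history = (
    [(2, v) for v in (88, 90, 92, 91, 89)]
    + [(14, v) for v in (30, 35, 32, 28, 33)]
)
baseline = build_baseline(history)

night_ok = is_anomalous(baseline, 2, 90)      # 90% at 2 AM matches the baseline
afternoon_bad = is_anomalous(baseline, 14, 90)  # 90% at 2 PM is far outside it
```

The same 90% reading is treated as normal at 2 AM and as an anomaly at 2 PM, which is the behaviour a single static threshold cannot express.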

Tools: Dynatrace Davis AI, Datadog Watchdog, Splunk ITSI, New Relic NRQL Anomaly Detection, Amazon DevOps Guru.

2. Event correlation and noise reduction

Large enterprises generate millions of monitoring events per day. AIOps correlation:

  • Groups related alerts from different monitoring tools into a single incident
  • Suppresses noise — derivative alerts triggered by a single root cause
  • Enriches each incident with CMDB data (owning team, service tier, business impact)
  • Ranks incidents by business impact, not just technical severity

The result: an SRE sees roughly a dozen high-fidelity incidents per hour instead of 5,000 raw alerts.

Without AIOps | With AIOps
5,000 raw alerts per hour | 12 correlated incidents per hour
Alert fatigue → missed P1s | High-SNR queue → every item is actionable
No context on business impact | Each incident linked to affected service and owning team
Manual correlation takes 30 minutes | Correlation runs in seconds
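
One simple way to picture correlation is time-window grouping per service. The sketch below assumes each alert has already been enriched with a `service` field and a numeric timestamp `ts` in seconds; the 300-second window and the `Incident` structure are hypothetical. Production platforms typically combine this kind of grouping with topology and learned patterns.

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    service: str
    first_seen: float
    last_seen: float
    alerts: list = field(default_factory=list)

def correlate(alerts, window_seconds=300):
    """Group alerts that hit the same service within a sliding time window."""
    latest = {}      # service -> most recent incident for that service
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        current = latest.get(alert["service"])
        if current and alert["ts"] - current.last_seen <= window_seconds:
            current.alerts.append(alert)       # same burst: fold into the incident
            current.last_seen = alert["ts"]
        else:
            current = Incident(alert["service"], alert["ts"], alert["ts"], [alert])
            latest[alert["service"]] = current
            incidents.append(current)
    return incidents

alerts = [
    {"ts": 0, "service": "payments", "tool": "prometheus", "msg": "high latency"},
    {"ts": 30, "service": "payments", "tool": "zabbix", "msg": "db connections saturated"},
    {"ts": 45, "service": "storefront", "tool": "datadog", "msg": "5xx rate up"},
    {"ts": 60, "service": "payments", "tool": "splunk", "msg": "timeout errors"},
    {"ts": 1000, "service": "payments", "tool": "prometheus", "msg": "high latency"},
]
incidents = correlate(alerts)
```

Five alerts from four tools become three incidents: one payments burst, one storefront incident, and a later payments recurrence that falls outside the window.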

Tools: BigPanda, Moogsoft, ServiceNow Event Management, PagerDuty AIOps, IBM Instana.

3. Root cause analysis (RCA)

Identifying the root cause of an incident in a microservices environment is hard. AIOps automates much of this work using:

  • Topology-aware correlation: maps the relationship between infrastructure CIs, services, and applications (using the CMDB or auto-discovered dependency maps)
  • Change correlation: checks whether a recent deployment, configuration change, or infrastructure modification coincides with the incident start
  • Pattern matching: compares current incident signatures against historical incidents with known causes and resolutions
  • Blast radius mapping: shows which other services are likely to be affected

Incident detected → Query CMDB topology → Identify impacted CIs
    → Check recent changes → Match historical patterns
    → Ranked list of probable root causes with confidence scores
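
The flow above can be approximated with a toy scoring function. This sketch walks a hypothetical dependency map from the impacted service, then scores each candidate CI higher if it is alerting and higher still if it changed shortly before the incident. The `DEPENDS_ON` map, the weights (0.1, 1.0, 2.0), and the one-hour lookback are arbitrary assumptions, and the "confidence" is only a normalised share of the total score, not a calibrated probability.

```python
from collections import deque

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["payments-db"],
    "cart": [],
    "payments-db": [],
}

def upstream(service):
    """Breadth-first walk over the service and everything it depends on."""
    seen, queue = [], deque([service])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.append(node)
        queue.extend(DEPENDS_ON.get(node, []))
    return seen

def rank_root_causes(impacted, alerting, recent_changes, incident_start, lookback=3600):
    """Return [(ci, confidence)] sorted from most to least probable."""
    scores = {}
    for ci in upstream(impacted):
        score = 0.1                              # small prior so every candidate is listed
        if ci in alerting:
            score += 1.0                         # the CI itself is unhealthy
        changed_at = recent_changes.get(ci)
        if changed_at is not None and 0 <= incident_start - changed_at <= lookback:
            score += 2.0                         # change landed shortly before the incident
        scores[ci] = score
    total = sum(scores.values())
    return sorted(((ci, s / total) for ci, s in scores.items()),
                  key=lambda pair: pair[1], reverse=True)

ranked = rank_root_causes(
    impacted="checkout",
    alerting={"checkout", "payments", "payments-db"},
    recent_changes={"payments-db": 9_700},       # changed 300 s before the incident
    incident_start=10_000,
)
```

In this example `payments-db` ranks first because it is both alerting and recently changed, while `cart`, which is healthy and unchanged, ranks last.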

AIOps RCA does not eliminate human judgement — it focuses the expert's attention on the most probable cause rather than requiring them to search blind.

4. Predictive alerting and capacity forecasting

AIOps looks ahead, not just at what's happening now:

  • Capacity forecasting: predicts when a storage volume, memory pool, or database connection pool will be exhausted
  • Degradation prediction: detects early signals of service deterioration (rising error rates, increasing p99 latency) before they breach SLA thresholds
  • Maintenance window optimisation: recommends the lowest-risk time window for planned changes based on historical traffic patterns
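
Capacity forecasting can be illustrated with an ordinary least-squares trend line. The sketch below fits a straight line to (day, used GB) samples and estimates how many days remain until a given capacity is reached. Linear growth is an assumption made for simplicity; real workloads often need seasonal or non-linear models, and the sample numbers are invented.

```python
def days_until_full(samples, capacity):
    """samples: list of (day, used) pairs. Returns days left after the last
    sample, or None if usage is flat or shrinking."""
    n = len(samples)
    xs = [day for day, _ in samples]
    ys = [used for _, used in samples]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Least-squares slope and intercept of used = slope * day + intercept
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    full_on_day = (capacity - intercept) / slope
    return full_on_day - xs[-1]

# A volume growing by 20 GB per day against a 1,000 GB limit.
usage = [(0, 500), (1, 520), (2, 540), (3, 560), (4, 580)]
remaining = days_until_full(usage, capacity=1000)
```

With this data the fitted line reaches 1,000 GB on day 25, so 21 days remain after the last sample, which leaves time to raise a planned change rather than react to an outage.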

5. Automated remediation

The highest-value AIOps capability — and the one that requires the most governance:

Automation level | Description | ITIL 5 change type
Diagnostics | Automatically gather logs, traces, and CMDB data when an incident opens | No change (information only)
Runbook automation | Execute predefined runbooks for known error patterns (restart service, clear cache, scale pod) | Standard change (pre-approved)
Autonomous remediation | AI selects and executes the remediation without human approval | Standard change (requires strong governance)
Human-in-the-loop | AI recommends an action; a human approves before execution | Normal change (lightweight approval)

⚠️ Autonomous remediation without proper governance can make incidents worse. Apply ITIL 5 change enablement principles: automate standard changes first, and preserve human approval gates for anything with meaningful blast radius.
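
A governance gate of this kind might look like the sketch below. It auto-executes only pre-approved runbooks whose blast radius is small, asks for human approval when more services depend on the target, and routes unknown patterns to a person. The `RUNBOOKS` catalogue, the pattern names, and the blast-radius limit of 2 are hypothetical values, not a recommended policy.

```python
from enum import Enum

class Action(Enum):
    AUTO_EXECUTE = "auto_execute"          # standard change, pre-approved
    REQUEST_APPROVAL = "request_approval"  # a human approves before execution
    ROUTE_TO_HUMAN = "route_to_human"      # novel incident, no known runbook

# Hypothetical catalogue of pre-approved runbooks keyed by error pattern.
RUNBOOKS = {
    "service_hung": "restart_service",
    "cache_full": "clear_cache",
    "pod_saturated": "scale_pod",
}

def decide(pattern, dependent_services, max_auto_blast_radius=2):
    """Return (action, runbook) for a detected error pattern."""
    runbook = RUNBOOKS.get(pattern)
    if runbook is None:
        return Action.ROUTE_TO_HUMAN, None
    if dependent_services > max_auto_blast_radius:
        # Known fix, but too many services depend on this CI to act unattended.
        return Action.REQUEST_APPROVAL, runbook
    return Action.AUTO_EXECUTE, runbook

low_risk = decide("cache_full", dependent_services=1)
high_risk = decide("service_hung", dependent_services=8)
novel = decide("unknown_kernel_panic", dependent_services=1)
```

Keeping the decision in one small, auditable function makes it easier to review which patterns are allowed to run unattended and to tighten the limit after an incident.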

6. Change impact analysis

Before deploying a change, AIOps can assess its risk:

  • Analyse historical incidents correlated with changes to similar CIs
  • Map all services that depend on the CI being changed
  • Score the change risk (probability of incident × business impact)
  • Recommend scheduling the change in a low-risk window
  • Flag changes that are likely to affect SLA-critical services
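
The risk score in the list above (probability of incident × business impact) can be sketched as follows. The probability is estimated here as the historical incident rate for changes to the same CI type, and the impact as the weight of the most critical dependent service. The history records, the tier weights, and the 0.5 fallback when no history exists are all illustrative assumptions.

```python
def change_risk(history, ci_type, dependents, tier_impact):
    """Risk = P(incident | change to this CI type) x impact of the worst-hit tier.

    history:     list of {"ci_type": str, "caused_incident": bool}
    dependents:  {service_name: tier} for services depending on the CI
    tier_impact: {tier: impact weight}
    """
    past = [h for h in history if h["ci_type"] == ci_type]
    if past:
        probability = sum(h["caused_incident"] for h in past) / len(past)
    else:
        probability = 0.5   # assumption: no history means "unknown", not "safe"
    impact = max((tier_impact[tier] for tier in dependents.values()), default=0)
    return probability * impact

history = [
    {"ci_type": "database", "caused_incident": True},
    {"ci_type": "database", "caused_incident": False},
    {"ci_type": "database", "caused_incident": False},
    {"ci_type": "database", "caused_incident": False},
    {"ci_type": "load_balancer", "caused_incident": False},
]
tier_impact = {1: 10, 2: 5, 3: 1}                   # tier 1 = SLA-critical
dependents = {"checkout": 1, "reporting": 3}

risk = change_risk(history, "database", dependents, tier_impact)
```

One in four past database changes caused an incident (0.25) and a tier-1 service depends on this CI (impact 10), giving a score of 2.5 that can be compared across pending changes.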

This directly supports ITIL 5's Change Enablement capability — using C4 (Cognition) to make change risk scoring data-driven rather than opinion-driven.


AIOps and observability

AIOps requires a strong observability foundation. The three pillars:

Pillar | Description | AIOps use
Metrics | Numerical time-series data (CPU, latency, error rate) | Anomaly detection, capacity forecasting
Logs | Structured or unstructured text records of events | Log anomaly detection, RCA pattern matching
Traces | End-to-end request journeys across distributed services | Latency attribution, dependency mapping

Without sufficient observability data, AIOps models have nothing to learn from. Observability is the prerequisite, not the destination.

ITIL 5 connection: observability is formally defined in ITIL 5 as "the ability to understand the internal state of a complicated system by analysing its external outputs." It is a key input to the Operate and Support activities of the PSLM.


AIOps aligned to ITIL 5

ITIL 5 element | AIOps contribution
Operate activity | Anomaly detection, predictive alerting, infrastructure monitoring
Support activity | Event correlation, RCA, automated runbook execution
Incident Management practice | AI-first triage, priority scoring, noise suppression
Problem Management practice | Pattern-based RCA, known error identification
Change Enablement practice | Change risk scoring, impact analysis, deployment windows
C2 (Curation) | Alert filtering, noise reduction, signal prioritisation
C4 (Cognition) | Root cause reasoning, change risk prediction, incident impact scoring
C6 (Coordination) | Autonomous remediation workflows with approval gates

Key AIOps metrics

Metric | What it measures | Target
Alert-to-incident ratio | How effectively alerts are correlated into incidents | ≥ 50:1 (at least 50 alerts → 1 incident)
MTTD (Mean Time to Detect) | Time from issue start to alert firing | < 5 minutes for P1 services
MTTR (Mean Time to Restore) | Time from detection to service restoration | < 1 hour for tier-1 services
False positive rate | % of alerts that do not correspond to real issues | < 10%
Automation rate | % of incidents resolved without human intervention | ≥ 40% (standard changes)
Alert fatigue index | Ratio of actionable alerts to total alerts per operator per shift | > 60% actionable
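
Several of these metrics fall out directly from incident records. The sketch below assumes each incident stores `started`, `detected`, and `restored` timestamps in seconds plus an `auto_resolved` flag; those field names and the sample figures are invented for illustration.

```python
from statistics import mean

def aiops_metrics(incidents, total_alerts, false_alerts):
    """Compute headline AIOps metrics from incident records and alert counts."""
    return {
        # Mean time from issue start to detection, in minutes
        "mttd_min": mean(i["detected"] - i["started"] for i in incidents) / 60,
        # Mean time from detection to restoration, in minutes
        "mttr_min": mean(i["restored"] - i["detected"] for i in incidents) / 60,
        "false_positive_rate": false_alerts / total_alerts,
        "automation_rate": sum(i["auto_resolved"] for i in incidents) / len(incidents),
        "alerts_per_incident": total_alerts / len(incidents),
    }

incidents = [
    {"started": 0, "detected": 120, "restored": 1800, "auto_resolved": True},
    {"started": 0, "detected": 240, "restored": 3000, "auto_resolved": False},
]
metrics = aiops_metrics(incidents, total_alerts=200, false_alerts=15)
```

For this sample the MTTD is 3 minutes, the MTTR is 37 minutes, the false positive rate is 7.5%, half of the incidents were resolved automatically, and 100 alerts were compressed into each incident.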

AIOps implementation roadmap

Phase 1 — Observe (0–3 months)

  • Centralise observability data (metrics, logs, events) into a single platform
  • Connect your CMDB to your event management tool
  • Establish baseline metrics for all tier-1 services

Phase 2 — Correlate (3–6 months)

  • Enable event correlation and noise reduction
  • Start RCA correlation with change management data
  • Train your team on triaging AIOps-correlated incidents

Phase 3 — Predict (6–12 months)

  • Activate anomaly detection with dynamic baselines
  • Enable capacity forecasting for critical infrastructure
  • Integrate change risk scoring into your change enablement process

Phase 4 — Automate (12+ months)

  • Implement runbook automation for the top 20 most common incident types
  • Establish human-in-the-loop approval gates before enabling any autonomous remediation
  • Build a continuous improvement loop: feed every false positive back into model tuning
