MLOps Practices

MLOps operationalises machine learning. It turns the research-and-notebook world of data science into reliable, governed, production-grade ML systems.

The ML model lifecycle

The ML lifecycle has eight phases. MLOps provides the practices, tools, and automation to move through them reliably:

Problem framing → Data engineering → Experimentation → Model training
    → Evaluation → Deployment → Monitoring → Retraining

Phase	What happens	MLOps practice
Problem framing	Define the ML task, success metrics, and data requirements	ML canvas, feasibility assessment
Data engineering	Collect, clean, validate, and version training data	Data pipelines, feature stores, data contracts
Experimentation and model training	Train candidate models, tune hyperparameters, compare runs	Experiment tracking (MLflow, W&B)
Evaluation	Assess model performance, fairness, and risk	Automated evaluation gates, model cards
Deployment	Release model to a serving environment	CI/CD pipeline, canary/shadow deployments
Monitoring	Track model performance, data drift, and concept drift in production	Model observability, dashboards, alerts
Retraining	Retrain model when performance degrades or distribution shifts	Triggered retraining pipelines

Core MLOps practices

1. Experiment tracking

Every model training run should be tracked in enough detail to reproduce it:

What to track: hyperparameters, dataset version, code commit, training metrics (loss, accuracy, F1), evaluation metrics, runtime
Why it matters: without tracking, you cannot reproduce a model, compare experiments, or audit which model was deployed when
Tools: MLflow Tracking, Weights & Biases, Neptune, Azure ML Experiments, SageMaker Experiments

Experiment → Run (hyperparams + metrics + artifacts) → Compare → Promote best run
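
As a minimal sketch of the Experiment → Run → Compare → Promote flow, the snippet below stores each run as a JSON file and picks the best one by a metric. `Run` and `ExperimentLog` are hypothetical names invented for illustration. They are not the API of MLflow or any other tool listed above, and the run values are made up.

```python
import json
import tempfile
import time
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class Run:
    """One training run: what is needed to reproduce and compare it."""
    run_id: str
    params: dict          # hyperparameters
    dataset_version: str  # e.g. a snapshot hash or tag
    code_commit: str      # e.g. a git SHA
    metrics: dict = field(default_factory=dict)
    started_at: float = field(default_factory=time.time)


class ExperimentLog:
    """Hypothetical file-backed tracker: one JSON file per run."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, run: Run) -> None:
        (self.root / f"{run.run_id}.json").write_text(json.dumps(asdict(run)))

    def load_all(self) -> list[Run]:
        paths = sorted(self.root.glob("*.json"))
        return [Run(**json.loads(p.read_text())) for p in paths]

    def best(self, metric: str) -> Run:
        # Higher is better here; a loss metric would use min() instead.
        return max(self.load_all(), key=lambda r: r.metrics[metric])


# Illustrative values only.
log = ExperimentLog(Path(tempfile.mkdtemp()))
log.save(Run("run-a", {"lr": 0.1}, "data-v1", "abc123", {"f1": 0.81}))
log.save(Run("run-b", {"lr": 0.01}, "data-v1", "abc123", {"f1": 0.86}))
best_run = log.best("f1")
print(best_run.run_id, best_run.params)
```

Because each run carries its dataset version and code commit, the promoted run can be traced back to the exact inputs that produced it.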

2. Data versioning and feature stores

Models are only as good as their training data. MLOps requires data to be:

Versioned: each training run is linked to a specific snapshot of training data
Validated: data quality checks run before training (schema validation, null checks, distribution tests)
Governed: lineage is tracked from raw source to feature to model
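
The versioning and validation requirements above could be sketched as follows. This is an illustration under stated assumptions: the column names, the baseline mean, and the 20% tolerance are invented, and a mean-shift comparison is only a crude stand-in for a proper distribution test.

```python
import hashlib
import json
import statistics


def snapshot_version(rows: list[dict]) -> str:
    """Content hash of the dataset, usable as a version id for a training run."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


def validate(rows, schema, baseline_means, rel_tolerance=0.2):
    """Return a list of problems; an empty list means the data passed."""
    problems = []
    # Schema validation and null checks, row by row.
    for i, row in enumerate(rows):
        for column, expected_type in schema.items():
            value = row.get(column)
            if value is None:
                problems.append(f"row {i}: {column} is null or missing")
            elif not isinstance(value, expected_type):
                problems.append(
                    f"row {i}: {column} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    # Crude distribution test: relative shift of the column mean.
    # Assumes each baseline mean is non-zero.
    for column, expected_mean in baseline_means.items():
        values = [r[column] for r in rows if isinstance(r.get(column), (int, float))]
        if values:
            shift = abs(statistics.fmean(values) - expected_mean) / abs(expected_mean)
            if shift > rel_tolerance:
                problems.append(f"{column}: mean shifted by {shift:.0%} from baseline")
    return problems


# Hypothetical schema and data.
schema = {"amount": float, "country": str}
baseline = {"amount": 50.0}
good = [{"amount": 40.0, "country": "DE"}, {"amount": 60.0, "country": "FR"}]
bad = [{"amount": 400.0, "country": None}, {"amount": "60", "country": "FR"}]

print(snapshot_version(good), validate(good, schema, baseline))
print(snapshot_version(bad), validate(bad, schema, baseline))
```

A training pipeline would typically refuse to start when `validate` returns any problems, and would record the snapshot hash alongside the run.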

A feature store centralises engineered features so they can be reused across models and teams. Because training and serving read the same feature definitions, it also reduces training-serving skew, where a model sees differently computed features in production than it saw during training.

Layer	Role
Offline store	Historical features for training (batch)
Online store	Low-latency features for real-time inference
Feature registry	Metadata, ownership, and documentation for each feature

Tools: Feast, Tecton, Vertex AI Feature Store, SageMaker Feature Store, Databricks Feature Engineering.
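
To make the three layers concrete, here is a toy in-memory sketch. `TinyFeatureStore` and its methods are hypothetical and do not mirror the API of Feast or any other tool listed above. The point it illustrates is that a single write path feeds both stores, and that historical reads are point-in-time.

```python
from datetime import datetime


class TinyFeatureStore:
    """Toy feature store with a registry, an offline store, and an online store."""

    def __init__(self):
        self.registry = {}  # feature name -> metadata
        self.offline = []   # full history: (entity_id, feature, timestamp, value)
        self.online = {}    # (entity_id, feature) -> (timestamp, latest value)

    def register(self, name, owner, description):
        self.registry[name] = {"owner": owner, "description": description}

    def write(self, entity_id, feature, timestamp, value):
        if feature not in self.registry:
            raise KeyError(f"unregistered feature: {feature}")
        # One write path feeds both stores, so they cannot disagree on values.
        self.offline.append((entity_id, feature, timestamp, value))
        key = (entity_id, feature)
        latest = self.online.get(key)
        if latest is None or timestamp >= latest[0]:
            self.online[key] = (timestamp, value)

    def get_online(self, entity_id, feature):
        """Latest value, for real-time inference."""
        return self.online[(entity_id, feature)][1]

    def get_historical(self, entity_id, feature, as_of):
        """Latest value at or before as_of, for building training sets."""
        candidates = [
            (ts, v) for e, f, ts, v in self.offline
            if e == entity_id and f == feature and ts <= as_of
        ]
        return max(candidates, key=lambda c: c[0])[1] if candidates else None


# Hypothetical feature and values.
store = TinyFeatureStore()
store.register("orders_30d", owner="growth-team", description="Orders in the last 30 days")
store.write("user-1", "orders_30d", datetime(2024, 1, 1), 3)
store.write("user-1", "orders_30d", datetime(2024, 2, 1), 7)

print(store.get_online("user-1", "orders_30d"))
print(store.get_historical("user-1", "orders_30d", datetime(2024, 1, 15)))
```

The point-in-time lookup matters for training: a label observed on 15 January must only be joined to feature values that were known on or before that date.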

3. CI/CD for ML

CI/CD in ML goes beyond deploying code — it includes:

CI (Continuous Integration): automated data validation, unit tests for feature engineering logic, model training on a subset, quality gate checks
CD (Continuous Delivery): build the model artifact, package it, publish to a model registry, deploy to staging
CT (Continuous Training): the ML-specific addition — retrain the model when data drifts or a schedule triggers, run evaluation, and promote automatically if quality gates pass

Code commit → Data validation → Train on sample → Evaluate → Gate check
    → Build artifact → Deploy to staging → Integration test → Deploy to production
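
The gate check in the flow above might look like the sketch below: a candidate is blocked if it falls under an absolute floor or regresses against the current production model, and the pipeline stops at the first failing stage. The thresholds, stage names, and stand-in lambdas are assumptions for illustration, not a recommended configuration.

```python
def gate_check(candidate, production, min_f1=0.80, max_regression=0.01):
    """Return (passed, reasons). Thresholds are illustrative only."""
    reasons = []
    if candidate["f1"] < min_f1:
        reasons.append(f"F1 {candidate['f1']:.2f} is below the floor {min_f1:.2f}")
    if candidate["f1"] < production["f1"] - max_regression:
        reasons.append("F1 regressed against the production model")
    return (not reasons, reasons)


def run_pipeline(stages, context):
    """Run stages in order and stop at the first failure.

    Returns (completed_stage_names, failed_stage_name_or_None).
    """
    completed = []
    for name, stage in stages:
        if not stage(context):
            return completed, name
        completed.append(name)
    return completed, None


def make_stages():
    # Each stage is a stand-in; real stages would call validation,
    # training, and deployment tooling.
    return [
        ("data validation", lambda ctx: ctx["rows"] > 0),
        ("train on sample", lambda ctx: True),
        ("gate check", lambda ctx: gate_check(ctx["candidate"], ctx["production"])[0]),
        ("deploy to staging", lambda ctx: True),
    ]


good = {"rows": 1000, "candidate": {"f1": 0.86}, "production": {"f1": 0.84}}
bad = {"rows": 1000, "candidate": {"f1": 0.78}, "production": {"f1": 0.84}}

print(run_pipeline(make_stages(), good))
print(run_pipeline(make_stages(), bad))
```

Returning the reasons alongside the pass/fail flag keeps the gate auditable: the pipeline log can show why a candidate was rejected.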

CT (Continuous Training) is often the most impactful MLOps practice for production ML, because a model that isn't retrained becomes stale as the real world changes.

4. Model registry and versioning

A model registry is the central catalogue of trained model artifacts. It provides:

Versioning: every trained model is given a unique version with its metadata
Lifecycle stages: Staging → Production → Archived
Audit trail: who promoted which model, when, and based on which evaluation results
Rollback: revert to a previous model version if the new one underperforms

Tools: MLflow Model Registry, Vertex AI Model Registry, SageMaker Model Registry, Hugging Face Hub, Azure ML Model Registry.
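
The four capabilities above can be sketched with a small in-memory registry. `ModelRegistry` and the artifact paths are hypothetical, and none of this is the API of the tools listed. The sketch follows the Staging → Production → Archived stages described above and treats rollback as re-promoting the previous production version.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    metrics: dict
    stage: str = "Staging"


class ModelRegistry:
    """Toy registry: versioning, lifecycle stages, audit trail, rollback."""

    def __init__(self):
        self.versions = {}
        self.audit_log = []

    def register(self, artifact_uri, metrics):
        version = len(self.versions) + 1
        self.versions[version] = ModelVersion(version, artifact_uri, metrics)
        return version

    def production_version(self):
        for mv in self.versions.values():
            if mv.stage == "Production":
                return mv.version
        return None

    def promote(self, version, actor, reason):
        previous = self.production_version()
        if previous is not None:
            self.versions[previous].stage = "Archived"
        self.versions[version].stage = "Production"
        # Record who promoted which version, when, and why.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "promoted": version,
            "previous": previous,
            "reason": reason,
        })

    def rollback(self, actor, reason):
        """Re-promote the version that was in production before the last promotion."""
        previous = self.audit_log[-1]["previous"] if self.audit_log else None
        if previous is None:
            raise RuntimeError("no earlier production version to roll back to")
        self.promote(previous, actor, f"rollback: {reason}")


# Hypothetical artifact paths, actors, and metrics.
registry = ModelRegistry()
v1 = registry.register("models/churn/1", {"f1": 0.84})
v2 = registry.register("models/churn/2", {"f1": 0.86})
registry.promote(v1, "alice", "initial release")
registry.promote(v2, "bob", "higher F1 on evaluation set")
registry.rollback("carol", "latency regression in production")

print(registry.production_version(), registry.versions[v2].stage)
```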

5. Deployment patterns

Pattern	Description	When to use
Online serving	Real-time REST/gRPC inference endpoint	Low-latency applications (fraud detection, recommendations)
Batch scoring	Run predictions on a dataset on a schedule	High-volume, latency-tolerant workflows (nightly reports)
Streaming inference	Predictions on a data stream (Kafka, Kinesis)	Event-driven real-time systems
Canary deployment	Route a small % of traffic to the new model	Safely validate in production before full rollout
Shadow deployment	New model receives a copy of production traffic, but its predictions are logged rather than served	Low-risk production validation with no user-facing impact
A/B testing	Split traffic between model versions and compare business metrics	Optimise for business outcome, not just ML metrics
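
Two of the patterns in the table, canary and shadow deployment, can be sketched in a few lines. The hash-based split is one common way to keep routing stable per request or user id. It is an illustrative choice here, and the function names and stand-in models are hypothetical.

```python
import hashlib


def bucket(request_id: str) -> int:
    """Map an id to a stable bucket in 0..9999, so the same id always routes the same way."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def canary_route(request_id: str, canary_percent: float) -> str:
    """Send roughly canary_percent of ids to the candidate model."""
    return "candidate" if bucket(request_id) < int(canary_percent * 100) else "stable"


def shadow_predict(features, stable_model, candidate_model, shadow_log):
    """Serve the stable prediction; record the candidate's output for offline comparison."""
    served = stable_model(features)
    try:
        shadow_log.append(
            {"features": features, "served": served, "shadow": candidate_model(features)}
        )
    except Exception as exc:  # a failing candidate must not affect the caller
        shadow_log.append({"features": features, "served": served, "error": repr(exc)})
    return served


# Canary: about 10% of traffic goes to the candidate.
routes = [canary_route(f"user-{i}", 10) for i in range(10_000)]
print(routes.count("candidate") / len(routes))

# Shadow: the caller only ever sees the stable model's answer.
shadow_log = []
result = shadow_predict(2, lambda x: x * 2, lambda x: x * 3, shadow_log)
print(result, shadow_log)
```

In a real system the shadow call would usually run asynchronously so that the candidate's latency does not add to the served request.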

6. Model monitoring and drift detection

A deployed model degrades over time. MLOps requires continuous monitoring of:

Data drift: the statistical distribution of input features changes from what the model was trained on
Concept drift: the relationship between inputs and the target variable changes (the model's assumptions are no longer valid)
Model performance drift: accuracy, precision, recall, or business metrics degrade
Infrastructure metrics: latency, throughput, error rate of the serving endpoint

Input data → Statistical comparison vs training baseline → Drift score
    → Alert threshold → Trigger retraining pipeline or human review

Tools: Evidently AI, Arize AI, WhyLabs, Fiddler, Azure ML Data Drift, SageMaker Model Monitor.
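
One way to turn the comparison against a training baseline into a drift score is the Population Stability Index (PSI), sketched below for a single numeric feature. PSI is a swapped-in example technique, not something the tools above are claimed to use. The 0.1 and 0.25 cut-offs are a commonly quoted rule of thumb, and the mapping of scores to actions is an assumption for illustration.

```python
import math


def psi(baseline, current, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            index = int((v - lo) / width)
            # Clamp out-of-range values into the edge bins.
            counts[min(max(index, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = proportions(baseline), proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def drift_action(score, warn=0.1, alert=0.25):
    """Map a drift score to an action. Thresholds are a rule of thumb."""
    if score >= alert:
        return "trigger retraining"
    if score >= warn:
        return "human review"
    return "ok"


# Synthetic data: the current sample is the baseline shifted by 5.
baseline = [i / 100 for i in range(1000)]
shifted = [v + 5 for v in baseline]

print(psi(baseline, baseline), drift_action(psi(baseline, baseline)))
print(psi(baseline, shifted), drift_action(psi(baseline, shifted)))
```

A score like this detects data drift only. Concept drift and performance drift need ground-truth labels, which often arrive with a delay.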

MLOps maturity levels

Level	Description
0 — Manual	Models built in notebooks, deployed manually (if at all), no monitoring
1 — Tracked	Experiments tracked, models versioned, basic CI/CD, some monitoring
2 — Automated	Full CI/CD/CT pipelines, automated retraining, drift detection, model registry
3 — Governed	Full lineage, fairness checks, explainability, regulatory audit trail, enterprise governance

Many organisations start at Level 0 or 1. Level 2 is a sensible target for teams with models in production. Level 3 is typically expected in regulated industries, where lineage and audit trails are often mandatory.

MLOps aligned to ITIL 5 PSLM

PSLM Activity	MLOps role
Discover	Problem framing, feasibility assessment, data availability review
Design	ML canvas, model architecture decisions, feature design, evaluation criteria
Acquire	Data acquisition, labelling contracts, compute procurement, platform licences
Build	Model training, experiment tracking, hyperparameter tuning, model evaluation
Transition	Model registry promotion, staging validation, canary/shadow deployment
Operate	Model serving infrastructure monitoring, latency and throughput tracking
Deliver	Batch scoring runs, API access for consuming applications
Support	Drift detection, incident response for model degradation, rollback

Key metrics

Metric	What it measures	Target (typical)
Model accuracy / F1 / AUC	Predictive performance on held-out test set	At or above the current baseline, enforced by a regression test
Training pipeline success rate	% of training runs that complete without error	≥ 99%
Deployment frequency	How often new model versions are released to production	Weekly or triggered
MTTR for model incidents	Time from drift detection to model restored or rolled back	< 4 hours
Data freshness	Age of the most recent training data snapshot	Depends on domain
Inference latency (p99)	99th percentile response time for real-time endpoints	< 100 ms
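
Three of the metrics in the table can be computed directly from logs. The sketch below uses the nearest-rank definition of a percentile, which is one of several in use, and checks the results against the typical targets from the table. All sample values are invented.

```python
import math
from datetime import datetime


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def mttr_hours(incidents):
    """Mean time from drift detection to restore or rollback, in hours."""
    durations = [
        (restored - detected).total_seconds() / 3600
        for detected, restored in incidents
    ]
    return sum(durations) / len(durations)


# Invented sample data.
latencies_ms = list(range(1, 101))  # 1 ms .. 100 ms
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 0)),   # 2 hours
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 18, 0)),  # 4 hours
]
runs_total, runs_ok = 200, 198

p99 = percentile(latencies_ms, 99)
mttr = mttr_hours(incidents)
success_rate = runs_ok / runs_total

print(f"p99 latency: {p99} ms, target met: {p99 < 100}")
print(f"MTTR: {mttr} h, target met: {mttr < 4}")
print(f"Training success rate: {success_rate:.1%}, target met: {success_rate >= 0.99}")
```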