MLOps Practices
MLOps operationalises machine learning. It turns the research-and-notebook world of data science into reliable, governed, production-grade ML systems.
The ML model lifecycle
The ML lifecycle has eight phases. MLOps provides the practices, tools, and automation to move through them reliably:
Problem framing → Data engineering → Experimentation → Model training
→ Evaluation → Deployment → Monitoring → Retraining
| Phase | What happens | MLOps practice |
|---|---|---|
| Problem framing | Define the ML task, success metrics, and data requirements | ML canvas, feasibility assessment |
| Data engineering | Collect, clean, validate, and version training data | Data pipelines, feature stores, data contracts |
| Experimentation and model training | Train candidate models, tune hyperparameters, compare runs | Experiment tracking (MLflow, W&B) |
| Evaluation | Assess model performance, fairness, and risk | Automated evaluation gates, model cards |
| Deployment | Release model to a serving environment | CI/CD pipeline, canary/shadow deployments |
| Monitoring | Track model performance, data drift, and concept drift in production | Model observability, dashboards, alerts |
| Retraining | Retrain model when performance degrades or distribution shifts | Triggered retraining pipelines |
Core MLOps practices
1. Experiment tracking
Every model training run should be tracked with full reproducibility:
- What to track: hyperparameters, dataset version, code commit, training metrics (loss, accuracy, F1), evaluation metrics, runtime
- Why it matters: without tracking, you cannot reproduce a model, compare experiments, or audit which model was deployed when
- Tools: MLflow Tracking, Weights & Biases, Neptune, Azure ML Experiments, SageMaker Experiments
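As a rough illustration of what a tracked run contains, the sketch below uses only the Python standard library. The `Run` and `Experiment` classes, the field names, and the sample values are hypothetical and are not the API of any of the tools listed above; a real tracker would also persist artifacts and store the records centrally.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One training run, with what is needed to reproduce and compare it."""
    run_id: str
    params: Dict[str, float]    # hyperparameters
    dataset_version: str        # data snapshot the run was trained on
    code_commit: str            # commit of the training code
    metrics: Dict[str, float]   # e.g. {"f1": 0.86}


@dataclass
class Experiment:
    name: str
    runs: List[Run] = field(default_factory=list)

    def log_run(self, run: Run) -> None:
        self.runs.append(run)

    def best_run(self, metric: str, higher_is_better: bool = True) -> Run:
        """Return the run with the best value for `metric`."""
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run reports metric {metric!r}")
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


# Hypothetical sample data for illustration only.
exp = Experiment("churn-classifier")
exp.log_run(Run("run-1", {"lr": 0.1}, "data-v3", "a1b2c3d", {"f1": 0.81}))
exp.log_run(Run("run-2", {"lr": 0.01}, "data-v3", "a1b2c3d", {"f1": 0.86}))
best = exp.best_run("f1")
print(best.run_id, best.params, best.dataset_version)
```

Because each run carries its dataset version and code commit, the promoted run can be traced back to the exact inputs that produced it.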
Experiment → Run (hyperparams + metrics + artifacts) → Compare → Promote best run
2. Data versioning and feature stores
Models are only as good as their training data. MLOps requires data to be:
- Versioned: each training run is linked to a specific snapshot of training data
- Validated: data quality checks run before training (schema validation, null checks, distribution tests)
- Governed: lineage is tracked from raw source to feature to model
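The validation step can be sketched as a function that runs before training and returns a list of failures. This is a minimal example under stated assumptions: the `validate_batch` name, the row format, the thresholds, and the mean-shift check (a crude stand-in for a proper distribution test) are all illustrative.

```python
from statistics import mean
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def validate_batch(
    rows: List[Row],
    schema: List[str],
    baseline_means: Dict[str, float],
    max_null_rate: float = 0.05,   # illustrative threshold
    max_mean_shift: float = 0.25,  # illustrative threshold
) -> List[str]:
    """Return a list of failure messages; an empty list means the batch passes."""
    if not rows:
        return ["batch is empty"]
    failures: List[str] = []
    for col in schema:
        # Schema validation: every row must carry the expected column.
        if any(col not in row for row in rows):
            failures.append(f"{col}: column missing from at least one row")
            continue
        values = [row[col] for row in rows]
        # Null check: the share of missing values must stay under the limit.
        null_rate = sum(v is None for v in values) / len(values)
        if null_rate > max_null_rate:
            failures.append(f"{col}: null rate {null_rate:.0%} is too high")
        # Distribution test (crude): relative shift of the mean vs the baseline.
        present = [v for v in values if v is not None]
        if present and col in baseline_means:
            base = baseline_means[col]
            shift = abs(mean(present) - base) / (abs(base) or 1.0)
            if shift > max_mean_shift:
                failures.append(f"{col}: mean shifted {shift:.0%} from baseline")
    return failures


# Hypothetical sample batches.
good = [{"age": 30.0}, {"age": 40.0}, {"age": 50.0}]
bad = [{"age": None}, {"age": 80.0}]
print(validate_batch(good, ["age"], {"age": 40.0}))
print(validate_batch(bad, ["age"], {"age": 40.0}))
```

A training pipeline would typically stop, or route the batch to review, when the returned list is not empty.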
A feature store centralises engineered features so they can be reused across models and teams. Because training and serving read the same feature definitions, it also reduces the risk of training-serving skew.
| Layer | Role |
|---|---|
| Offline store | Historical features for training (batch) |
| Online store | Low-latency features for real-time inference |
| Feature registry | Metadata, ownership, and documentation for each feature |
Tools: Feast, Tecton, Vertex AI Feature Store, SageMaker Feature Store, Databricks Feature Engineering.
3. CI/CD for ML
CI/CD in ML goes beyond deploying code — it includes:
- CI (Continuous Integration): automated data validation, unit tests for feature engineering logic, model training on a subset, quality gate checks
- CD (Continuous Delivery): build the model artifact, package it, publish to a model registry, deploy to staging
- CT (Continuous Training): the ML-specific addition — retrain the model when data drifts or a schedule triggers, run evaluation, and promote automatically if quality gates pass
Code commit → Data validation → Train on sample → Evaluate → Gate check
→ Build artifact → Deploy to staging → Integration test → Deploy to production
CT (Continuous Training) is the most impactful MLOps practice for production ML. A model that isn't retrained becomes stale as the real world changes.
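The gate check in the pipeline above can be sketched as a pure function that a CI or CT job calls after evaluation. The function name, the metric names, and the tolerance are assumptions for illustration, and the sketch assumes that higher is better for every metric it is given.

```python
from typing import Dict


def passes_quality_gate(
    candidate: Dict[str, float],
    production: Dict[str, float],
    min_thresholds: Dict[str, float],
    max_regression: float = 0.01,  # illustrative tolerance
) -> bool:
    """Return True only if the candidate clears every gated metric.

    Assumes higher is better for all metrics in `min_thresholds`.
    """
    for metric, floor in min_thresholds.items():
        value = candidate.get(metric)
        # Absolute floor: a missing or too-low metric fails the gate.
        if value is None or value < floor:
            return False
        # Regression check: do not fall behind the current production model.
        current = production.get(metric)
        if current is not None and value < current - max_regression:
            return False
    return True


# Hypothetical evaluation results.
print(passes_quality_gate({"f1": 0.88}, {"f1": 0.86}, {"f1": 0.80}))
print(passes_quality_gate({"f1": 0.84}, {"f1": 0.86}, {"f1": 0.80}))
```

Keeping the gate free of side effects makes it easy to unit test and to reuse in both the code-triggered pipeline and the retraining pipeline.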
4. Model registry and versioning
A model registry is the central catalogue of trained model artifacts. It provides:
- Versioning: every trained model is given a unique version with its metadata
- Lifecycle stages: Staging → Production → Archived
- Audit trail: who promoted which model, when, and based on which evaluation results
- Rollback: revert to a previous model version if the new one underperforms
Tools: MLflow Model Registry, Vertex AI Model Registry, SageMaker Model Registry, Hugging Face Hub, Azure ML Model Registry.
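To make the four capabilities concrete, here is a minimal in-memory sketch. The `ModelRegistry` class, its method names, and the artifact URIs are hypothetical and do not mirror the API of any registry listed above; a real registry persists this state and enforces access control.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    metrics: Dict[str, float]
    stage: str = "Staging"


class ModelRegistry:
    def __init__(self) -> None:
        self.versions: Dict[int, ModelVersion] = {}
        self.audit_log: List[Tuple[str, str, int]] = []  # (actor, action, version)
        self._history: List[int] = []  # versions promoted to Production, in order

    def register(self, artifact_uri: str, metrics: Dict[str, float], actor: str) -> ModelVersion:
        """Versioning: each registered artifact gets the next version number."""
        mv = ModelVersion(len(self.versions) + 1, artifact_uri, metrics)
        self.versions[mv.version] = mv
        self.audit_log.append((actor, "register", mv.version))
        return mv

    def production(self) -> Optional[ModelVersion]:
        return next((m for m in self.versions.values() if m.stage == "Production"), None)

    def promote(self, version: int, actor: str) -> None:
        """Lifecycle stages: archive the current Production model, then promote."""
        target = self.versions[version]  # raises KeyError for unknown versions
        current = self.production()
        if current is not None and current.version == version:
            return  # already in Production
        if current is not None:
            current.stage = "Archived"
        target.stage = "Production"
        self._history.append(version)
        self.audit_log.append((actor, "promote", version))

    def rollback(self, actor: str) -> ModelVersion:
        """Rollback: restore the version that was in Production before this one."""
        if len(self._history) < 2:
            raise RuntimeError("no previous production version to roll back to")
        replaced = self._history.pop()
        previous = self._history[-1]
        self.versions[replaced].stage = "Archived"
        self.versions[previous].stage = "Production"
        self.audit_log.append((actor, "rollback", previous))
        return self.versions[previous]


# Hypothetical usage.
reg = ModelRegistry()
reg.register("models/churn/1", {"f1": 0.81}, actor="alice")
reg.promote(1, actor="alice")
reg.register("models/churn/2", {"f1": 0.86}, actor="bob")
reg.promote(2, actor="bob")
restored = reg.rollback(actor="carol")
print(restored.version, reg.audit_log[-1])
```

The audit log here records only who did what to which version; in practice it would also carry a timestamp and a link to the evaluation results that justified the promotion.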
5. Deployment patterns
| Pattern | Description | When to use |
|---|---|---|
| Online serving | Real-time REST/gRPC inference endpoint | Low-latency applications (fraud detection, recommendations) |
| Batch scoring | Run predictions on a dataset on a schedule | High-volume, latency-tolerant workflows (nightly reports) |
| Streaming inference | Predictions on a data stream (Kafka, Kinesis) | Event-driven real-time systems |
| Canary deployment | Route a small % of traffic to the new model | Safely validate in production before full rollout |
| Shadow deployment | New model receives a copy of production traffic but its predictions are not served | Production validation with no user-facing impact |
| A/B testing | Split traffic between model versions and compare business metrics | Optimise for business outcome, not just ML metrics |
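One common way to implement the traffic split behind a canary deployment is deterministic hashing of a request or user identifier, so the same caller always lands on the same model. The sketch below assumes this approach; the function name and the identifier format are illustrative.

```python
import hashlib


def route_request(request_id: str, canary_percent: float) -> str:
    """Deterministically assign a request to the 'canary' or 'stable' model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets, giving 0.01% routing granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_percent * 100 else "stable"


# Hypothetical check: send 5% of traffic to the canary.
share = sum(route_request(f"req-{i}", 5) == "canary" for i in range(10_000)) / 10_000
print(f"canary share: {share:.1%}")
```

Hashing rather than random sampling keeps assignments stable across retries, which also makes the same function usable for A/B tests where a user should see one variant consistently.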
6. Model monitoring and drift detection
A deployed model degrades over time. MLOps requires continuous monitoring of:
- Data drift: the statistical distribution of input features changes from what the model was trained on.
- Concept drift: the relationship between inputs and the target variable changes (the model's assumptions are no longer valid).
- Model performance drift: accuracy, precision, recall, or business metrics degrade.
- Infrastructure metrics: latency, throughput, error rate of the serving endpoint.
Input data → Statistical comparison vs training baseline → Drift score
→ Alert threshold → Trigger retraining pipeline or human review
Tools: Evidently AI, Arize AI, WhyLabs, Fiddler, Azure ML Data Drift, SageMaker Model Monitor.
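The statistical comparison against the training baseline can take many forms. One widely used drift score for a single numeric feature is the Population Stability Index (PSI), sketched below with the standard library only. The choice of PSI, the equal-width binning, and the 0.1 and 0.25 thresholds (a common rule of thumb, not a value from this document) are assumptions; the tools listed above offer other tests.

```python
import math
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `current` relative to `baseline`."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    width = ((hi - lo) / bins) or 1.0  # avoid zero width for constant features

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor each share at a small epsilon so the logarithm is defined.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def drift_action(score: float, review_at: float = 0.1, retrain_at: float = 0.25) -> str:
    """Map a drift score to an action using rule-of-thumb thresholds."""
    if score >= retrain_at:
        return "trigger_retraining"
    if score >= review_at:
        return "human_review"
    return "no_action"


# Hypothetical data: a uniform baseline and a copy shifted upwards.
base = [i / 100 for i in range(1000)]
shifted = [v + 3 for v in base]
print(drift_action(psi(base, base)), drift_action(psi(base, shifted)))
```

This detects data drift only. Concept drift and performance drift need ground-truth labels, which often arrive with a delay.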
MLOps maturity levels
| Level | Description |
|---|---|
| 0 — Manual | Models built in notebooks, deployed manually (if at all), no monitoring |
| 1 — Tracked | Experiments tracked, models versioned, basic CI/CD, some monitoring |
| 2 — Automated | Full CI/CD/CT pipelines, automated retraining, drift detection, model registry |
| 3 — Governed | Full lineage, fairness checks, explainability, regulatory audit trail, enterprise governance |
Most organisations are at Level 0 or 1. Level 2 is the target for teams with models in production. Level 3 is typically required in regulated industries.
MLOps aligned to ITIL 5 PSLM
| PSLM Activity | MLOps role |
|---|---|
| Discover | Problem framing, feasibility assessment, data availability review |
| Design | ML canvas, model architecture decisions, feature design, evaluation criteria |
| Acquire | Data acquisition, labelling contracts, compute procurement, platform licences |
| Build | Model training, experiment tracking, hyperparameter tuning, model evaluation |
| Transition | Model registry promotion, staging validation, canary/shadow deployment |
| Operate | Model serving infrastructure monitoring, latency and throughput tracking |
| Deliver | Batch scoring runs, API access for consuming applications |
| Support | Drift detection, incident response for model degradation, rollback |
Key metrics
| Metric | What it measures | Target (typical) |
|---|---|---|
| Model accuracy / F1 / AUC | Predictive performance on held-out test set | Baseline + regression test |
| Training pipeline success rate | % of training runs that complete without error | ≥ 99% |
| Deployment frequency | How often new model versions are released to production | Weekly or triggered |
| MTTR for model incidents | Time from drift detection to model restored or rolled back | < 4 hours |
| Data freshness | Age of the most recent training data snapshot | Depends on domain |
| Inference latency (p99) | 99th percentile response time for real-time endpoints | < 100 ms for real-time |
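As an example of how one of these metrics might be computed, the sketch below derives p99 latency from a list of response times using the nearest-rank method. The function name and the sample values are illustrative, and monitoring systems may use other percentile definitions (for example, interpolation or histogram estimates).

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-indexed rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p99 = percentile(latencies_ms, 99)
print(p99, "meets target" if p99 < 100 else "misses target")
```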