A comprehensive interactive exploration of Explainable AI — the explanation pipeline, 8-layer stack, explanation methods, SHAP/LIME, mechanistic interpretability, benchmarks, market data, and more.
~56 min read · Interactive Reference

Explainable AI follows a pipeline from raw input through to human-readable explanations. Click each step to learn more.
Explore how data flows through the explanation pipeline — from raw features to final human-understandable explanations.
┌────────────────────────────────────────────────────────────┐
│           EXPLAINABLE AI — EXPLANATION PIPELINE            │
│                                                            │
│  INPUT        MODEL         RAW OUTPUT      EXPLANATION    │
│  ─────        ─────         ──────────      ───────────    │
│                                                            │
│ Features ──► Trained AI ──► Prediction ──► "SHAP says:     │
│  (data)        Model       (class/score)    income was     │
│              (black box)                    most important │
│                  │                          feature; age   │
│                  │                          was second"    │
│                  ▼                                         │
│          EXPLANATION ENGINE                                │
│          ──────────────────                                │
│  SHAP | LIME | Grad-CAM | Attention                        │
│  Counterfactuals | Concept | Rules                         │
│                  │                                         │
│                  ▼                                         │
│  ┌────────────────────────────────┐                        │
│  │   HUMAN-READABLE EXPLANATION   │                        │
│  │  • Feature attribution scores  │                        │
│  │  • Visual saliency maps        │                        │
│  │  • Counterfactual scenarios    │                        │
│  │  • Rule-based summaries        │                        │
│  │  • Natural language reasoning  │                        │
│  └────────────────────────────────┘                        │
└────────────────────────────────────────────────────────────┘
| Step | What Happens |
|---|---|
| Input | Data instance (or dataset) is presented for explanation |
| Model Prediction | The AI model produces its output (classification, score, generation) |
| Explanation Method Selection | The appropriate XAI technique is chosen — depends on the model type, audience, and explanation need |
| Explanation Computation | The explanation engine analyses the model's behaviour — perturbing inputs, computing gradients, sampling local neighbourhoods, or extracting internal representations |
| Attribution / Reasoning | The technique produces attributions (feature importances), saliency maps, decision rules, counterfactuals, or concept-level reasoning |
| Rendering | The explanation is rendered in a human-understandable format — tables, charts, heatmaps, natural language |
| Human Evaluation | The end user (doctor, loan officer, auditor, data scientist) reviews the explanation |
| Parameter | What It Controls |
|---|---|
| Scope | Local (per-instance) vs. global (entire model) explanation |
| Audience | Technical (data scientist) vs. non-technical (business user, consumer) |
| Fidelity | How faithfully the explanation represents the model's actual reasoning |
| Granularity | Feature-level, concept-level, rule-level, or pixel-level attribution |
| Model Access | Black-box (only inputs/outputs) vs. white-box (access to model internals) |
| Perturbation Budget | Number of model evaluations allowed for perturbation-based methods (LIME, SHAP) |
| Background Data | Reference dataset used for computing Shapley values or baseline expectations |
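The pipeline above can be sketched end to end with a toy perturbation-based explainer. The model, feature names, baseline, and occlusion scheme here are all invented for illustration, not a real library API:

```python
# Minimal end-to-end sketch of the pipeline: a toy black box, a simple
# occlusion-based attribution step, and a rendered summary.

def toy_model(features):
    """The 'black box': a fixed linear scorer over three features."""
    income, age, debt = features
    return 0.5 * income + 0.3 * age - 0.2 * debt

def occlusion_attribution(model, x, baseline):
    """Attribute the prediction by replacing one feature at a time with
    its baseline value and measuring the change in output."""
    full = model(x)
    attrs = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline[i]
        attrs.append(full - model(occluded))
    return attrs

x = [0.9, 0.4, 0.1]           # income, age, debt (normalised)
baseline = [0.0, 0.0, 0.0]    # reference instance
attrs = occlusion_attribution(toy_model, x, baseline)

names = ["income", "age", "debt"]
ranked = sorted(zip(names, attrs), key=lambda p: -abs(p[1]))
print(f"most important: {ranked[0][0]}")  # income dominates this prediction
```

For an additive model like this one, the occlusion attributions sum exactly to the difference between the prediction and the baseline output; for real models that identity only holds approximately, which is part of what SHAP formalises.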
SHAP values are grounded in cooperative game theory — specifically the Shapley value, introduced by Lloyd Shapley in 1953.
The EU AI Act (2024) mandates explainability for all high-risk AI systems deployed in Europe.
LIME (Local Interpretable Model-agnostic Explanations) can explain any black-box model's individual predictions.
Test your understanding — select the best answer for each question.
Q1. What do SHAP values measure?
Q2. What is the difference between global and local explainability?
Q3. Which regulation mandates AI explainability for high-risk systems?
Explainable AI is organised into eight architectural layers. Click any layer to expand details.
| Layer | What It Covers |
|---|---|
| 1. Model Layer | The AI model being explained — architecture, weights, training data, objective |
| 2. Explanation Method | The XAI technique(s) applied — SHAP, LIME, Grad-CAM, counterfactuals, circuit analysis |
| 3. Attribution Engine | Core computation — Shapley values, gradient computation, perturbation sampling, activation extraction |
| 4. Aggregation & Summarisation | Combining per-instance explanations into global summaries; importance rankings; interaction analysis |
| 5. Visualisation & Rendering | Charts, heatmaps, saliency maps, interactive dashboards, natural language summaries |
| 6. Audience Adaptation | Tailoring explanations to the audience — data scientist vs. loan officer vs. patient vs. regulator |
| 7. Evaluation & Validation | Measuring explanation quality — fidelity, stability, human alignment, sufficiency |
| 8. Governance & Compliance | Explanation documentation, regulatory compliance (EU AI Act), audit trails, fairness assessments |
The principal sub-types of explainability, organised by scope, format, and access level.
| Type | Description | Example |
|---|---|---|
| Local Explanation | Explains a single prediction for one input instance | "This loan was denied primarily because income was below threshold" |
| Global Explanation | Explains the model's overall behaviour across the entire dataset | "The model relies primarily on income, credit score, and employment length" |
| Cohort Explanation | Explains model behaviour for a specific group or subpopulation | "For applicants in the 25–35 age range, the model weights education more heavily" |
| Format | Description | Best For |
|---|---|---|
| Feature Attribution | Numerical scores showing each feature's contribution to the prediction | Data scientists; tabular data |
| Saliency Maps / Heatmaps | Visual overlay showing which regions of an image influenced the prediction | Image classification; medical imaging |
| Counterfactual | "What would need to change for a different outcome?" | Business users; consumer explanations; fairness |
| Rules / Decision Logic | If-then rules or decision paths explaining the prediction | Regulatory compliance; auditors |
| Natural Language | Plain-English (or other language) explanation of the reasoning | Non-technical users; consumer-facing applications |
| Example-Based | "This prediction is similar to these training examples" | Intuitive understanding; prototype-based reasoning |
| Concept-Based | Attribution at the level of human-meaningful concepts, not raw features | High-level understanding; TCAV (Testing with Concept Activation Vectors) |
| Type | Access Needed | Examples |
|---|---|---|
| Black-Box | Only inputs and outputs | LIME, SHAP (KernelSHAP), counterfactuals, anchors |
| White-Box | Access to model internals (gradients, weights, activations) | Grad-CAM, integrated gradients, LRP, attention, circuit analysis |
| Inherent | Model is interpretable by design | Decision trees, linear models, EBMs, GAMs, rule lists |
The major families of explanation methods, spanning inherently interpretable models to mechanistic interpretability.
| Model | Interpretability | Strengths | Limitations |
|---|---|---|---|
| Linear Regression | Coefficients directly show feature importance and direction | Trivially interpretable; fast | Limited to linear relationships |
| Logistic Regression | Coefficients as log-odds; feature contributions are additive | Strong interpretability for classification | Cannot capture non-linearity |
| Decision Trees | If-then-else rules; can be visualised as a tree | Intuitive; supports categorical and numerical data | Unstable; prone to overfitting; accuracy limited |
| Rule Lists (CORELS, BRL) | Ordered list of if-then rules; human-readable | Very interpretable; optimised for accuracy-interpretability balance | Computationally expensive to optimise |
| GAMs (Generalised Additive Models) | Additive model: each feature contributes independently | Per-feature shape functions are visualisable; non-linear but additive | Cannot capture feature interactions |
| EBM (Explainable Boosting Machine) | GAM trained with boosting; includes pairwise feature interactions | State-of-the-art accuracy among interpretable models; fast | Predefined interaction terms; not as flexible as deep nets |
| Attention-Based Models | Attention weights indicate which input tokens/regions were "attended to" | Built-in importance signal | Attention ≠ explanation (debated; attention can be misleading) |
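As a small illustration of why additive models are considered inherently interpretable, a logistic regression's prediction decomposes exactly into per-feature log-odds contributions. The weights, bias, and feature values below are invented:

```python
import math

# Sketch: reading a logistic regression directly as its own explanation.

weights = {"income": 1.2, "credit_score": 0.8, "debt_ratio": -1.5}
bias = -0.4
x = {"income": 0.6, "credit_score": 0.9, "debt_ratio": 0.3}

# Each feature contributes additively to the log-odds:
contributions = {f: weights[f] * x[f] for f in weights}
logit = bias + sum(contributions.values())
prob = 1.0 / (1.0 + math.exp(-logit))

for f, c in sorted(contributions.items(), key=lambda p: -abs(p[1])):
    print(f"{f:>13}: {c:+.2f} log-odds")
print(f"P(approve) = {prob:.2f}")
```

No post-hoc method is needed here: the contributions *are* the model's reasoning, which is exactly the property that GAMs and EBMs generalise to non-linear shape functions.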
| Method | How It Works | Output |
|---|---|---|
| SHAP (SHapley Additive exPlanations) | Computes Shapley values from cooperative game theory — each feature's marginal contribution to the prediction | Per-feature contribution scores; sum to prediction difference from baseline |
| LIME (Local Interpretable Model-agnostic Explanations) | Perturbs input; trains a local linear model in the neighbourhood of the prediction | Per-feature weights in the local linear model |
| Partial Dependence Plots (PDP) | Shows the marginal effect of one or two features on the predicted outcome | 1D or 2D plots showing feature response |
| Individual Conditional Expectation (ICE) | Like PDP but shows the effect for each instance, not the average | Per-instance feature response curves |
| Accumulated Local Effects (ALE) | Like PDP but avoids correlation bias; calculates local effects | Unbiased feature effects |
| Counterfactual Explanations | "What is the smallest change to the input that would change the prediction?" | Minimal change scenario |
| Anchors | Identifies sufficient conditions (rule-based) for a prediction | If-then rules that "anchor" the prediction |
| Feature Interaction Detection | Methods like H-statistic, SHAP interactions | Pairwise and higher-order feature interaction strengths |
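A partial dependence curve needs nothing more than repeated model queries, which is why PDP is fully model-agnostic. The toy model, dataset, and grid below are invented for the sketch:

```python
# Partial dependence sketch: average the prediction over the dataset
# while forcing one feature to each grid value in turn.

def model(x):
    # Toy black box with an interaction between features 0 and 1
    return x[0] ** 2 + 0.5 * x[0] * x[1]

def partial_dependence(model, dataset, feature_idx, grid):
    """For each grid value, force the feature to that value in every
    instance and average the predictions (the PDP estimate)."""
    curve = []
    for v in grid:
        preds = []
        for row in dataset:
            modified = list(row)
            modified[feature_idx] = v
            preds.append(model(modified))
        curve.append(sum(preds) / len(preds))
    return curve

data = [[0.1, 1.0], [0.5, -1.0], [0.9, 0.0]]
curve = partial_dependence(model, data, 0, [0.0, 0.5, 1.0])
print(curve)  # the interaction averages out over this balanced dataset
```

Plotting per-row curves instead of the average gives ICE; replacing the global averaging with locally accumulated differences gives ALE, which avoids the correlation bias noted in the table.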
| Method | Applies To | How It Works |
|---|---|---|
| Grad-CAM | CNNs (image models) | Uses gradients of the target class flowing into the final convolutional layer to produce a saliency map |
| Integrated Gradients | Differentiable models (neural nets) | Accumulates gradients along a path from a baseline input to the actual input |
| DeepLIFT | Neural networks | Compares activations to a reference and assigns contribution scores |
| Layer-wise Relevance Propagation (LRP) | Neural networks | Backpropagates relevance scores from output to input through network layers |
| Attention Visualisation | Transformers | Visualises attention weight matrices across heads and layers |
| Probing Classifiers | LLMs/Transformers | Trains a simple classifier on intermediate representations to test what information is encoded |
| Activation Maximisation | Neural networks | Synthesises an input that maximally activates a neuron or class — revealing what the model "looks for" |
| Circuit Analysis | Transformers | Maps computational circuits (paths through attention heads and MLPs) that implement specific behaviours |
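Integrated gradients reduces to averaging gradients along a straight path from baseline to input. The sketch below uses a hand-written function with an analytic gradient in place of a real network's autograd; the function, baseline, and step count are invented:

```python
# Integrated gradients with a midpoint Riemann sum over the path.

def f(x):
    return x[0] ** 2 + 2.0 * x[1]

def grad_f(x):
    return [2.0 * x[0], 2.0]

def integrated_gradients(grad, x, baseline, steps=200):
    """Average the gradient along the straight path from the baseline to
    x (midpoint Riemann sum), then scale by (x - baseline) per feature."""
    n = len(x)
    avg = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(n):
            avg[i] += g[i] / steps
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg)]

attrs = integrated_gradients(grad_f, [3.0, 1.0], [0.0, 0.0])
print(attrs)  # completeness: attributions sum to f(x) - f(baseline) = 11
```

The completeness check in the final comment is the method's defining axiom, and it is also a cheap fidelity test when applying IG to a real network.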
The definitive toolkit for building explainable AI systems — from SHAP to cloud-native platforms.
| Tool | Creator | Key Capabilities |
|---|---|---|
| SHAP | Lundberg | Shapley values; KernelSHAP, TreeSHAP, DeepSHAP; theoretically grounded |
| LIME | Ribeiro et al. | Local perturbation-based; tabular, text, image |
| Captum | Meta / PyTorch | Comprehensive attribution; integrated gradients, SHAP, LRP |
| InterpretML | Microsoft | EBM + dashboard; glass-box models |
| Alibi | Seldon | Counterfactual explanations, anchors, trust scores |
| AIX360 | IBM | Prototypes, contrastive explanations, boolean rules |
| What-If Tool | Google | Interactive visual; TensorBoard; fairness |
| Responsible AI Toolbox | Microsoft | Error analysis, fairness, interpretation |
| SageMaker Clarify | AWS | Bias detection, SHAP-based attributions |
| Azure Responsible AI | Microsoft | Interpretability, fairness, error analysis |
| Library | Provider | Deployment | Highlights |
|---|---|---|---|
| SHAP | Open-source (Lundberg) | Open-Source (any OS; Python 3.8+; CPU; optional GPU for DeepSHAP) | The standard for Shapley-based explanations; KernelSHAP, TreeSHAP, DeepSHAP; Python |
| LIME | Open-source (Ribeiro et al.) | Open-Source (any OS; Python 3.8+; CPU-only) | Local explanations via perturbation; tabular, text, image support |
| Captum | Meta (open-source / PyTorch) | Open-Source (any OS; Python 3.8+; PyTorch; CPU or NVIDIA GPU) | Comprehensive attribution library for PyTorch models; integrated gradients, SHAP, LRP, GradCAM |
| InterpretML | Microsoft (open-source) | Open-Source (any OS; Python 3.8+; CPU-only) | EBM (Explainable Boosting Machine) + explanation dashboard; glass-box and black-box methods |
| Alibi | Seldon (open-source) | Open-Source (any OS; Python 3.8+; CPU; optional GPU for deep-learning explainers) | Counterfactual explanations, anchors, trust scores, integrated gradients |
| AI Explainability 360 (AIX360) | IBM (open-source) | Open-Source (any OS; Python 3.8+; CPU-only) | Comprehensive XAI toolkit; prototypes, contrastive explanations, boolean rules |
| tf-explain | Open-source (TensorFlow) | Open-Source (any OS; Python 3.8+; TensorFlow; CPU or GPU) | Grad-CAM, occlusion sensitivity, SmoothGrad for TensorFlow/Keras models |
| DiCE | Microsoft (open-source) | Open-Source (any OS; Python 3.8+; CPU-only) | Diverse Counterfactual Explanations for any ML model |
| TransformerLens | Neel Nanda (open-source) | Open-Source (any OS; Python 3.9+; PyTorch; NVIDIA GPU recommended for large models) | Mechanistic interpretability toolkit for Transformer models |
| SAE Lens | Joseph Bloom et al. (open-source) | Open-Source (any OS; Python 3.9+; PyTorch; NVIDIA GPU recommended) | Sparse autoencoder training and analysis for mechanistic interpretability |
| Tool | Provider | Deployment | Highlights |
|---|---|---|---|
| What-If Tool | Google (open-source) | Open-Source (any OS; Python 3.8+; TensorBoard integration; CPU-only) | Interactive visual tool for exploring model performance and fairness; integrates with TensorBoard |
| Responsible AI Toolbox | Microsoft (open-source) | Open-Source (any OS; Python 3.8+; CPU-only; integrates with Azure ML) | Error analysis, fairness assessment, model interpretation in a unified dashboard |
| Evidently | Open-source | Open-Source (any OS; Python 3.8+; CPU-only; Evidently Cloud SaaS on AWS available) | ML model monitoring with built-in explanation and drift detection |
| Neuron Viewer | Anthropic (public) | Cloud (Anthropic infrastructure) | Interactive visualisation of features discovered in LLMs via sparse autoencoders |
| LIT (Language Interpretability Tool) | Google (open-source) | Open-Source (any OS; Python 3.9+; CPU or GPU; browser-based UI) | Interactive analysis of NLP models; saliency, attention, counterfactual generation |
| Platform | Provider | Deployment | Highlights |
|---|---|---|---|
| Amazon SageMaker Clarify | AWS | Cloud (AWS — SageMaker on EC2; S3 for data) | Bias detection and feature attributions (SHAP-based) integrated into SageMaker |
| Google Cloud Vertex Explainable AI | Google Cloud | Cloud (GCP — Vertex AI on Compute Engine) | Feature attributions for Vertex AI models; supports tabular, image, and text |
| Azure Responsible AI | Microsoft | Cloud (Azure — Azure ML on Azure VMs) | Model interpretability, fairness, error analysis integrated into Azure ML |
| Fiddler AI | Fiddler | Cloud (Fiddler SaaS on AWS / GCP) | Model explainability, monitoring, and analytics platform |
| Arthur AI | Arthur | Cloud (Arthur SaaS on AWS) | Model monitoring with built-in explainability and bias detection |
Where Explainable AI delivers real-world value — from regulated finance to life-critical healthcare.
| Use Case | Description | Key Examples |
|---|---|---|
| Credit Decisions | Explain why a loan application was approved or denied; required by regulation | SHAP for feature importance; counterfactuals for "what would change the decision" |
| Fraud Detection | Explain why a transaction was flagged as fraudulent — enables analyst review | Feature attributions showing anomalous features |
| Algorithmic Trading | Explain trade signals to portfolio managers and compliance | Factor decomposition, decision tree surrogates |
| Insurance Underwriting | Explain policy pricing and risk assessment | EBM for interpretable pricing models |
| Regulatory Compliance | Model risk management (SR 11-7, OCC guidelines) requires model documentation and explanation | Model cards, SHAP reports, validation frameworks |
| Use Case | Description | Key Examples |
|---|---|---|
| Clinical Decision Support | Explain why an AI system recommended a diagnosis or treatment | Grad-CAM for radiology; SHAP for clinical risk models |
| Medical Imaging | Show which regions of a scan drove the AI's finding | Grad-CAM, saliency maps, attention heatmaps |
| Drug Discovery | Explain which molecular features predict bioactivity | SHAP on molecular descriptors; attention over molecular graphs |
| Patient Risk Stratification | Explain individual risk scores to clinicians | EBM models; SHAP waterfall plots |
| FDA/CE Clearance | Regulatory submissions increasingly require model explainability documentation | FDA AI/ML guidance; EU MDR requirements |
| Use Case | Description | Key Examples |
|---|---|---|
| Recidivism Prediction | Explain risk scores used in sentencing and parole decisions | Controversy around COMPAS; need for transparent models |
| Benefits Eligibility | Explain automated decisions on welfare, tax, and social services | GDPR / public sector requiring explanations for automated decisions |
| Policing | Explain predictive policing outputs and risk assessments | Transparency requirements; civil liberties concerns |
| Use Case | Description | Key Examples |
|---|---|---|
| Recommendation Systems | Explain why a product, video, or article was recommended | "Because you watched X", content-based explanations |
| Content Moderation | Explain why content was flagged or removed | Attribution of which elements violated policy |
| Search Ranking | Explain search result ordering to improve relevance and detect bias | Feature importance for ranking models |
| LLM Applications | Explain chatbot responses, summarisation, and generation decisions | CoT reasoning traces; retrieval attribution in RAG |
| Use Case | Description | Key Examples |
|---|---|---|
| Autonomous Vehicle Decisions | Explain perception and planning decisions for safety validation | Saliency maps for object detection; decision tree surrogates for planning |
| Quality Control | Explain defect detection decisions | Grad-CAM on inspection images |
| Predictive Maintenance | Explain equipment failure predictions to maintenance engineers | SHAP on sensor features; temporal attribution |
Quantitative measures of explanation quality and impact on human decision-making.
| Metric | What It Measures |
|---|---|
| Fidelity | How faithfully the explanation represents the model's actual reasoning — does the explanation predict the model's behaviour? |
| Stability / Robustness | Do similar inputs produce similar explanations? (Sensitive to small perturbations = unstable) |
| Sparsity / Conciseness | Does the explanation focus on a small number of key features? (Fewer features = easier to understand) |
| Sufficiency | Are the highlighted features enough to reproduce the prediction? (Mask non-highlighted features — does prediction hold?) |
| Necessity | Are the highlighted features necessary for the prediction? (Mask highlighted features — does prediction change?) |
| Plausibility | Do the explanations align with human intuition and domain knowledge? |
| Faithfulness | Does the explanation truly reflect the model's internal reasoning — or is it a plausible-but-unfaithful rationalisation? |
| Comprehensiveness | Are all important features captured — not just the top ones? |
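Sufficiency and necessity can both be estimated by masking features against a baseline and re-querying the model. The model, instance, baseline value, and choice of "top feature" below are illustrative only:

```python
# Illustrative sufficiency/necessity estimate via feature masking.

def model(x):
    return 0.7 * x[0] + 0.2 * x[1] + 0.1 * x[2]

def mask(x, keep, baseline=0.0):
    """Keep the listed feature indices; replace the rest with the baseline."""
    return [xi if i in keep else baseline for i, xi in enumerate(x)]

x = [1.0, 1.0, 1.0]
pred = model(x)
top = {0}       # the explanation highlights feature 0
rest = {1, 2}

# Sufficiency: how much of the prediction the highlighted features
# reproduce on their own.
sufficiency = model(mask(x, top)) / pred

# Necessity: how much of the prediction is lost when the highlighted
# features are masked out.
necessity = 1.0 - model(mask(x, rest)) / pred

print(f"sufficiency: {sufficiency:.2f}, necessity: {necessity:.2f}")
```

In practice the baseline choice (zeros, dataset mean, blurred image) strongly affects both scores, so it should be reported alongside the metric.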
| Metric | What It Measures |
|---|---|
| User Satisfaction | Do users find the explanation helpful and understandable? (Survey / Likert scale) |
| Task Performance | Do explanations improve human decision-making accuracy? (Measured via controlled experiments) |
| Trust Calibration | Do explanations help users correctly distinguish reliable vs. unreliable model predictions? |
| Cognitive Load | How much mental effort is required to understand the explanation? Lower = better |
| Actionability | Can users take meaningful action based on the explanation? (Especially for counterfactuals) |
| Metric | What It Measures |
|---|---|
| Explanation Time | Wall-clock time to generate an explanation (important for real-time applications) |
| Model Evaluations | Number of forward passes required (perturbation-based methods like LIME, KernelSHAP) |
| Memory Overhead | Additional memory required for explanation computation |
| Scalability | How explanation cost scales with input dimensionality and model size |
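The model-evaluation budget of a perturbation-based method is easy to measure by wrapping the model in a call counter. The explainer below is a toy stand-in, not LIME or KernelSHAP:

```python
import random

# Sketch: measuring forward-pass cost with a counting wrapper.

class CountingModel:
    """Wraps a prediction function and counts forward passes."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.fn(x)

def perturbation_explainer(model, x, n_samples=50):
    """Toy explainer: queries the model once on x and once per perturbation."""
    random.seed(0)
    base = model(x)
    deltas = []
    for _ in range(n_samples):
        noisy = [xi + random.gauss(0.0, 0.1) for xi in x]
        deltas.append(model(noisy) - base)
    return deltas

counted = CountingModel(lambda x: sum(x))
perturbation_explainer(counted, [1.0, 2.0], n_samples=50)
print(counted.calls)  # 51 forward passes: 1 original + 50 perturbations
```

The same wrapper works unchanged around any callable model, which makes it a convenient way to compare the budgets of different explainers on equal terms.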
XAI market size, adoption metrics, and projected growth through 2028.
| Metric | Value | Source / Notes |
|---|---|---|
| XAI Market Size (2024) | ~$6.2 billion | MarketsandMarkets; includes software, services, platforms |
| Projected XAI Market (2028) | ~$16.2 billion | ~27% CAGR; driven primarily by regulation |
| Organisations Using XAI (2024) | ~35% of organisations deploying AI have some XAI capability | Gartner; mostly in financial services and healthcare |
| Regulatory-Driven Adoption | ~60% of XAI adoption is directly driven by regulatory requirements | EY survey 2024 |
| XAI in Healthcare | ~45% of AI-enabled clinical tools include some form of explanation | FDA AI submissions increasingly require it |
| Trend | Description |
|---|---|
| EU AI Act Forcing Function | The EU AI Act's transparency requirements are the single largest driver of XAI investment |
| SHAP Dominance | SHAP is the most widely adopted explanation method in industry; used in ~65% of XAI deployments |
| Interpretable Models Resurgence | EBMs and GAMs gaining traction as alternatives to black-box + post-hoc explanation |
| LLM Explainability Demand | Rapidly growing demand for explaining LLM behaviour; mechanistic interpretability funding increasing |
| Integrated Platforms | Major cloud providers (AWS, Azure, GCP) now bundle XAI with their ML services |
| Explanation UX Maturation | Explanations becoming better integrated into end-user workflows (not just data science tools) |
| Mechanistic Interpretability Surge | Anthropic, DeepMind, and EleutherAI investing heavily in understanding model internals |
| Counterfactual Explanations Growing | Counterfactuals gaining popularity for consumer-facing and compliance use cases |
Key risks and open challenges in the Explainable AI ecosystem.
| Limitation | Description |
|---|---|
| Accuracy-Interpretability Tradeoff | Inherently interpretable models are often less accurate than complex ones; post-hoc methods approximate but may not perfectly capture the model's reasoning |
| Faithfulness Gap | Post-hoc explanations may not faithfully represent the model's actual internal process — they may be plausible but wrong |
| Explanation Gaming | Models can be designed to produce favourable explanations while hiding discriminatory behaviour ("fairwashing") |
| Cognitive Overload | Too much information (100 features, complex interactions) can overwhelm users rather than help them |
| Explanation Disagreement | Different XAI methods applied to the same model and instance often produce different explanations |
| Illusion of Understanding | Explanations can create false confidence — users may over-trust a model because it "explained" its reasoning |
| Scalability | Many XAI methods are computationally expensive for large models (LLMs, large ensembles) |
| No Ground-Truth | There is no objective "correct" explanation — evaluation ultimately requires human judgement |
| Failure | Description |
|---|---|
| Unstable Explanations | LIME and some SHAP variants can produce different explanations for the same input across runs |
| Misleading Saliency Maps | Grad-CAM can highlight irrelevant regions (e.g., background instead of the object) |
| Unfaithful CoT | LLMs can generate plausible reasoning that does not reflect their actual computation |
| Confirmation Bias | Users may selectively accept explanations that confirm their prior beliefs |
| Feature Attribution Leakage | Correlated features can cause attributions to shift between correlated features unpredictably |
| Adversarial Explanations | Attackers can craft inputs that produce misleading explanations while achieving a target prediction |
| Criterion | Why XAI Excels |
|---|---|
| High-Stakes Decisions | When AI decisions have significant consequences (healthcare, finance, justice) |
| Regulatory Requirements | When explainability is legally mandated (EU AI Act, GDPR, sector-specific rules) |
| Debugging & Development | When data scientists need to understand and improve model behaviour |
| Fairness Auditing | When organisations need to verify AI systems are not discriminating |
| User Trust | When end users need to understand and trust AI recommendations |
| Safety-Critical Systems | When understanding failure modes is essential for safe deployment |
Explore how this system type connects to others in the AI landscape:
Analytical AI · Predictive / Discriminative AI · Bayesian / Probabilistic AI · Cognitive / Neuro-Symbolic AI · Federated / Privacy-Preserving AI

Key terms and concepts in Explainable AI.
| Term | Definition |
|---|---|
| Activation Maximisation | Synthesising an input that maximally activates a specific neuron, revealing what feature it detects |
| ALE (Accumulated Local Effects) | An unbiased alternative to PDP that avoids correlation problems when plotting feature effects |
| Anchors | Sufficient conditions (if-then rules) that "anchor" a model's prediction in a local region |
| Attribution | The process of assigning credit or blame to input features for a model's prediction |
| Black-Box | A model whose internals are not accessible — only inputs and outputs can be observed |
| Captum | Meta's PyTorch-based library for model attribution and interpretability |
| Chain-of-Thought (CoT) | Prompting an LLM to generate step-by-step reasoning before producing an answer |
| Circuit Analysis | Reverse-engineering computational circuits in neural networks to understand information processing |
| Concept Activation Vector (CAV) | A direction in a model's activation space corresponding to a human-defined concept |
| Counterfactual Explanation | The smallest change to an input that would change the model's prediction |
| DeepLIFT | A neural network attribution method that compares activations to a reference input |
| EBM (Explainable Boosting Machine) | A glass-box model combining GAMs with gradient boosting; state-of-the-art interpretable model |
| Faithfulness | Whether an explanation truly reflects the model's actual internal reasoning process |
| Feature Attribution | A numerical score indicating how much each input feature contributed to a prediction |
| Fidelity | How accurately an explanation represents or predicts the model's behaviour |
| GAM (Generalised Additive Model) | A model where the prediction is a sum of individual feature effects, each modelled non-linearly |
| Glass-Box Model | A model that is inherently interpretable — its internal logic can be directly inspected |
| Grad-CAM | Gradient-weighted Class Activation Mapping — produces saliency maps for CNN predictions |
| Integrated Gradients | An attribution method that accumulates gradients along a path from a baseline to the input |
| Interpretability | The degree to which a model's behaviour can be understood by humans without additional tools |
| LIME | Local Interpretable Model-agnostic Explanations — local perturbation-based explanations |
| Logit Lens | A technique that projects intermediate Transformer hidden states to the vocabulary to trace token predictions |
| LRP (Layer-wise Relevance Propagation) | Backpropagates relevance scores from output to input through neural network layers |
| Mechanistic Interpretability | The research programme to reverse-engineer neural network computations at a mechanistic level |
| Model Card | A structured document describing a model's purpose, performance, limitations, and ethical considerations |
| PDP (Partial Dependence Plot) | A plot showing the marginal effect of one or two features on the model's predicted outcome |
| Post-Hoc Explanation | An explanation applied after a model is trained — it explains the model without changing it |
| Probing Classifier | A simple classifier trained on intermediate model representations to discover encoded information |
| Saliency Map | A visualisation showing which input regions (pixels, tokens) most influenced a model's prediction |
| SHAP | SHapley Additive exPlanations — theoretically grounded attribution method based on Shapley values |
| Shapley Value | From game theory: the unique fair allocation of a cooperative game's total payout among players |
| Sparse Autoencoder (SAE) | An autoencoder that learns sparse, interpretable features from dense neural network activations |
| Sufficiency | Whether the highlighted features are enough to reproduce the model's prediction |
| TCAV | Testing with Concept Activation Vectors — tests model sensitivity to human-defined concepts |
| Transparency | The overall property of an AI system being open, understandable, and inspectable |
| White-Box | A model whose internals (weights, activations, gradients) are fully accessible for inspection |
Animation infographics for Explainable AI (XAI): overview (2026) and full technology stack (Hardware → Compute → Data → Frameworks → Orchestration → Serving → Application, 2026).
Detailed reference content for regulation.
| Regulation | XAI Relevance |
|---|---|
| EU AI Act (2024) | High-risk AI systems must provide "sufficient transparency to enable users to interpret the system's output and use it appropriately"; requires technical documentation of model behaviour |
| GDPR (2018) — Article 22 | Data subjects have the right not to be subject to solely automated decisions; organisations must provide "meaningful information about the logic involved" |
| ECOA / Reg B (US) | Creditors must provide "specific reasons" for adverse credit actions — directly requires explainable credit models |
| SR 11-7 (US Fed) | Model Risk Management guidance — requires model documentation, validation, and explanation |
| FDA AI/ML Guidance (US) | AI in medical devices requires transparency about model behaviour and decision-making |
| EU MDR (Medical Devices) | Clinical decision support software must be transparent and understandable |
| UK FCA / PRA | Financial regulators require firms to explain algorithmic decisions and demonstrate model governance |
| Singapore MAS FEAT | Fairness, Ethics, Accountability, and Transparency framework for AI in financial services |
| Practice | Description |
|---|---|
| Model Cards | Structured documentation of model purpose, performance, limitations, and intended use (Mitchell et al., 2019) |
| Datasheets for Datasets | Documentation of dataset provenance, composition, collection process, and intended use (Gebru et al., 2021) |
| Explanation Logging | Store explanations alongside predictions for audit and compliance |
| Explanation Review Process | Human review of explanations for high-stakes decisions |
| Regular Explanation Audits | Periodic assessment of explanation fidelity, stability, and alignment with domain knowledge |
| Fairness-Explainability Integration | Use XAI to identify and mitigate bias; SHAP for protected attribute analysis |
| Tiered Explanations | Provide different explanation depths for different audiences (consumer, business, technical, regulatory) |
Detailed reference content for deep dives.
| Aspect | Detail |
|---|---|
| Foundation | Shapley values from cooperative game theory (Shapley, 1953); applied to ML by Lundberg & Lee (2017) |
| Core Idea | Each feature is a "player" in a cooperative game; the prediction is the "payout"; Shapley values assign each player a fair contribution |
| Mathematical Property | The only method satisfying local accuracy, missingness, and consistency — axiomatically the "fairest" attribution |
| Variants | KernelSHAP: model-agnostic, perturbation-based — TreeSHAP: exact for tree models (O(TLD²)) — DeepSHAP: DeepLIFT + Shapley for neural nets — FastSHAP: amortised SHAP via a learned explainer network |
| Global Explanations | Aggregate local SHAP values across the dataset: mean absolute SHAP = global feature importance |
| Interaction Values | SHAP interaction values decompose Shapley values into main and interaction effects |
| Strengths | Theoretically grounded; consistent; both local and global; works on any model |
| Limitations | Computationally expensive for large models (exponential in exact form; KernelSHAP is approximate); baseline choice matters; can be slow for real-time applications |
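The game-theoretic idea above can be made concrete. A minimal sketch computing exact Shapley values for a toy two-feature model by enumerating feature orderings; the model and baseline are invented for illustration, and real libraries approximate this because the exact computation is factorial in the number of features:

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values by averaging marginal contributions
    over all feature orderings. 'Absent' features take their
    baseline value; feasible only for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # feature i "joins the coalition"
            new = predict(current)
            phi[i] += new - prev       # its marginal contribution
            prev = new
    return [p / len(orderings) for p in phi]

# Toy model: prediction = 2*income + age + income*age interaction
model = lambda f: 2 * f[0] + f[1] + f[0] * f[1]
phi = shapley_values(model, x=[3.0, 2.0], baseline=[0.0, 0.0])
# Local accuracy: phi sums to f(x) - f(baseline) = 14 - 0
```

The interaction term is split fairly between the two features, which is exactly the "fair payout" property the table describes.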
LIME (Local Interpretable Model-agnostic Explanations)
| Aspect | Detail |
|---|---|
| Introduced | Ribeiro et al. (2016) |
| Core Idea | For a given prediction, perturb the input to generate a local neighbourhood; train a simple interpretable model (linear, decision tree) on the perturbation-prediction pairs; the interpretable model explains the local behaviour |
| Process | 1. Select instance → 2. Perturb input → 3. Get model predictions for perturbations → 4. Weight by proximity → 5. Fit sparse linear model → 6. Report coefficients as explanations |
| Strengths | Model-agnostic; intuitive; flexible — works on tabular, text, and image data |
| Limitations | Explanations can be unstable (different runs → different explanations); fidelity to the original model varies; neighbourhood definition is subjective |
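Steps 2-6 of the process above can be sketched with numpy. This is a toy implementation under simplifying assumptions (Gaussian perturbations, a closed-form weighted least-squares fit, no sparsity step), not the lime library's API:

```python
import numpy as np

def lime_explain(black_box, x, n_samples=500, kernel_width=1.0, seed=0):
    """Local linear surrogate around x. Returns coefficients of a
    proximity-weighted linear fit; feature selection is omitted.
    """
    rng = np.random.default_rng(seed)
    # 2. Perturb the input around x
    Z = x + rng.normal(scale=0.5, size=(n_samples, len(x)))
    # 3. Query the black box on the perturbations
    y = np.array([black_box(z) for z in Z])
    # 4. Weight samples by proximity to x (exponential kernel)
    d2 = np.sum((Z - x) ** 2, axis=1)
    w = np.exp(-d2 / kernel_width ** 2)
    # 5. Fit a weighted linear model (closed-form weighted least squares)
    A = np.hstack([Z, np.ones((n_samples, 1))])  # add intercept column
    W = np.diag(w)
    coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    # 6. Coefficients (excluding the intercept) are the local explanation
    return coef[:-1]

# Toy black box: locally, feature 0 matters three times more than feature 1
f = lambda z: 3.0 * z[0] + 1.0 * z[1]
coefs = lime_explain(f, x=np.array([1.0, 1.0]))
```

Because the toy black box is linear, the surrogate recovers its coefficients exactly; on a real model the fit only holds in the local neighbourhood, which is where the instability noted above comes from.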
Counterfactual Explanations
| Aspect | Detail |
|---|---|
| Core Idea | "What is the smallest change to the input that would result in a different prediction?" |
| Example | "Your loan was denied. If your income were £5,000 higher and you had no outstanding debts, the loan would have been approved." |
| Strengths | Actionable — tells users what to change; intuitive; does not require model internals |
| Limitations | Multiple valid counterfactuals may exist; some changes may be infeasible (cannot change age); need to constrain to plausible changes |
| Methods | Wachter et al. (2017); DiCE (Diverse Counterfactual Explanations, Microsoft); FACE (Feasible and Actionable Counterfactual Explanations, Poyiadzi et al., 2020) |
| Legal Relevance | GDPR's Right to Explanation has been interpreted to require counterfactual-style explanations |
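The "smallest change" question above can be sketched as a naive greedy search. The loan model, feature units, and step sizes are invented for illustration; real methods (DiCE, FACE) solve a constrained optimisation instead:

```python
def nearest_counterfactual(predict, x, feature_steps, max_iters=100):
    """Greedy search for a small change that flips the decision.

    feature_steps maps feature index -> allowed increment; immutable
    features (e.g. age) are simply left out, which enforces plausibility
    in the crudest possible way.
    """
    target = not predict(x)
    cf = list(x)
    changes = {}
    for _ in range(max_iters):
        if predict(cf) == target:
            return cf, changes
        for i, step in feature_steps.items():
            cf[i] += step
            changes[i] = changes.get(i, 0) + step
            if predict(cf) == target:
                return cf, changes
    return None, changes

# Toy loan model: approve if income minus debts >= 30 (units of £1,000)
approve = lambda f: f[0] - f[1] >= 30
cf, changes = nearest_counterfactual(
    approve, x=[27, 2], feature_steps={0: 1}  # income mutable, debts fixed
)
# -> income raised from 27 to 32 flips the decision
```

The returned `changes` dict is exactly the actionable message in the table's example: how much income would need to rise for approval.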
Grad-CAM (Gradient-weighted Class Activation Mapping)
| Aspect | Detail |
|---|---|
| Applies To | Convolutional Neural Networks (CNNs) for image tasks |
| Core Idea | Compute gradients of the target class score with respect to the feature maps of the last convolutional layer; global-average-pool the gradients per channel to obtain importance weights, then combine the weighted feature maps (followed by a ReLU) into a heatmap |
| Output | A coarse spatial heatmap highlighting the regions of the input image that were most important for the prediction |
| Variants | Grad-CAM++ (improved multi-object localisation), Score-CAM (gradient-free), Layer-CAM (finer spatial resolution) |
| Strengths | Fast; no retraining; easy to visualise; widely adopted in medical imaging |
| Limitations | Coarse resolution (limited to the last conv layer); may miss fine-grained features; class-specific only |
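The weighting scheme reduces to a short formula. This numpy sketch assumes the feature maps and class-score gradients have already been extracted from a CNN (in practice via framework hooks); the tensors below are random stand-ins:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from last-conv feature maps and class gradients.

    feature_maps: (K, H, W) channel activations A_k
    gradients:    (K, H, W) d(class score)/dA_k
    """
    # alpha_k: global-average-pooled gradient per channel
    alphas = gradients.mean(axis=(1, 2))             # (K,)
    # Weighted sum of feature maps, then ReLU (keep positive evidence)
    cam = np.einsum("k,khw->hw", alphas, feature_maps)
    cam = np.maximum(cam, 0)
    # Normalise to [0, 1] for visualisation
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Random stand-ins for real CNN activations and gradients
rng = np.random.default_rng(0)
maps = rng.random((8, 7, 7))
grads = rng.random((8, 7, 7))
heatmap = grad_cam(maps, grads)
```

The 7x7 output is then upsampled onto the input image, which is why the table lists coarse resolution as the main limitation.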
Integrated Gradients
| Aspect | Detail |
|---|---|
| Introduced | Sundararajan et al. (2017) |
| Core Idea | Accumulate gradients along a straight-line path from a baseline input (e.g., black image, zero vector) to the actual input |
| Mathematical Property | Satisfies sensitivity (if a feature changes the prediction, it gets non-zero attribution) and implementation invariance (same function → same attributions) |
| Strengths | Theoretically grounded; works on any differentiable model; no retraining |
| Limitations | Baseline choice affects results; path integration requires many steps (100+); can be noisy for high-dimensional inputs |
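The path integral above can be approximated with a simple Riemann sum. The toy model and its analytic gradient are invented for illustration; real implementations differentiate the network automatically:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=200):
    """Integrated Gradients via a midpoint Riemann sum along the
    straight-line path from baseline to x.

    grad_f returns the gradient of the model output w.r.t. its input.
    """
    x, baseline = np.asarray(x, float), np.asarray(baseline, float)
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy model f(x) = x0^2 + 2*x1, with analytic gradient [2*x0, 2]
f = lambda z: z[0] ** 2 + 2 * z[1]
grad_f = lambda z: np.array([2 * z[0], 2.0])
attr = integrated_gradients(grad_f, x=[3.0, 1.0], baseline=[0.0, 0.0])
# Completeness: attributions sum to f(x) - f(baseline) = 11 - 0
```

The completeness check (attributions summing to the prediction difference) is a cheap sanity test worth running whenever the baseline or step count changes.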
Mechanistic Interpretability
| Aspect | Detail |
|---|---|
| What It Is | A research approach aiming to reverse-engineer the computational mechanisms (circuits) inside neural networks — understanding not just what features matter but how the model processes information internally |
| Key Techniques | Activation patching, causal tracing, sparse autoencoders for feature discovery, circuit identification |
| Notable Work | Anthropic's "Towards Monosemanticity" (2023), "Scaling Monosemanticity" (2024); Neel Nanda's TransformerLens; Chris Olah's "Zoom In" |
| Goal | Move from "what features are important" to "how does the model compute its answer" — true understanding of model internals |
| Maturity | Research-stage; most work on small Transformers; scaling to production LLMs is an active frontier |
| Significance | If successful, mechanistic interpretability could provide definitive answers about model safety, bias, and behaviour — far beyond attribution methods |
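Of the techniques listed above, activation patching is the easiest to sketch: run the model on a clean and a corrupted input, splice one clean activation into the corrupted run, and measure how much of the clean behaviour is restored. The two-layer numpy "network" below is a deliberately tiny stand-in for a real Transformer:

```python
import numpy as np

def run(x, W1, W2, patch=None):
    """Two-layer toy network; optionally overwrite one hidden unit."""
    h = np.maximum(W1 @ x, 0)           # hidden activations (ReLU)
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value                  # patch in an activation
    return W2 @ h                       # scalar output

# Toy weights: hidden unit 0 carries the signal from input feature 0
W1 = np.array([[1.0, 0.0], [0.0, 1.0]])
W2 = np.array([1.0, 0.1])

clean, corrupted = np.array([5.0, 1.0]), np.array([0.0, 1.0])
clean_out = run(clean, W1, W2)
corrupt_out = run(corrupted, W1, W2)

# Patch hidden unit 0's clean activation into the corrupted run
clean_h0 = np.maximum(W1 @ clean, 0)[0]
patched_out = run(corrupted, W1, W2, patch=(0, clean_h0))

# Recovery near 1 means this unit mediates the behaviour causally
recovery = (patched_out - corrupt_out) / (clean_out - corrupt_out)
```

In real work the same recovery metric is swept over every layer and position to localise the circuit responsible for a behaviour.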
Concept-Based Explanations (TCAV)
| Aspect | Detail |
|---|---|
| What It Is | Explanations at the level of human-meaningful concepts (e.g., "stripes", "wings", "loop shape") rather than raw pixel or feature values |
| TCAV | Testing with Concept Activation Vectors (Kim et al., 2018) — tests how sensitive a model's predictions are to the presence of a human-defined concept |
| How It Works | Train a linear classifier to separate activations corresponding to a concept from random activations; use the classifier's direction as a "concept vector" |
| Strengths | Human-meaningful; bridges the gap between model internals and human understanding |
| Limitations | Requires labelled concept datasets; concept definitions can be ambiguous |
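The TCAV recipe above can be sketched end to end on synthetic activations. A difference-of-means direction stands in for the linear classifier used in TCAV proper, and all activations and gradients are invented for illustration:

```python
import numpy as np

def concept_vector(concept_acts, random_acts):
    """Direction separating concept activations from random ones
    (a stand-in for the linear classifier in Kim et al., 2018)."""
    v = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def tcav_score(grads, v):
    """Fraction of examples whose class-score gradient (taken w.r.t.
    the layer activations) points along the concept direction."""
    return float(np.mean(grads @ v > 0))

rng = np.random.default_rng(1)
# Synthetic activations: the concept lives along the first axis
concept = rng.normal(size=(50, 4)) + np.array([3.0, 0, 0, 0])
random_ = rng.normal(size=(50, 4))
v = concept_vector(concept, random_)

# Synthetic gradients: the class is sensitive to that same direction
grads = rng.normal(size=(50, 4)) + np.array([2.0, 0, 0, 0])
score = tcav_score(grads, v)
# score near 1 => the class is strongly sensitive to the concept
```

A score near 0.5 would indicate no systematic sensitivity; TCAV proper also compares against scores from random concept sets to rule out chance.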
Why LLMs Are Hard to Explain
| Challenge | Description |
|---|---|
| Scale | LLMs have billions of parameters; traditional attribution methods are computationally prohibitive |
| Generative Output | LLMs generate sequences, not single predictions — explaining why each token was generated is complex |
| Emergent Behaviour | Capabilities emerge at scale (in-context learning, reasoning) that are not present in smaller models |
| Multimodal Inputs | Foundation models increasingly handle text, images, and audio — explanation must span modalities |
| Prompt Sensitivity | Small changes in prompts can dramatically change outputs — explanations must account for prompt influence |
| Black-Box API Access | Many LLMs are available only via API — no access to weights, gradients, or activations |
Explanation Approaches for LLMs
| Approach | Description |
|---|---|
| Chain-of-Thought (CoT) | Prompting the model to "explain its reasoning" step-by-step — generates a reasoning trace before the answer |
| Self-Consistency | Sampling multiple CoT reasoning paths and checking agreement — higher consistency suggests more reliable reasoning |
| Attention Visualisation | Visualising attention patterns across heads and layers; useful for understanding token dependencies |
| Probing | Training simple classifiers on intermediate representations to discover what information is encoded at each layer |
| Mechanistic Interpretability | Reverse-engineering circuits and features inside Transformer models (Anthropic, EleutherAI, DeepMind) |
| Logit Lens / Tuned Lens | Projecting intermediate hidden states to the vocabulary to trace how the model's "opinion" evolves layer by layer |
| Sparse Autoencoders (SAEs) | Decomposing neuron activations into interpretable features using sparse dictionaries — Anthropic's monosemanticity research |
| Retrieval Attribution | For RAG systems: showing which retrieved documents influenced the response |
| Faithfulness Evaluation | Testing whether CoT explanations actually reflect the model's computation — or are post-hoc rationalisations |
Critical Caveat: CoT explanations may not be faithful. Recent research (Turpin et al., 2023; Lanham et al., 2023) shows that LLMs' self-explanations often do not accurately reflect the true factors driving their predictions. Chain-of-thought is a useful tool but should not be treated as ground-truth explanation.
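Of the approaches in the table, self-consistency is simple enough to sketch end to end: sample several reasoning traces, extract the final answers, and take a majority vote, reporting the agreement rate as a rough reliability signal. The sampled answers below are hypothetical stand-ins for real CoT outputs:

```python
from collections import Counter

def self_consistent_answer(samples):
    """Majority vote over final answers extracted from independently
    sampled chain-of-thought traces. Returns (answer, agreement)."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Hypothetical final answers from five sampled reasoning traces
answer, agreement = self_consistent_answer(["42", "42", "41", "42", "42"])
```

High agreement raises confidence in the answer, but, per the caveat above, it says nothing about whether the reasoning traces faithfully describe the model's computation.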
Detailed reference content for overview.
Explainable AI (XAI) is the set of methods, techniques, and design principles that enable humans to understand why an AI system made a particular decision, how it arrived at that decision, and what factors influenced it. XAI bridges the gap between the predictive power of complex AI models and the human need for transparency, trust, and accountability.
The fundamental tension in modern AI is the accuracy-interpretability trade-off: the most accurate models (deep neural networks, large ensembles, LLMs) are often the least interpretable, while the most interpretable models (linear regression, decision trees) are often less powerful. XAI exists to resolve this tension — either by designing inherently interpretable models or by building post-hoc explanation tools around opaque ones.
XAI is not a standalone AI type — like privacy-preserving AI, it is a cross-cutting discipline that applies to any AI system. You can explain a predictive model, a generative model, a recommender system, or an autonomous agent. The techniques differ, but the goal is the same: make AI understandable to humans.
| Dimension | Detail |
|---|---|
| Core Capability | Makes AI decisions understandable, interpretable, and transparent to humans |
| How It Works | Inherently interpretable models, post-hoc explanation methods (SHAP, LIME, attention, gradients), counterfactual analysis, concept-based explanations |
| What It Produces | Feature importance scores, attribution maps, counterfactual explanations, concept-level reasoning, natural language explanations |
| Key Differentiator | Does not replace AI models — augments them with human-understandable explanations and accountability |
| AI Type | What It Does | Example |
|---|---|---|
| Explainable AI (XAI) | Makes AI decisions understandable to humans | SHAP explaining a loan denial |
| Agentic AI | Pursues goals autonomously with tools, memory, and planning | Research agent, coding agent |
| Analytical AI | Extracts insights from data | Anomaly detector, clustering |
| Autonomous AI (Non-Agentic) | Operates independently within fixed boundaries without human input | Autopilot, auto-scaling, algorithmic trading |
| Bayesian / Probabilistic AI | Reasons under uncertainty using probability distributions | Clinical trial analysis, A/B testing, risk modelling |
| Cognitive / Neuro-Symbolic AI | Combines neural learning with symbolic reasoning | LLM + knowledge graph, physics-informed neural net |
| Conversational AI | Manages multi-turn dialogue between humans and machines | Customer service chatbot, voice assistant |
| Evolutionary / Genetic AI | Optimises solutions through population-based search inspired by natural selection | Neural architecture search, logistics scheduling |
| Generative AI | Creates new content from learned patterns | LLM, image generator |
| Multimodal Perception AI | Fuses vision, language, audio, and other modalities | GPT-4o processing image + text, AV sensor fusion |
| Optimisation / Operations Research AI | Finds optimal solutions to constrained mathematical problems | Vehicle routing, supply chain planning, scheduling |
| Physical / Embodied AI | Acts in the physical world through sensors and actuators | Autonomous vehicle, robot arm, drone |
| Predictive / Discriminative AI | Classifies or forecasts from data | Fraud detection model |
| Privacy-Preserving AI | Trains and runs AI without exposing raw data | Federated learning, differential privacy |
| Reactive AI | Responds to current input with no memory or learning | Thermostat, ABS braking system |
| Recommendation / Retrieval AI | Surfaces relevant items from large catalogues based on user signals | Netflix suggestions, Google Search, Spotify playlists |
| Reinforcement Learning AI | Learns optimal behaviour from reward signals via trial and error | AlphaGo, robotic locomotion, RLHF |
| Scientific / Simulation AI | Solves scientific problems and models physical systems | AlphaFold, climate simulation, molecular dynamics |
| Symbolic / Rule-Based AI | Reasons over explicit rules and knowledge to derive conclusions | Medical expert system, legal reasoning engine |
Key Distinction: XAI Is a Lens, Not a Model. XAI is not a type of AI model — it is a set of techniques applied to AI models. Any model can be made more explainable; the question is how much interpretability is needed and what explanations are appropriate.
Key Distinction: Interpretability vs. Explainability. Interpretability means a model is inherently understandable by design (a decision tree, a linear model). Explainability means post-hoc techniques are applied to make an opaque model understandable. XAI encompasses both.
Key Distinction from Debugging. XAI overlaps with model debugging but extends beyond it. Debugging asks "why is the model wrong?" XAI asks "why did the model make this decision?" — for both correct and incorrect predictions.