XAI: A Growing Curriculum

The field of explainable AI (XAI) is expanding rapidly, and the course material grows alongside it. Beyond the foundational methods covered above, the curriculum is being extended to newer techniques and to the theoretical connections between them: from gradient-based attribution methods and concept-based explanations to mechanistic interpretability and the formal study of what it means for a model to be understood. The image below gives an overview of the current topic coverage.

Overview of XAI course topics

LIME: Local interpretable model-agnostic explanations

LIME explains an individual prediction by approximating the behaviour of a complex black-box model, in the neighbourhood of that prediction, with an interpretable surrogate model. It is local in the sense that explanations are provided on a per-instance basis, and model-agnostic since it only needs the model's predictions and can therefore be applied to any model. The idea is simple: although the global decision boundary might be complex, locally it can often be approximated by a simpler (possibly linear) boundary, and a model of this type has readily interpretable weights. To fit such a surrogate, LIME generates an artificial dataset by perturbing the instance in question. Because it only requires an interpretable representation of the input, LIME is one of the few methods that can be applied across data modalities (tabular data, text and images).

Procedure: Take a single data instance $x \in X$ that you want to explain and map it into an interpretable domain, $x \to x^\prime$. Here $x^\prime \in \{0,1\}^{d^\prime}$ is a binary vector whose entries indicate whether each interpretable quantum is present or missing (the idea of missingness is central to XAI methods). Obtain a binary dataset $\{z^\prime\}$ by randomly sampling from $\{0,1\}^{d^\prime}$, i.e., by switching off random subsets of the quanta present in $x^\prime$, and map each sample back into the original data representation to give $\{z\}$. Query the black-box model $f_\theta: z \mapsto y$ on these samples to generate a labelled dataset $\{z, f_\theta(z)\}$. Then search for an interpretable model $g \in G$ acting on the interpretable domain $\{0,1\}^{d^\prime}$ whose predictions closely match those of the black-box model in the neighbourhood of $x$ (high local fidelity).

The training objective is highly flexible: $$ \xi(x)=\underset{g \in G}{\operatorname{argmin}} ~\mathcal{L}\left(f, g, \pi_x\right)+\Omega(g). $$ Here fidelity is measured by a loss function $\mathcal{L}$ that is made local by incorporating a proximity measure $\pi_x(z)$, while $\Omega(g)$ penalizes model complexity and is often set by hand (e.g., by limiting the number of non-zero weights of $g$).

A concrete choice for the loss is the locally weighted squared loss $$ \mathcal{L}\left(f, g, \pi_x\right)=\sum_{z, z^{\prime} \in \mathcal{Z}} \underbrace{\exp \left(-\frac{D(x, z)^2}{\sigma^2}\right)}_{\pi_x(z)}\left[f_\theta(z)-g_\phi\left(z^{\prime}\right)\right]^2, $$ where $D$ is some distance function (e.g., cosine distance for text, L2 distance for images) and the proximity measure $\pi_x(z)$ is an exponential kernel of width $\sigma$ that places more weight on samples similar to $x$, i.e., on small perturbations. This weighting is critical: the random sampling is done in an uncontrolled manner, so many samples lie far from $x$ and would otherwise dominate the fit.
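The procedure can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the reference LIME implementation: the surrogate $g$ is a linear model with an intercept, fitted by weighted least squares (the tiny ridge term stands in for $\Omega(g)$); a "missing" quantum is mapped back to a hypothetical baseline value of 0; and the distance $D$ is measured in the interpretable space as the fraction of quanta switched off, rather than the cosine or L2 distances mentioned above. The function names `explain` and `weighted_least_squares` and the black box are hypothetical.

```python
import math
import random


def weighted_least_squares(X, y, w, ridge=1e-6):
    """Solve argmin_b sum_i w_i (y_i - X_i . b)^2 + ridge * |b|^2 via the normal equations."""
    d = len(X[0])
    # Build A = X^T W X + ridge * I and c = X^T W y.
    A = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
    c = [0.0] * d
    for xi, yi, wi in zip(X, y, w):
        for i in range(d):
            c[i] += wi * xi[i] * yi
            for j in range(d):
                A[i][j] += wi * xi[i] * xi[j]
    # Gaussian elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, d):
            factor = A[r][col] / A[col][col]
            for k in range(col, d):
                A[r][k] -= factor * A[col][k]
            c[r] -= factor * c[col]
    # Back substitution.
    b = [0.0] * d
    for i in reversed(range(d)):
        b[i] = (c[i] - sum(A[i][k] * b[k] for k in range(i + 1, d))) / A[i][i]
    return b


def explain(f, x, n_samples=2000, sigma=0.75, seed=0):
    """Fit a local linear surrogate g(z') = b0 + sum_j b_j z'_j around instance x.

    Assumptions (hypothetical choices for illustration):
    - x' is the all-ones vector, i.e. every feature of x is "present";
    - a missing feature is replaced by the baseline value 0.0;
    - D is the fraction of features switched off in z'.
    """
    rng = random.Random(seed)
    d = len(x)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z_prime = [rng.randint(0, 1) for _ in range(d)]  # flat sampling over {0,1}^d'
        z = [xi if keep else 0.0 for xi, keep in zip(x, z_prime)]  # map back to the original space
        dist = 1.0 - sum(z_prime) / d  # hypothetical distance in the interpretable space
        weights.append(math.exp(-dist ** 2 / sigma ** 2))  # proximity kernel pi_x(z)
        rows.append([1.0] + z_prime)  # leading 1.0 is the intercept column
        targets.append(f(z))  # query the black box
    coef = weighted_least_squares(rows, targets, weights)
    return coef[0], coef[1:]


if __name__ == "__main__":
    def black_box(z):
        # Hypothetical black box; being linear, the surrogate should recover it.
        return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[2]

    intercept, w = explain(black_box, [1.0, 2.0, 3.0])
    print(round(intercept, 3), [round(v, 3) for v in w])
```

Because the black box in the usage example is linear, the surrogate weights should come out close to the black-box coefficients scaled by the feature values of $x$ (here about 2, -2 and 1.5). For a non-linear black box the weights are only a local, kernel-dependent summary.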

Although LIME is applicable to every model, it suffers from several drawbacks:
1) The choice of interpretable representation affects the quality of the explanation.
2) The complexity of the surrogate model (number of parameters $k=|\phi|$) is a hyperparameter that has to be tuned; it embodies the trade-off between fidelity and explainability.
3) The width $\sigma$ of the proximity measure is another hyperparameter and possibly the weakest feature of LIME: for points in complex regions of the input space, changing the width can flip the explanation.
4) Sampling from the binary representation is done with a flat distribution that ignores dependencies between features, so the perturbed dataset may not faithfully represent the actual data distribution, leading to poorly estimated local decision boundaries.
5) Decision boundaries might not be smooth manifolds and therefore cannot always be represented locally by a simple model.

The LIME quanta for images are superpixels, which are generated by standard segmentation algorithms that partition the image into a "jigsaw" of patches with similar visual qualities. For text, LIME uses a bag-of-words representation in which the quanta are individual words. For tabular data, it is not necessary to map instances into an interpretable representation; one can simply perturb each feature according to the distribution of values for that feature.
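Drawback 3 and the tabular perturbation scheme can be illustrated together with a small, hypothetical one-dimensional example. The sketch below assumes a black box $f(t) = t^3 - 3t$ and a feature distribution $\mathcal{N}(0, 2^2)$ from which perturbations are drawn; both are illustrative choices, not taken from any real dataset. Around $x = 0.5$ the true local slope is $f^\prime(0.5) = -2.25$, so a narrow kernel should yield a negative surrogate weight, whereas a very wide kernel effectively fits a global line dominated by the cubic term and yields a positive weight: the explanation flips sign purely because of $\sigma$.

```python
import math
import random


def black_box(t):
    # Hypothetical 1-D black box whose local and global trends disagree.
    return t ** 3 - 3.0 * t


def local_slope(f, x, sigma, n_samples=5000, feature_std=2.0, seed=0):
    """Fit a weighted linear surrogate g(t) = a + b * t around x and return b.

    Perturbations are drawn from an assumed feature distribution N(0, feature_std^2),
    mimicking tabular LIME; each sample is weighted by exp(-(t - x)^2 / sigma^2).
    """
    rng = random.Random(seed)
    ts = [rng.gauss(0.0, feature_std) for _ in range(n_samples)]
    ys = [f(t) for t in ts]
    ws = [math.exp(-((t - x) ** 2) / sigma ** 2) for t in ts]  # proximity kernel pi_x
    sw = sum(ws)
    # Closed-form weighted least squares for a single feature.
    mt = sum(w * t for w, t in zip(ws, ts)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    cov = sum(w * (t - mt) * (y - my) for w, t, y in zip(ws, ts, ys))
    var = sum(w * (t - mt) ** 2 for w, t in zip(ws, ts))
    return cov / var


if __name__ == "__main__":
    for sigma in (0.1, 1.0, 100.0):
        print(sigma, round(local_slope(black_box, 0.5, sigma), 2))
```

Running the loop over several widths shows how the surrogate weight drifts from the local derivative towards the global trend as $\sigma$ grows, which is why the kernel width should be reported, and ideally varied, alongside any LIME explanation.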