Abstract
We introduce a Python library, called slisemap, that implements a supervised dimensionality reduction method for the global explanation of black box regression or classification models. slisemap takes a data matrix and predictions from a black box model as input, and outputs a (typically) two-dimensional embedding such that the black box model can be approximated, with good fidelity, by the same interpretable white box model for points with similar embeddings. The library includes basic visualisation tools and extensive documentation, making it easy to get started and obtain useful insights. The slisemap library is published on GitHub and PyPI under an open source license.
Supported by the Academy of Finland (grants 320182 and 346376) and the Future Makers Program.
1 Introduction
In our recent manuscript [3] we introduce an algorithm, slisemap, that extends [1, 2] and combines manifold visualization (e.g., [6,7,8]) with local, model-agnostic explanations of regression or classification models (see [5] for a review). The idea of the latter is to find an interpretable white box surrogate model that locally approximates a complex black box model for a given data point.
slisemap produces a non-linear embedding of the data into d dimensions (typically \(d=2\)), such that data points projected nearby can, with good fidelity, be explained by the same white box model. Each data point has an embedding and an associated white box model. Together, the white box models and the visual embedding provide a global explanation of the black box model.
In this paper we describe a Python library, called slisemap, that implements the algorithm by the same name.
The slisemap library can be used by anyone who wants to explore datasets or obtain global explanations for complex black box models.
While there is a plethora of software for manifold embeddings or for local explanations, none combines the two.
2 Problem Definition
Formally, input to slisemap is given as a dataset of n points \((\textbf{x}_1,\textbf{y}_1),\ldots ,(\textbf{x}_n,\textbf{y}_n)\), where the covariates are given by real vectors \(\textbf{x}_i\in {\mathbb {R}}^m\) and the responses \(\textbf{y}_i=f(\textbf{x}_i)\in {\mathbb {R}}^p\), where \(f:{\mathbb {R}}^m\rightarrow {\mathbb {R}}^p\) is a pre-trained black box regression or classification model that we wish to explain. For regression problems \(p=1\) and for classification problems p is the number of classes, where \(\mathbf{{y}}_i\) represents the predicted class probabilities.
We also need a class of easy-to-understand white box surrogate models, \(g_i:{\mathbb {R}}^m\rightarrow {\mathbb {R}}^p\), used to approximate the black box model f in the neighbourhood (as defined by the embedding) of data point \(i\in \{1,\ldots ,n\}\). We collect the parameters of the white box models into a matrix \(\mathbf{B}\in {\mathbb {R}}^{n\times q}\) such that the ith row \(\mathbf{B}_{i\cdot }\) contains the parameters of the white box model \(g_i\). As \(g_i\) we use a simple linear model for regression problems and multinomial logistic regression for classification problems. Additionally, a loss function \(l:{\mathbb {R}}^p\times {\mathbb {R}}^p\rightarrow {\mathbb {R}}_{\ge 0}\) quantifies the mismatch between the black box and white box models; we use quadratic loss for regression problems and Hellinger loss (which is related to log-loss) for classification problems. Formally, the slisemap algorithm finds an embedding of a given radius by solving the following computational problem.
Problem 1
[3] Given the definitions above, regularization parameters \(\lambda_{lasso}\ge 0\) and \(\lambda_{ridge}\ge 0\), and the radius of the embedding \(z_{radius}>0\), find the parameters \(\mathbf{B}\in {\mathbb {R}}^{n\times q}\) and the embedding of the data points \(\mathbf{Z}\in {\mathbb {R}}^{n\times d}\) that minimise the loss \(\mathcal{L}= \sum _{i=1}^n{\sum _{j=1}^n{\mathbf{W}_{ij}\mathbf{L}_{ij}}} +\sum _{i=1}^n{\sum _{j=1}^q{\left( \lambda _{lasso}|\mathbf{B}_{ij}|+ \lambda _{ridge}\mathbf{B}_{ij}^2 \right) }}\), where \(\mathbf{L}_{ij}=l(g_i(\mathbf{x}_j),\mathbf{y}_j)\), \(\mathbf{W}_{ij}=e^{-\mathbf{D}_{ij}}/\sum _{k=1}^n{e^{-\mathbf{D}_{ik}}}\), and \(\mathbf{D}_{ij}=(\sum _{k=1}^d{(\mathbf{Z}_{ik}-\mathbf{Z}_{jk})^2})^{1/2}\), subject to the constraint \((\sum _{i=1}^n{\sum _{k=1}^d{\mathbf{Z}_{ik}^2/n}})^{1/2} = z_{radius}\).
This means that the local models are optimised using weights based on the distances between the data points in the embedding. Incompatible local models are thus pushed away from each other, while the constraint on the embedding size leads to interchangeable local models forming clusters.
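The weight computation in Problem 1 can be sketched in plain NumPy (a small illustration of the formula only, not the library's internal implementation, which uses PyTorch):

```python
import numpy as np

def embedding_weights(Z):
    """W_ij = exp(-D_ij) / sum_k exp(-D_ik), where D is the pairwise
    Euclidean distance matrix of the embedding Z (shape n x d)."""
    # Pairwise Euclidean distances between embedding points
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))
    E = np.exp(-D)
    # Row-normalise so that each row of W sums to one (a softmax over -D)
    return E / E.sum(axis=1, keepdims=True)

Z = np.random.default_rng(0).normal(size=(5, 2))
W = embedding_weights(Z)
```

Each row of W sums to one, so the loss for local model \(g_i\) is a weighted average over all data points, with nearby points in the embedding receiving the largest weights.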
We refer to [3] for a detailed summary of related work, description and analysis of the algorithm, as well as experimental validation.
3 The Slisemap Library
slisemap is implemented in Python using PyTorch for the optimisation, enabling automatic differentiation and optional GPU acceleration. However, the library also interfaces with standard NumPy. For the built-in visualisation, exploration, and diagnostics tools we use Seaborn.
The design goals of the library are flexibility, performance, and ease of use. This is accomplished through optional parameters, closures, and just-in-time compilation, while providing extensive documentation, sane defaults, and helpful warning messages.
The slisemap library is open source and available under an MIT license at https://github.com/edahelsinki/slisemap. The repository also includes a demonstration video and an extended version of the example discussed below in the form of a Jupyter notebook. The package can also be installed using pip install slisemap.
4 Usage Example
The autompg dataset [4] is a multivariate real-valued dataset with eight attributes describing the properties of 398 distinct cars (6 rows with missing values removed). The covariates are in a (normalised) NumPy array X that consists of seven ordinal attributes for each car. The response vector y contains the fuel consumption (miles per gallon), as estimated by a random forest regressor. Code 1 shows how we apply slisemap to this dataset.
We make the interpretation of the local models easier by clustering (using k-means) the local model coefficients (rows of matrix \(\mathbf{{B}}\)) and colour-code the embedding based on the cluster indices. Furthermore, we add some jitter (since some points are on top of each other), and show only the five most meaningful attributes.
The result is shown in Fig. 1.
We can now identify which attributes in a given cluster are the most important in getting the predictions correct. For example, model year is an important indicator of fuel economy for cluster 0, but it is less important in cluster 3. Further analysis of the clusters reveals that cluster 3 consists of mostly heavy, U.S.-made cars with poor fuel economy, where the weight is the primary determinant for fuel consumption. On the other hand, cluster 0 has primarily non-U.S. cars, which are, on average, newer and lighter. Here horsepower is also an important attribute in predicting fuel consumption.
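The clustering step used above can be sketched with scikit-learn; here `B` is a random stand-in for the local model coefficient matrix, which in practice would be extracted from the fitted slisemap solution:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for the local model coefficient matrix B (n x q)
rng = np.random.default_rng(0)
B = rng.normal(size=(100, 8))

# k-means on the rows of B groups data points whose local white box
# models have similar coefficients; the labels colour-code the embedding
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(B)
```

Clusters in coefficient space thus correspond to groups of data points that share (approximately) the same explanation.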
After optimising an embedding and finding local models with slisemap, it is possible to investigate, with a built-in command, how new data items would be projected onto the same embedding and what their local white box models would be. This is useful for faster embedding of large datasets (using subsampling) or to detect concept drift. Also, the same command can highlight alternative explanations (locations in the embedding) for existing data points.
Classification. To use slisemap for classification tasks we only have to replace the white box model (local_model in Table 1) with a classifier, such as logistic regression (included in the library). Alternatively, we can transform the predictions of a black box model from (0, 1) to \((-\infty ,\infty )\) with a logit transformation, \(y' = \log (y / (1 - y))\), and use linear regression for the approximation. A classification example on a larger dataset is also included in the GitHub repository (https://github.com/edahelsinki/slisemap).
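The logit transformation can be written as a small NumPy helper (the clipping constant `eps` is an illustrative guard against predictions of exactly 0 or 1):

```python
import numpy as np

def logit(y, eps=1e-9):
    """Map predicted probabilities to the real line: y' = log(y / (1 - y))."""
    y = np.clip(y, eps, 1 - eps)  # avoid log(0) and division by zero
    return np.log(y / (1 - y))

probs = np.array([0.1, 0.5, 0.9])
z = logit(probs)  # logit(0.5) == 0 and logit(0.9) == -logit(0.1)
```

After this transformation the class probabilities behave like an unbounded regression target, so a linear model can be used as the surrogate.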
References
Björklund, A., Henelius, A., Oikarinen, E., Kallonen, K., Puolamäki, K.: Sparse robust regression for explaining classifiers. In: Discovery Science, vol. 11828, pp. 351–366 (2019)
Björklund, A., Henelius, A., Oikarinen, E., Kallonen, K., Puolamäki, K.: Robust regression via error tolerance. Data Min. Knowl. Discov. 36, 781–810 (2022)
Björklund, A., Mäkelä, J., Puolamäki, K.: SLISEMAP: Supervised dimensionality reduction through local explanations. Mach. Learn. 112(1), 1–43 (2023). https://doi.org/10.1007/s10994-022-06261-1
Dua, D., Graff, C.: UCI machine learning repository (2017)
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of methods for explaining black box models. ACM Comput. Surv. 51(5), 1–42 (2019)
van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(86), 2579–2605 (2008)
McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 [cs, stat] (2020)
Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
Cite this paper
Björklund, A., Mäkelä, J., Puolamäki, K. (2023). SLISEMAP: Combining Supervised Dimensionality Reduction with Local Explanations. In: Amini, MR., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022. Lecture Notes in Computer Science(), vol 13718. Springer, Cham. https://doi.org/10.1007/978-3-031-26422-1_41
Print ISBN: 978-3-031-26421-4
Online ISBN: 978-3-031-26422-1